Extractor API ... 


This document presents the API (Application Program Interface) for the Extractor DLL (Dynamically Linked Library). This API is designed to allow Extractor to be easily embedded in experimental or commercial software products.

The following table lists the functions that developers can call in their code. The functions are listed in (approximately) the order in which they would usually be called. The demo package contains some sample code, test_api.c, that illustrates how the API can be used. The API is designed for flexibility; it can be used in many different ways, depending on the intended applications.

This API allows several documents to be processed simultaneously, using separate threads for each document. This is useful, for example, when processing web pages.

Number   API Function               Type                              Dependencies
1        ExtrCreateDocumentMemory   Required                          None
2        ExtrCreateStopMemory       Required                          None
3        ExtrActivateHighlights     Required for Highlights           1
4        ExtrActivateHTMLFilter     Optional                          1, 2
5        ExtrActivateEmailFilter    Optional                          1, 2
6 (New)  ExtrDeactivateTextFilter   Optional                          1
7        ExtrSetInputCode           Required for Japanese and Korean  1
8        ExtrSetOutputCode          Required for Japanese and Korean  1
9        ExtrSetDocumentLanguage    Required for Japanese and Korean  1
10       ExtrSetNumberPhrases       Optional                          1
11       ExtrSetHighlightType       Optional                          1
12       ExtrAddStopWord            Optional                          2
13       ExtrAddStopPhrase          Optional                          2
14       ExtrAddGoPhrase            Optional                          2
15       ExtrReadDocumentBuffer     Required                          1, 2, 3, ..., 14
16       ExtrSignalDocumentEnd      Required                          1, 2, 3, ..., 14, 15
17       ExtrGetPhraseListSize      Required for Keyphrases           1, 2, 3, ..., 14, 15, 16
18       ExtrGetPhraseByIndex       Required for Keyphrases           1, 2, 3, ..., 14, 15, 16, 17
19       ExtrGetScoreByIndex        Optional                          1, 2, 3, ..., 14, 15, 16, 17
20       ExtrGetDocumentLanguage    Optional                          1, 2, 3, ..., 14, 15, 16
21       ExtrGetHighlightListSize   Required for Highlights           1, 2, 3, ..., 14, 15, 16
22       ExtrGetHighlightByIndex    Required for Highlights           1, 2, 3, ..., 14, 15, 16, 21
23       ExtrGetDocumentProperties  Optional                          1, 2, 3, ..., 14, 15, 16
24       ExtrGetErrorMessage        Optional                          None
25       ExtrClearDocumentMemory    Required                          1
26       ExtrClearStopMemory        Required                          2

API Function: Each function named in this column has its own section in this document, with a description of how the function is used.

Type: The Required functions are functions that must be called for Extractor to work properly. The Optional functions are functions that can be called to override the default settings of Extractor or to get additional information from Extractor. The Required for Japanese functions are required for processing Japanese text, but optional for other languages. The Required for Korean functions are required for processing Korean text, but optional for other languages. The Required for Keyphrases functions are required if you wish to extract key phrases from the text, but optional otherwise. The Required for Highlights functions are required if you wish to extract key sentences from the text, but optional otherwise.

Dependencies: This column of the table shows the order in which functions must be called. For example, function number 12, ExtrAddStopWord, should only be called after function number 2, ExtrCreateStopMemory, has been called. As another example, function number 18, ExtrGetPhraseByIndex, depends on the optional function number 8, ExtrSetOutputCode. This means, although it is not necessary to call function number 8, if you do intend to call function number 8, then you must call it before you call function number 18. For efficiency reasons, Extractor does not verify that the functions are called in the correct order. The programmer is responsible for ensuring that the dependencies are observed in the program that invokes Extractor.


ExtrCreateDocumentMemory

Function header declaration:

int ExtrCreateDocumentMemory(void **DocumentMemory);

Input and output function arguments:

DocumentMemory: output

Example of usage:

void *DocumentMemory;
int ErrorCode;

ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);

Description:

This function creates a block of memory for storing data about a single document. It returns a pointer value in DocumentMemory that is a unique identifier for this block of memory. This pointer is later passed to any other functions that process the given document.

A document is processed as a sequence of memory blocks, by calling ExtrReadDocumentBuffer. A typical document will involve multiple calls to ExtrReadDocumentBuffer. Each call updates the state of the memory that is reserved for processing the given document, DocumentMemory.

In a typical application with multiple threads, there will be a one-to-one relationship between threads and DocumentMemory values, and also between DocumentMemory values and individual documents. On the other hand, threads may share StopMemory values, depending on whether it makes sense to use the same stop words and stop phrases for all of the documents that are currently being processed.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.
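
The usage examples in this document assign each return value to ErrorCode without testing it. In a real program it is usually safer to stop at the first non-zero code. The sketch below shows one possible pattern; the StepFn type and run_steps function are hypothetical, and step_ok and step_fail are stand-ins for real API calls (the value 7 is an arbitrary made-up error code).

```c
#include <assert.h>
#include <stddef.h>

/* A step is any call that returns an Extractor-style error code
   (zero for success). This typedef is hypothetical. */
typedef int (*StepFn)(void *context);

/* Run the steps in order. Stop at the first non-zero error code and
   return it; *failed_index (if not NULL) receives the index of the
   failing step. Returns zero if every step succeeded. */
int run_steps(const StepFn *steps, size_t count, void *context,
              size_t *failed_index)
{
    size_t i;
    assert(steps != NULL || count == 0);
    for (i = 0; i < count; i++) {
        int error_code = steps[i](context);
        if (error_code != 0) {
            if (failed_index != NULL) {
                *failed_index = i;
            }
            return error_code;
        }
    }
    return 0;
}

/* Stand-ins for real API calls, used only to exercise run_steps. */
int step_ok(void *context)   { (void)context; return 0; }
int step_fail(void *context) { (void)context; return 7; }
```

A non-zero result from run_steps is the code that would then be passed on for explanation, as described above.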


ExtrCreateStopMemory

Function header declaration:

int ExtrCreateStopMemory(void **StopMemory);

Input and output function arguments:

StopMemory: output

Example of usage:

void *StopMemory;
int ErrorCode;

ErrorCode = ExtrCreateStopMemory(&StopMemory);

Description:

This function creates a block of memory for storing stop words and stop phrases. It returns a pointer value in StopMemory that is a unique identifier for this block of memory. This pointer is later passed to any other functions that use the stop words or stop phrases.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

A stop word is a word that is not allowed in a keyphrase. For example, "the" is a stop word. A stop phrase is a phrase that is not allowed as a keyphrase. The distinction between a stop word and a single-word stop phrase is that a keyphrase will be rejected if it contains a given stop word, but it will only be rejected if it exactly matches a given stop phrase. For example, if "access" is a stop word, then the phrase "information access" will be rejected. If "access" is a stop phrase, then the phrase "information access" is acceptable, although the single-word phrase "access" will be rejected.

Calling ExtrCreateStopMemory will initialize the stop word list with some standard stop words (including "the", for example). The standard list may be extended by calling ExtrAddStopWord or ExtrAddStopPhrase.
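
The difference between a stop word and a stop phrase can be illustrated with two small predicates. This is only a sketch of the rule as described above, not Extractor's internal code: the function names are hypothetical, and it assumes that words in a candidate phrase are separated by single spaces and compared exactly.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if `candidate` contains `stop_word` as a whole
   space-delimited word, 0 otherwise. */
int contains_stop_word(const char *candidate, const char *stop_word)
{
    size_t len;
    const char *p;
    assert(candidate != NULL && stop_word != NULL);
    len = strlen(stop_word);
    if (len == 0) {
        return 0;
    }
    p = candidate;
    while ((p = strstr(p, stop_word)) != NULL) {
        int starts_word = (p == candidate) || (p[-1] == ' ');
        int ends_word = (p[len] == '\0') || (p[len] == ' ');
        if (starts_word && ends_word) {
            return 1;
        }
        p++;   /* keep searching after a partial-word hit */
    }
    return 0;
}

/* Return 1 only if `candidate` is exactly `stop_phrase`. */
int matches_stop_phrase(const char *candidate, const char *stop_phrase)
{
    assert(candidate != NULL && stop_phrase != NULL);
    return strcmp(candidate, stop_phrase) == 0;
}
```

With "access" as a stop word, contains_stop_word("information access", "access") is 1, so the candidate is rejected. With "access" as a stop phrase, matches_stop_phrase("information access", "access") is 0, so the candidate is kept.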


ExtrActivateHighlights

Function header declaration:

int ExtrActivateHighlights(void *DocumentMemory);

Input and output function arguments:

DocumentMemory: input

Example of usage:

void *DocumentMemory;
int ErrorCode;

ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
ErrorCode = ExtrActivateHighlights(DocumentMemory);

Description:

A highlight is a key sentence. This function activates the highlight extraction feature for DocumentMemory. By default, it is assumed that the user does not want highlight extraction. ExtrActivateHighlights should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read. The main result of calling ExtrActivateHighlights is that the functions ExtrGetHighlightListSize and ExtrGetHighlightByIndex will return some highlights selected by Extractor.

Extractor attempts to find one key sentence for each keyphrase that it finds. For a given keyphrase, it is possible that Extractor may not be able to find a good example of a sentence that contains the keyphrase. The function ExtrGetHighlightListSize will return the number of highlights that were generated. This number is always less than or equal to the number of keyphrases that were generated, as given by ExtrGetPhraseListSize.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


ExtrActivateHTMLFilter

Function header declaration:

int ExtrActivateHTMLFilter(void *DocumentMemory, void *StopMemory);

Input and output function arguments:

DocumentMemory: input
StopMemory:     input

Example of usage:

void *DocumentMemory;
void *StopMemory;
int ErrorCode;

ErrorCode = ExtrCreateStopMemory(&StopMemory);
ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
ErrorCode = ExtrActivateHTMLFilter(DocumentMemory, StopMemory);

Description:

This function signals that the document DocumentMemory contains HTML tags. By default, it is assumed that the document does not contain HTML tags. ExtrActivateHTMLFilter should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read. The main result of calling ExtrActivateHTMLFilter is that HTML tags will be parsed. Most tags are ignored, but some tags are used to identify sentence boundaries.

The HTML filter will also convert special symbol codes to the symbols that they represent. For example, "&eacute;" will be converted to "é".
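
To make the symbol conversion concrete, here is a minimal sketch of decoding a few HTML character entities (such as &eacute;) into single ISO-8859-1 bytes. It is an illustration only and does not reflect how Extractor implements its filter: the Entity table, its five entries, and the decode_entities function are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;        /* entity text, including '&' and ';' */
    unsigned char latin1;    /* ISO-8859-1 byte it stands for */
} Entity;

/* A deliberately tiny table; a real filter would know many more. */
static const Entity ENTITIES[] = {
    { "&amp;",    0x26 },
    { "&lt;",     0x3C },
    { "&gt;",     0x3E },
    { "&eacute;", 0xE9 },
    { "&egrave;", 0xE8 }
};

/* Copy `in` to `out`, replacing known entities by a single byte.
   Unknown entities are copied unchanged. The output is always
   NUL-terminated and truncated to fit `capacity`. Returns the
   number of bytes written, not counting the terminator. */
size_t decode_entities(const char *in, char *out, size_t capacity)
{
    size_t written = 0;
    const size_t count = sizeof ENTITIES / sizeof ENTITIES[0];
    assert(in != NULL && out != NULL && capacity > 0);
    while (*in != '\0' && written + 1 < capacity) {
        size_t i;
        int matched = 0;
        for (i = 0; i < count; i++) {
            size_t len = strlen(ENTITIES[i].name);
            if (strncmp(in, ENTITIES[i].name, len) == 0) {
                out[written++] = (char)ENTITIES[i].latin1;
                in += len;
                matched = 1;
                break;
            }
        }
        if (!matched) {
            out[written++] = *in++;
        }
    }
    out[written] = '\0';
    return written;
}
```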

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


ExtrActivateEmailFilter

Function header declaration:

int ExtrActivateEmailFilter(void *DocumentMemory, void *StopMemory);

Input and output function arguments:

DocumentMemory: input
StopMemory:     input

Example of usage:

void *DocumentMemory;
void *StopMemory;
int ErrorCode;

ErrorCode = ExtrCreateStopMemory(&StopMemory);
ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
ErrorCode = ExtrActivateEmailFilter(DocumentMemory, StopMemory);

Description:

This function signals that the document DocumentMemory contains an e-mail header. By default, it is assumed that the document does not contain an e-mail header. ExtrActivateEmailFilter should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read. The main result of calling ExtrActivateEmailFilter is that the e-mail header will be ignored, except for the "Subject" field.

Many e-mail gateways cannot handle 8 bit character codes. Often 8 bit character codes will be converted to 7 bit codes, for safe mailing. The e-mail filter will convert MIME quoted-printable 7 bit character codes back to 8 bit codes.
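
The quoted-printable conversion mentioned above can be sketched as follows. In this encoding an 8 bit byte is written as "=" followed by two hexadecimal digits, and "=" at the end of a line marks a soft line break. The hex_value and qp_decode functions are hypothetical names for this illustration and are not part of the Extractor API.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if `c` is not a hex digit. */
int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode quoted-printable text from `in` into `out`.
   "=XX" becomes the byte with hex value XX; "=" followed by a line
   break is a soft line break and is dropped. Decoding stops when
   `out` is full. Returns the number of bytes written; no terminator
   is added, since the result is 8 bit data. */
size_t qp_decode(const char *in, unsigned char *out, size_t capacity)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    while (*in != '\0' && n < capacity) {
        if (in[0] == '=' && hex_value(in[1]) >= 0 && hex_value(in[2]) >= 0) {
            out[n++] = (unsigned char)(hex_value(in[1]) * 16 + hex_value(in[2]));
            in += 3;
        } else if (in[0] == '=' && in[1] == '\r' && in[2] == '\n') {
            in += 3;                         /* soft break, CR LF */
        } else if (in[0] == '=' && in[1] == '\n') {
            in += 2;                         /* soft break, LF only */
        } else {
            out[n++] = (unsigned char)*in++; /* ordinary character */
        }
    }
    return n;
}
```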

The e-mail filter understands MIME types. E-mail attachments will be treated according to their MIME types. Keyphrases will be extracted from plain text and HTML attachments. Other types of attachments will be ignored. The HTML filter will be automatically activated if the MIME type indicates that the attachment is HTML. Therefore ExtrActivateHTMLFilter should not be called by the user when processing e-mail.

The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.

Note: Activating the e-mail filter with Japanese or Korean text will have no effect. It is not yet supported for Japanese or Korean.


ExtrDeactivateTextFilter

Function header declaration:

int ExtrDeactivateTextFilter(void *DocumentMemory);

Input and output function arguments:

DocumentMemory: input

Example of usage:

void *DocumentMemory;
int ErrorCode;

ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
ErrorCode = ExtrDeactivateTextFilter(DocumentMemory);

Description:

This function deactivates the plain text filter for DocumentMemory. By default, when the following conditions are met, the input document is assumed to be plain text:

  • the HTML filter has not been activated
  • the email filter has not been activated
  • the language has not been set to Japanese
  • the language has not been set to Korean

When these conditions are met, the plain text filter is activated. The plain text filter will attempt to remove non-textual items from the input document, such as tables and addresses. It will also attempt to use white space to determine the boundaries between titles, section headings, and regular paragraphs. If you do not want the plain text filter to process the input document in these ways, then call ExtrDeactivateTextFilter. Since calling ExtrDeactivateTextFilter will affect how the document is read, it should be called before any calls to ExtrReadDocumentBuffer.

    If the input document contains tabs, the text filter may interpret the lines with tabs as table rows. These lines may be skipped. If you suspect that the text filter is skipping lines that should be processed, then try calling ExtrDeactivateTextFilter.

    Internally, Extractor uses the characters 1D (hex) to mark a phrase boundary and 1E (hex) to mark a sentence boundary. The text filter automatically inserts these characters in a plain text document, by analyzing the white space in the document (i.e., line feeds, blanks, tabs, and carriage returns). For example, if two lines are separated by several line feeds (significant vertical white space), then the text filter will remove the white space and insert a sentence boundary marker. This automatic process works well for most plain text documents, but you may wish to write your own filter for a certain type of input document (e.g., a certain type of word processor file). You can run the document through your own filter program, and then send the resulting plain text to Extractor. In this case, you should call ExtrDeactivateTextFilter, but do not call ExtrActivateHTMLFilter or ExtrActivateEmailFilter. Your filter program can help Extractor by inserting markers for phrase boundaries (1D) and sentence boundaries (1E) in the appropriate places.
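
As an illustration of what a custom filter program might do before handing text to Extractor, the sketch below replaces each run of two or more line feeds with a single sentence boundary marker (1E hex). The function name mark_blank_lines is hypothetical, and the sketch assumes single-byte text with line feed line endings only; it is not a description of Extractor's own text filter.

```c
#include <assert.h>
#include <stddef.h>

#define PHRASE_BOUNDARY   0x1D  /* phrase marker, as described above */
#define SENTENCE_BOUNDARY 0x1E  /* sentence marker, as described above */

/* Copy `in` to `out`, replacing every run of two or more line feeds
   with one sentence boundary marker. Single line feeds are copied
   unchanged. The output is NUL-terminated and truncated to fit
   `capacity`. Returns the number of bytes written, not counting
   the terminator. */
size_t mark_blank_lines(const char *in, char *out, size_t capacity)
{
    size_t n = 0;
    assert(in != NULL && out != NULL && capacity > 0);
    while (*in != '\0' && n + 1 < capacity) {
        if (in[0] == '\n' && in[1] == '\n') {
            while (*in == '\n') {
                in++;                        /* skip the whole run */
            }
            out[n++] = SENTENCE_BOUNDARY;
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

The resulting buffer would then be passed to Extractor with the text filter deactivated, so that the markers inserted by the custom filter are not second-guessed.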

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrSetInputCode

    Function header declaration:

    int ExtrSetInputCode(void *DocumentMemory, int CharCodeID);
    

    Input and output function arguments:

    DocumentMemory: input
    CharCodeID:     input
    

    Example of usage:

    void *DocumentMemory;
    int ErrorCode;
    int CharCodeID;
    
    CharCodeID = 1;
    
    ErrorCode  = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode  = ExtrSetInputCode(DocumentMemory, CharCodeID);
    
    

    Description:

    A call to ExtrSetInputCode sets the document character code that Extractor uses to process the input text buffer. The character code is given by CharCodeID. ExtrCreateDocumentMemory must be called before ExtrSetInputCode.

    CharCodeID  Character Code  Compatible languages              Description
    0           ISO-8859-1      English, French, German, Spanish  ISO-8859-1 is also known as ISO Latin-1.
    1           MS-DOS          English, French, German, Spanish  MS-DOS is also known as MS-DOS Code Page 437.
    2           Unicode UCS2    All                               Unicode UCS2 double-byte characters, in native byte order.
    3           Shift-JIS       Japanese only                     SJIS, MS-Kanji, Code Page 932.
    4           JIS             Japanese only                     New, Old, NEC, ISO-2022-JP.
    5           EUC-JP          Japanese only                     Extended UNIX Code, Packed Format for Japanese.
    6           EUC-KR          Korean only                       KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
    7           Johab           Korean only                       Johab, KS X 1001:1992 alternate encoding.

    The supported Japanese character sets for all the Japanese encodings are:

    The supported Korean character sets for all the Korean encodings are:

    ISO-8859-1 and MS-DOS Code Page 437 agree on the coding of non-accented alphabetical characters. If there are no accents in the input text, and the text is in single-byte characters, then the choice between the two should not matter.

    Unicode UCS2 uses double-byte characters. UCS2 is sensitive to the byte ordering of the hardware platform (big endian versus little endian). Extractor handles UCS2 characters using the byte ordering of the hardware for which it is compiled (native byte ordering).
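
If UCS2 text arrives in a byte order that differs from the native order of the machine, the caller would need to swap the bytes before passing the buffer to Extractor. The following is a minimal sketch of that step; the three function names are hypothetical, and how the source byte order is determined (for example from a file format or a byte order mark) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 on a little-endian machine, 0 on a big-endian one. */
int native_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Swap the two bytes of every UCS2 code unit in place. */
void ucs2_swap_bytes(uint16_t *text, size_t count)
{
    size_t i;
    assert(text != NULL || count == 0);
    for (i = 0; i < count; i++) {
        text[i] = (uint16_t)((text[i] << 8) | (text[i] >> 8));
    }
}

/* Convert a buffer whose byte order is known into native order.
   `source_is_little_endian` is non-zero for little-endian input. */
void ucs2_to_native(uint16_t *text, size_t count,
                    int source_is_little_endian)
{
    if ((source_is_little_endian != 0) != native_is_little_endian()) {
        ucs2_swap_bytes(text, count);
    }
}
```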

    This function is optional for English, French, German, and Spanish, but required for Japanese and Korean. The default value of CharCodeID is zero.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrSetOutputCode

    Function header declaration:

    int ExtrSetOutputCode(void *DocumentMemory, int CharCodeID);
    

    Input and output function arguments:

    DocumentMemory: input
    CharCodeID:     input
    

    Example of usage:

    void *DocumentMemory;
    int ErrorCode;
    int CharCodeID;
    
    CharCodeID = 1;
    
    ErrorCode  = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode  = ExtrSetOutputCode(DocumentMemory, CharCodeID);
    
    

    Description:

    A call to ExtrSetOutputCode sets the document character code that Extractor uses for the output list of keyphrases. The character code is given by CharCodeID. ExtrCreateDocumentMemory must be called before ExtrSetOutputCode.

    CharCodeID  Character Code  Compatible languages              Description
    0           ISO-8859-1      English, French, German, Spanish  ISO-8859-1 is also known as ISO Latin-1.
    1           MS-DOS          English, French, German, Spanish  MS-DOS is also known as MS-DOS Code Page 437.
    2           Unicode UCS2    All                               Unicode UCS2 double-byte characters, in native byte order.
    3           Shift-JIS       Japanese only                     SJIS, MS-Kanji, Code Page 932.
    4           JIS             Japanese only                     New, Old, NEC, ISO-2022-JP.
    5           EUC-JP          Japanese only                     Extended UNIX Code, Packed Format for Japanese.
    6           EUC-KR          Korean only                       KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
    7           Johab           Korean only                       Johab, KS X 1001:1992 alternate encoding.

    The supported Japanese character sets for all the Japanese encodings are:

    The supported Korean character sets for all the Korean encodings are:

    ISO-8859-1 and MS-DOS Code Page 437 agree on the coding of non-accented alphabetical characters. If there are no accents in the input text, and the text is in single-byte characters, then the choice between the two should not matter.

    Unicode UCS2 uses double-byte characters. UCS2 is sensitive to the byte ordering of the hardware platform (big endian versus little endian). Extractor handles UCS2 characters using the byte ordering of the hardware for which it is compiled (native byte ordering).

    This function is optional for English, French, German, and Spanish, but required for Japanese and Korean. The default value of CharCodeID is zero.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrSetDocumentLanguage

    Function header declaration:

    int ExtrSetDocumentLanguage(void *DocumentMemory, int LanguageID);
    

    Input and output function arguments:

    DocumentMemory: input
    LanguageID:     input
    

    Example of usage:

    void *DocumentMemory;
    int ErrorCode;
    int LanguageID;
    
    LanguageID = 1;
    
    ErrorCode  = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode  = ExtrSetDocumentLanguage(DocumentMemory, LanguageID);
    
    

    Description:

    A call to ExtrSetDocumentLanguage sets the language that Extractor uses to process the input text buffer. The language is given by LanguageID. ExtrCreateDocumentMemory must be called before ExtrSetDocumentLanguage.

    LanguageID  Language   Description
    0           Automatic  Let Extractor automatically detect the language (for English, French, German, Spanish).
    1           English    Force Extractor to interpret the document as English.
    2           French     Force Extractor to interpret the document as French.
    3           Japanese   Force Extractor to interpret the document as Japanese.
    4           German     Force Extractor to interpret the document as German.
    5           Spanish    Force Extractor to interpret the document as Spanish.
    6           Korean     Force Extractor to interpret the document as Korean.

    This function is optional for English, French, German, and Spanish, but required for Japanese and Korean text. The default value of LanguageID is zero.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrSetNumberPhrases

    Function header declaration:

    int ExtrSetNumberPhrases(void *DocumentMemory, double DesiredNumber);
    

    Input and output function arguments:

    DocumentMemory: input
    DesiredNumber:  input
    

    Example of usage:

    void *DocumentMemory;
    int ErrorCode;
    double DesiredNumber;
    
    DesiredNumber = 9;
    
    ErrorCode  = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode  = ExtrSetNumberPhrases(DocumentMemory, DesiredNumber);
    
    

    Description:

    This function sets the desired number of output phrases. The default number is seven. This is the number that will be generated on average; the actual number of phrases that are output for a given document may be slightly less or slightly more than the number specified by DesiredNumber. Note that DesiredNumber is only set for the given document DocumentMemory. This is so that several documents may be processed simultaneously, each with a different desired number of keyphrases.

    The DesiredNumber must be between 3 and 30. Values outside of this range will be converted to the closest value inside the range. No error message will be generated when values are out of range.
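
Since out-of-range values are adjusted silently, a caller that wants to know the value actually in effect could apply the same rule itself. This is a small sketch of the documented range rule, with hypothetical names (MIN_PHRASES, MAX_PHRASES, clamp_desired_number); it is not Extractor's own code.

```c
#include <assert.h>

#define MIN_PHRASES 3.0    /* lower bound stated above */
#define MAX_PHRASES 30.0   /* upper bound stated above */

/* Return the closest value to `desired` inside [3, 30]. */
double clamp_desired_number(double desired)
{
    assert(desired == desired);   /* NaN is not a meaningful request */
    if (desired < MIN_PHRASES) {
        return MIN_PHRASES;
    }
    if (desired > MAX_PHRASES) {
        return MAX_PHRASES;
    }
    return desired;
}
```

Comparing the result with the original request shows whether the requested number was outside the supported range.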

    This function is optional. There is no need to call it unless you wish to override the default value of seven phrases.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrSetHighlightType

    Function header declaration:

    int ExtrSetHighlightType(void *DocumentMemory, int HighlightType);
    

    Input and output function arguments:

    DocumentMemory: input
    HighlightType:  input
    

    Example of usage:

    void *DocumentMemory;
    int ErrorCode;
    int HighlightType;
    
    HighlightType = 1 + 2 + 8;
    
    ErrorCode  = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode  = ExtrActivateHighlights(DocumentMemory);
    ErrorCode  = ExtrSetHighlightType(DocumentMemory, HighlightType);
    
    

    Description:

    A highlight is a key sentence. If ExtrActivateHighlights has been called, then Extractor attempts to find one key sentence for each keyphrase that it finds. The ExtrSetHighlightType function sets the type (i.e., style) of highlight that is generated. The following types of highlights are supported:

    HighlightType  HighlightType
    as integer     as bit string  Description of Type of Highlight
    0              <00000000>     • this is the default highlight type
                                  • sort highlights in same order as keyphrases
                                  • leave duplicate highlights, to preserve simple mapping between highlights and keyphrases
                                  • use no HTML markup in the highlights
                                  • try to trim longer sentences into shorter phrases, using a variety of heuristics
    1              <00000001>     • remove duplicate sentences from the list of highlights, in those cases where two or more keyphrases have the same corresponding highlight (key sentence)
                                  • by default (when the highlight type is zero), duplicates are left
    2              <00000010>     • sort the highlights by order of appearance in the text
                                  • the default is to sort highlights in the same order as the corresponding keyphrases (keyphrases are sorted in order of decreasing importance)
    4              <00000100>     • use full sentences
                                  • do not try to trim the sentences (trimming is the default behaviour)
    8              <00001000>     • mark up important words in bold
                                  • the default is to use no markup
    16             <00010000>     • mark up unimportant words in grey
                                  • the default is to trim the unimportant words
                                  • selecting this type will automatically select 4 (use full sentences)

    These types can be added; for example, type 5 is the combination of types 1 and 4 (duplicates removed, full sentences).
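
Because each highlight type occupies its own bit, adding types is the same as combining them with bitwise OR. The sketch below gives the documented values symbolic names and applies the rule that type 16 also selects type 4. The enum names and the effective_highlight_type function are hypothetical and are not defined by the Extractor API.

```c
#include <assert.h>

/* Hypothetical names for the documented HighlightType bit values. */
enum {
    HL_DEFAULT           = 0,
    HL_REMOVE_DUPLICATES = 1,
    HL_DOCUMENT_ORDER    = 2,
    HL_FULL_SENTENCES    = 4,
    HL_BOLD_IMPORTANT    = 8,
    HL_GREY_UNIMPORTANT  = 16
};

/* Return the type that is effectively in force: marking unimportant
   words in grey (16) implies full sentences (4). */
int effective_highlight_type(int highlight_type)
{
    assert(highlight_type >= 0 && highlight_type < 32);
    if (highlight_type & HL_GREY_UNIMPORTANT) {
        highlight_type |= HL_FULL_SENTENCES;
    }
    return highlight_type;
}
```

For instance, HL_REMOVE_DUPLICATES | HL_DOCUMENT_ORDER | HL_BOLD_IMPORTANT equals 11, the same value as 1 + 2 + 8.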

    This function is optional. There is no need to call it unless you wish to override the default value of zero.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrAddStopWord

    Function header declaration:

    int ExtrAddStopWord(void *StopMemory, int LanguageID, int CharCodeID, void *Word);
    

    Input and output function arguments:

    StopMemory: input
    LanguageID: input
    CharCodeID: input
    Word:       input
    

    Example of usage:

    void *StopMemory;
    int ErrorCode;
    
    int LanguageID = 1;
    int CharCodeID = 1;
    char *Word = "the";
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrAddStopWord(StopMemory, LanguageID, CharCodeID, (void *) Word);
    

    Description:

    This function adds the string Word to the list of stop words stored in the memory at StopMemory. The stop words are stored in a hash table. It does no harm to try to store the same word twice. It is assumed that Word is in lower case and that Word is a single word (containing no white space).

    Stop words are stored separately for each language. The language is given by LanguageID. ExtrAddStopWord will return a non-zero error code if LanguageID is invalid or if Word contains anything other than lower case characters.

    LanguageID  Language  Description
    1           English   Add the given stop word to the English stop words.
    2           French    Add the given stop word to the French stop words.
    4           German    Add the given stop word to the German stop words.
    5           Spanish   Add the given stop word to the Spanish stop words.
    6           Korean    Add the given stop word to the Korean stop words.

    The character code is given by CharCodeID. Word is of type void * so that either single-byte or double-byte character strings can be passed to this function.

    CharCodeID  Character Code  Language                          Description
    0           ISO-8859-1      English, French, German, Spanish  ISO-8859-1 is also known as ISO Latin-1.
    1           MS-DOS          English, French, German, Spanish  MS-DOS is also known as MS-DOS Code Page 437.
    2           Unicode UCS2    All                               Unicode UCS2 double-byte characters, in native byte order.
    6           EUC-KR          Korean only                       KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
    7           Johab           Korean only                       Johab, KS X 1001:1992 alternate encoding.

    ExtrAddStopWord should be called before any calls to ExtrReadDocumentBuffer, since it will affect how the document is read.

    When the stop word list is first created, by ExtrCreateStopMemory, it is initialized with a list of common stop words. It may not be necessary to add any extra stop words. That is, it may not be necessary to call ExtrAddStopWord.

    A stop word is a word that is not allowed in a keyphrase. For example, "the" is a stop word. A stop phrase is a phrase that is not allowed as a keyphrase. The distinction between a stop word and a single-word stop phrase is that a keyphrase will be rejected if it contains a given stop word, but it will only be rejected if it exactly matches a given stop phrase. For example, if "access" is a stop word, then the phrase "information access" will be rejected. If "access" is a stop phrase, then the phrase "information access" is acceptable, although the single-word phrase "access" will be rejected.

    Note: At this time, you cannot add new stop words for Japanese text. However, you can add new Japanese stop phrases.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrAddStopPhrase

    Function header declaration:

    int ExtrAddStopPhrase(void *StopMemory, int LanguageID, int CharCodeID, void *Phrase);
    

    Input and output function arguments:

    StopMemory: input
    LanguageID: input
    CharCodeID: input
    Phrase:     input
    

    Example of usage:

    void *StopMemory;
    int ErrorCode;
    
    char *Phrase = "access";
    int LanguageID = 1;
    int CharCodeID = 1;
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrAddStopPhrase(StopMemory, LanguageID, CharCodeID, (void *) Phrase);
    

    Description:

    This function adds the string Phrase to the list of stop phrases stored in the memory at StopMemory. The stop phrases are stored in a hash table. It does no harm to try to store the same phrase twice. It is assumed that Phrase is in lower case. Phrase may be one, two, or three words, separated by a single space.

    Stop phrases are stored separately for each language. The language is given by LanguageID. ExtrAddStopPhrase will return a non-zero error code if LanguageID is invalid or if Phrase contains anything other than lower case characters and spaces.

    LanguageID  Language  Description
    1           English   Add the given stop phrase to the English stop phrases.
    2           French    Add the given stop phrase to the French stop phrases.
    3           Japanese  Add the given stop phrase to the Japanese stop phrases.
    4           German    Add the given stop phrase to the German stop phrases.
    5           Spanish   Add the given stop phrase to the Spanish stop phrases.
    6           Korean    Add the given stop phrase to the Korean stop phrases.

    The character code is given by CharCodeID. Phrase is of type void * so that either single-byte or double-byte character strings can be passed to this function.

    CharCodeID  Character Code  Language                          Description
    0           ISO-8859-1      English, French, German, Spanish  ISO-8859-1 is also known as ISO Latin-1.
    1           MS-DOS          English, French, German, Spanish  MS-DOS is also known as MS-DOS Code Page 437.
    2           Unicode UCS2    All                               Unicode UCS2 double-byte characters, in native byte order.
    3           Shift-JIS       Japanese only                     SJIS, MS-Kanji, Code Page 932.
    4           JIS             Japanese only                     New, Old, NEC, ISO-2022-JP.
    5           EUC-JP          Japanese only                     Extended UNIX Code, Packed Format for Japanese.
    6           EUC-KR          Korean only                       KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
    7           Johab           Korean only                       Johab, KS X 1001:1992 alternate encoding.

    The supported Japanese character sets for all the Japanese encodings are:

    The supported Korean character sets for all the Korean encodings are:

    When the stop phrase list is first created, by ExtrCreateStopMemory, it is initialized with a list of common stop phrases. It may not be necessary to add any extra stop phrases. That is, it may not be necessary to call ExtrAddStopPhrase.

    A stop word is a word that is not allowed in a keyphrase. For example, "the" is a stop word. A stop phrase is a phrase that is not allowed as a keyphrase. The distinction between a stop word and a single-word stop phrase is that a keyphrase will be rejected if it contains a given stop word, but it will only be rejected if it exactly matches a given stop phrase. For example, if "access" is a stop word, then the phrase "information access" will be rejected. If "access" is a stop phrase, then the phrase "information access" is acceptable, although the single-word phrase "access" will be rejected.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrAddGoPhrase

    Function header declaration:

    int ExtrAddGoPhrase(void *StopMemory, int LanguageID, 
                        int CharCodeID, void *Phrase, int MatchType);
    

    Input and output function arguments:

    StopMemory: input
    LanguageID: input
    CharCodeID: input
    Phrase:     input
    MatchType:  input
    

    Example of usage:

    void *StopMemory;
    int ErrorCode;
    
    char *Phrase = "National Research Council";
    int LanguageID = 1;
    int CharCodeID = 1;
    int MatchType  = 3;
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrAddGoPhrase(StopMemory, LanguageID, CharCodeID, 
                    (void *) Phrase, MatchType);
    

    Description:

    If the input document was found by issuing a query to a search engine, the user may have a special interest in whether the query terms appear in the document, and the context in which the query terms appear. This can be achieved by calling the function ExtrAddGoPhrase with each of the terms in the query.
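    One way to feed query terms to ExtrAddGoPhrase is to split the query on spaces and hand each term to a callback. The sketch below is an assumption about how an application might organize this; TermCallback, for_each_query_term, and count_term are hypothetical names, and in a real application the callback would wrap a call to ExtrAddGoPhrase with the desired LanguageID, CharCodeID, and MatchType.

```c
#include <string.h>

/* Called once per query term; returns zero on success, like the Extr* calls.
   Hypothetical type, not part of the Extractor API. */
typedef int (*TermCallback)(const char *term, void *context);

/* Splits `query` on spaces and passes each term to `callback`.
   Stops at the first non-zero error code and returns it. */
int for_each_query_term(const char *query, TermCallback callback, void *context)
{
    char term[128];
    const char *p = query;

    while (*p != '\0') {
        size_t len;
        int error;

        p += strspn(p, " ");        /* skip spaces before the term */
        len = strcspn(p, " ");      /* length of the next term */
        if (len == 0)
            break;                  /* only trailing spaces were left */
        if (len < sizeof term) {    /* over-long terms are skipped */
            memcpy(term, p, len);
            term[len] = '\0';
            error = callback(term, context);
            if (error != 0)
                return error;
        }
        p += len;
    }
    return 0;
}

/* Stand-in callback for experimentation: counts the terms it receives. */
int count_term(const char *term, void *context)
{
    (void) term;
    ++*(int *) context;
    return 0;
}
```

    Routing the work through a callback keeps the tokenizer independent of the DLL, so it can be tried out before the go phrase call is wired in.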

    This function adds the string Phrase to the list of go phrases stored in the memory at StopMemory. A go phrase is a phrase that will be treated as if it were a keyphrase, if it appears in the input document. Go phrases are stored in a list and each sentence in the input document is scanned for each go phrase in the list. This has two important implications: (1) A large list of go phrases may slow the execution of Extractor. (2) A go phrase in the input document will not be detected if it spans a sentence boundary.

    A go phrase may consist of one or more words or fragments of words. Any character sequence is permitted, except for an empty string. The letters may be in upper or lower case. A go phrase may range from a single character to a full sentence. A go phrase may contain punctuation.

    Go phrases are stored separately for each language. The language is given by LanguageID. ExtrAddGoPhrase will return a non-zero error code if LanguageID is invalid or if CharCodeID is not compatible with LanguageID.

    The following types of matches are supported:

    MatchType (as integer), MatchType (as bit string), and description of MatchType:

    0  <00000000>
  • any case: match go phrases to sentences without regard to upper case or lower case
  • any place: match go phrases to sentences without regard to whether the phrase matches a whole word or a fragment of a word
  • any width: for Japanese and Korean, match go phrases to sentences without regard to the character width (fullwidth or halfwidth)
  • any accent: match go phrases to sentences without regard to accents (e.g., "role" will match with "rôle")
  • single substring: treat the go phrase as a single substring (e.g., "ski holiday" can match with "ski holidays", but not with "skiing holidays")
  • exact spaces: a go phrase will only match a sentence with the same spacing as the go phrase (e.g., "broadband" will not match with "broad band")
  • exact sound marks: for Japanese, a go phrase will only match a sentence with the same sound marks as the go phrase (the prolonged sound mark, the voiced sound mark (dakuten), and the semi-voiced sound mark (handakuten))

    1  <00000001>
  • exact case: only match a go phrase to a sentence if the corresponding letters have the same case (upper case or lower case)
  • setting this bit will disable any case (see <00000000>)

    2  <00000010>
  • whole word: only match a go phrase to a sentence if the boundaries of the match do not cut across any words (e.g., "science" will not match with "sciences")
  • setting this bit will disable any place (see <00000000>)

    4  <00000100>
  • exact width: for Japanese and Korean, only match a go phrase to a sentence if the corresponding letters have the same width (fullwidth or halfwidth)
  • setting this bit will disable any width (see <00000000>)

    8  <00001000>
  • exact accent: only match a go phrase to a sentence if the corresponding letters have the same accents (e.g., "role" will not match with "rôle")
  • setting this bit will disable any accent (see <00000000>)

    16  <00010000>
  • multiple substrings: if the go phrase contains several words or fragments of words, then each word in the go phrase can be a substring in a corresponding series of words in a sentence (e.g., "ski holiday" can match with "skiing holidays")
  • if multiple substrings and whole word are both selected, then they will be applied as two separate operations (e.g., first try whole word and single substring, then try any place and multiple substrings), rather than one simultaneous operation
  • if multiple substrings and ignore spaces are both selected, then they will be applied as two separate operations, rather than one simultaneous operation
  • this type of matching is particularly helpful for Korean, but it can also be useful for other languages
  • setting this bit will disable single substring (see <00000000>)

    32  <00100000>
  • ignore spaces: when matching a go phrase to a sentence, ignore spaces in the go phrase and the sentence (e.g., "broadband" will match with "broad band" and vice versa)
  • if ignore spaces and whole word are both selected, then they will be applied as two separate operations (e.g., first try whole word and exact spaces, then try any place and ignore spaces), rather than one simultaneous operation
  • if multiple substrings and ignore spaces are both selected, then they will be applied as two separate operations, rather than one simultaneous operation
  • this type of matching is particularly helpful for German and Korean, where it is common to make new words by combining existing words
  • setting this bit will disable exact spaces (see <00000000>)

    64  <01000000>
  • standardize sound marks: for Japanese, when matching a go phrase to a sentence, first standardize all sound marks (the prolonged sound mark, the voiced sound mark (dakuten), and the semi-voiced sound mark (handakuten))
  • selecting this will also select any width (i.e., match go phrases to sentences without regard to the character width (fullwidth or halfwidth))
  • this type of matching is only useful for Japanese
  • setting this bit will disable exact sound marks (see <00000000>)

    These types can be added; for example, type 5 is the combination of types 1 and 4 (exact case and exact width). The strictest matching is type 15 (1 + 2 + 4 + 8). The most liberal matching is type 112 (16 + 32 + 64). Type 0 is relatively liberal, but avoids some of the more computationally intensive matching operations. It strikes a balance between liberalness and efficiency. If a given type of matching does not make sense with the given character set (e.g., exact width does not make sense with ISO-8859-1), then it will be ignored. (It won't cause any harm.)
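    Because each match type occupies its own bit, combined types can be built with bitwise OR instead of adding integers by hand. The constant names below are hypothetical (the API takes a plain int for MatchType); the values are the ones listed above.

```c
/* Hypothetical names for the MatchType bits; not defined by the Extractor API. */
enum {
    MATCH_DEFAULT                 = 0,   /* <00000000> */
    MATCH_EXACT_CASE              = 1,   /* <00000001> */
    MATCH_WHOLE_WORD              = 2,   /* <00000010> */
    MATCH_EXACT_WIDTH             = 4,   /* <00000100> */
    MATCH_EXACT_ACCENT            = 8,   /* <00001000> */
    MATCH_MULTIPLE_SUBSTRINGS     = 16,  /* <00010000> */
    MATCH_IGNORE_SPACES           = 32,  /* <00100000> */
    MATCH_STANDARDIZE_SOUND_MARKS = 64   /* <01000000> */
};

/* Strictest combination described in the text: 1 + 2 + 4 + 8. */
int strictest_match_type(void)
{
    return MATCH_EXACT_CASE | MATCH_WHOLE_WORD |
           MATCH_EXACT_WIDTH | MATCH_EXACT_ACCENT;
}

/* Most liberal combination described in the text: 16 + 32 + 64. */
int most_liberal_match_type(void)
{
    return MATCH_MULTIPLE_SUBSTRINGS | MATCH_IGNORE_SPACES |
           MATCH_STANDARDIZE_SOUND_MARKS;
}

/* Returns 1 if `flag` is set in `match_type`, 0 otherwise. */
int match_type_has(int match_type, int flag)
{
    return (match_type & flag) != 0;
}
```

    For instance, the value 3 used in the usage example corresponds to MATCH_EXACT_CASE | MATCH_WHOLE_WORD.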

    When go phrases are found in the input document, they will be inserted at the top of the keyphrase list. They will take priority over the regular keyphrases. The length of the keyphrase list will be kept at the value set by ExtrSetNumberPhrases. For each go phrase that is added to the top of the keyphrase list, a regular keyphrase will be deleted from the bottom of the keyphrase list. (Note that Extractor ranks the keyphrases in order of decreasing estimated importance.) A go phrase can be distinguished from a regular keyphrase (a keyphrase generated automatically by Extractor) by its score. All go phrases are given a score of zero, but a regular keyphrase never has a score of zero.
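    The zero-score rule gives a simple test for telling go phrases apart from regular keyphrases. The sketch below assumes the application has already collected the scores (for example, with ExtrGetScoreByIndex) into an array; count_go_phrases is a hypothetical helper name.

```c
/* Counts the go phrases in a list of keyphrase scores, relying on the rule
   that go phrases score exactly zero and regular keyphrases never do.
   `scores` holds one score per keyphrase, in list order. */
int count_go_phrases(const double *scores, int phrase_list_size)
{
    int i;
    int count = 0;

    for (i = 0; i < phrase_list_size; i++) {
        if (scores[i] == 0.0)
            count++;
    }
    return count;
}
```

    Since go phrases are inserted at the top of the list, the returned count is also the index of the first regular keyphrase.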

    When a go phrase is found, it is inserted into the keyphrase list in exactly the same form as it was given to ExtrAddGoPhrase. This may be different from the form it has in the input document, depending on MatchType.

    If highlights have been activated (by ExtrActivateHighlights), then each go phrase that is found in the input document will have a corresponding highlight. Extractor attempts to find a good sentence to illustrate each go phrase. If bold markup is set (by ExtrSetHighlightType), then the go phrases will be marked in bold within the corresponding highlights. Neighbouring words and characters may also be marked in bold, if they appear to be closely connected to the go phrase.

    A go phrase might appear in the document, and yet not be found by Extractor. If the go phrase spans a sentence boundary, it will not be detected. For example, "home cooking" will not be found in the text "Pasta is popular in our home. Cooking pasta is easy." Also, if the input document is very long, Extractor may not read the full document, since it should be possible to make a good summary without reading the full text. Therefore, if the go phrase only appears at the end of a very long document, it might not be detected by Extractor. Finally, the number of go phrases that will be found is limited by the desired number of keyphrases, set by ExtrSetNumberPhrases. If the number of go phrases in the input document is greater than the desired number of keyphrases, then the go phrases that appear earlier in the text will be given priority.

    The following languages are supported:

    LanguageID  Language  Description
    1           English   Add the given go phrase to the English go phrases.
    2           French    Add the given go phrase to the French go phrases.
    3           Japanese  Add the given go phrase to the Japanese go phrases.
    4           German    Add the given go phrase to the German go phrases.
    5           Spanish   Add the given go phrase to the Spanish go phrases.
    6           Korean    Add the given go phrase to the Korean go phrases.

    The character code is given by CharCodeID. Phrase is of type void * so that either single-byte or double-byte character strings can be passed to this function.

    CharCodeID  Character Code  Language                          Description
    0           ISO-8859-1      English, French, German, Spanish  ISO-8859-1 is also known as ISO Latin-1.
    1           MS-DOS          English, French, German, Spanish  MS-DOS is also known as MS-DOS Code Page 437.
    2           Unicode UCS2    All                               Unicode UCS2 double-byte characters, in native byte order.
    3           Shift-JIS       Japanese only                     SJIS, MS-Kanji, Code Page 932.
    4           JIS             Japanese only                     New, Old, NEC, ISO-2022-JP.
    5           EUC-JP          Japanese only                     Extended UNIX Code, Packed Format for Japanese.
    6           EUC-KR          Korean only                       KS C 5601-1987, KSC5601, Extended UNIX Code, Packed Format for Korean, Code Page 949.
    7           Johab           Korean only                       Johab, KS X 1001:1992 alternate encoding.
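    An application can check a LanguageID and CharCodeID pair before calling ExtrAddGoPhrase, rather than waiting for a non-zero error code. The function below simply mirrors the two tables in this section; char_code_is_compatible is a hypothetical helper, and the check Extractor performs internally may differ.

```c
/* Returns 1 if the character code may be used with the language, according
   to the LanguageID and CharCodeID tables; returns 0 otherwise.
   Hypothetical helper, not part of the Extractor API. */
int char_code_is_compatible(int language_id, int char_code_id)
{
    switch (char_code_id) {
    case 0:   /* ISO-8859-1 */
    case 1:   /* MS-DOS */
        return language_id == 1 || language_id == 2 ||   /* English, French */
               language_id == 4 || language_id == 5;     /* German, Spanish */
    case 2:   /* Unicode UCS2: all supported languages */
        return language_id >= 1 && language_id <= 6;
    case 3:   /* Shift-JIS */
    case 4:   /* JIS */
    case 5:   /* EUC-JP */
        return language_id == 3;                         /* Japanese only */
    case 6:   /* EUC-KR */
    case 7:   /* Johab */
        return language_id == 6;                         /* Korean only */
    default:
        return 0;                                        /* unknown code */
    }
}
```

    The pair used in the usage example, LanguageID 1 (English) with CharCodeID 1 (MS-DOS), passes this check.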

    The supported Japanese character sets for all the Japanese encodings are:

    The supported Korean character sets for all the Korean encodings are:

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrReadDocumentBuffer

    Function header declaration:

    int ExtrReadDocumentBuffer(void *DocumentMemory, void *StopMemory,
                               void *DocumentBuffer, int BufferLength);
    

    Input and output function arguments:

    DocumentMemory: input
    StopMemory:     input
    DocumentBuffer: input
    BufferLength:   input
    

    Example of usage:

    void *DocumentMemory;
    void *StopMemory;
    int ErrorCode;
    int BufferLength;
    char DocumentBuffer[300];
    
    strcpy(DocumentBuffer, "This is an example of some text.");
    BufferLength = strlen(DocumentBuffer);
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode = ExtrReadDocumentBuffer(DocumentMemory, StopMemory,
                    (void *) DocumentBuffer, BufferLength);
    

    Description:

    This function reads the text in the buffer DocumentBuffer and updates the memory at DocumentMemory. The processing of the buffer is affected by StopMemory.

    In a typical application, there will be a series of calls to ExtrReadDocumentBuffer for a given document DocumentMemory. The idea is that the document is read in chunks. A call to ExtrSignalDocumentEnd signals that the last chunk has been sent (the end of the given document has been reached).

    A call to ExtrReadDocumentBuffer will change the memory at DocumentMemory, but the memory at StopMemory will not be modified. If there are multiple threads, each thread will have a unique value for DocumentMemory, but several threads may share StopMemory.

    The buffer DocumentBuffer may contain single-byte or double-byte characters (see ExtrSetInputCode). This is why it is of type void *. The buffer length BufferLength specifies the number of bytes in the buffer, not the number of characters. When the character code (set by ExtrSetInputCode) indicates double-byte characters, BufferLength must be an even number. That is, the end of the buffer is not allowed to divide a double-byte character into two parts.
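    When a document is sent in chunks, the even-length rule for double-byte text can be enforced while choosing each chunk size. This is a minimal sketch under the assumption that the application knows whether the input code is double-byte; next_chunk_length and count_chunks are hypothetical helpers, and the comment marks where ExtrReadDocumentBuffer would be called.

```c
/* Returns the number of bytes to pass in the next call, at most
   `max_chunk_bytes`. For double-byte input the result is rounded down to an
   even number so that no character is split. Returns 0 when nothing usable
   is left. */
int next_chunk_length(int bytes_remaining, int max_chunk_bytes, int double_byte)
{
    int length = bytes_remaining < max_chunk_bytes ? bytes_remaining
                                                   : max_chunk_bytes;
    if (length < 0)
        length = 0;
    if (double_byte && (length % 2) != 0)
        length--;
    return length;
}

/* Walks through a document of `total_bytes` bytes and returns how many
   chunks would be sent. */
int count_chunks(int total_bytes, int max_chunk_bytes, int double_byte)
{
    int offset = 0;
    int chunks = 0;

    for (;;) {
        int length = next_chunk_length(total_bytes - offset,
                                       max_chunk_bytes, double_byte);
        if (length <= 0)
            break;
        /* A real application would call ExtrReadDocumentBuffer here with
           the bytes at `offset` and a BufferLength of `length`. */
        offset += length;
        chunks++;
    }
    return chunks;
}
```

    A trailing odd byte in double-byte input is never sent by this loop; such input is malformed and would need separate handling.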

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrSignalDocumentEnd

    Function header declaration:

    int ExtrSignalDocumentEnd(void *DocumentMemory, void *StopMemory);
    

    Input and output function arguments:

    DocumentMemory: input
    StopMemory:     input
    

    Example of usage:

    void *DocumentMemory;
    void *StopMemory;
    int ErrorCode;
    int BufferLength;
    char DocumentBuffer[300];
    
    strcpy(DocumentBuffer, "This is an example of some text.");
    BufferLength = strlen(DocumentBuffer);
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode = ExtrReadDocumentBuffer(DocumentMemory, StopMemory,
                    (void *) DocumentBuffer, BufferLength);
    
    strcpy(DocumentBuffer, "Here is some more text.");
    BufferLength = strlen(DocumentBuffer);
    
    ErrorCode = ExtrReadDocumentBuffer(DocumentMemory, StopMemory,
                    (void *) DocumentBuffer, BufferLength);
    ErrorCode = ExtrSignalDocumentEnd(DocumentMemory, StopMemory);
    

    Description:

    A call to ExtrSignalDocumentEnd signals that the end of the document has been reached; there will be no further calls to ExtrReadDocumentBuffer with this particular DocumentMemory. This signal triggers the generation of the final list of keyphrases.

    The phrases in the final list of keyphrases are compared with the list of stop phrases in StopMemory and any matching phrases are deleted from the final list of keyphrases. Case is ignored for matching, but otherwise an exact match is required.

    ExtrSignalDocumentEnd should only be called once for a given document DocumentMemory. After ExtrSignalDocumentEnd has been called for a given document, that document has no further need for the stop words and stop phrases stored in StopMemory. Unless there are other documents that will need StopMemory, the memory used by StopMemory may be released after ExtrSignalDocumentEnd has been called.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetPhraseListSize

    Function header declaration:

    int ExtrGetPhraseListSize(void *DocumentMemory, int *PhraseListSize);
    

    Input and output function arguments:

    DocumentMemory: input
    PhraseListSize: output
    

    Example of usage:

    void *DocumentMemory;
    void *StopMemory;
    int ErrorCode;
    int BufferLength;
    char DocumentBuffer[300];
    int PhraseListSize;
    
    strcpy(DocumentBuffer, "This is an example of some text.");
    BufferLength = strlen(DocumentBuffer);
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode = ExtrReadDocumentBuffer(DocumentMemory, StopMemory,
                    (void *) DocumentBuffer, BufferLength);
    ErrorCode = ExtrSignalDocumentEnd(DocumentMemory, StopMemory);
    ErrorCode = ExtrGetPhraseListSize(DocumentMemory, &PhraseListSize);
    

    Description:

    The function ExtrGetPhraseListSize returns an integer value that is the number of keyphrases that were generated. If there is an error, PhraseListSize will be set to zero.

    ExtrGetPhraseListSize may be called repeatedly for a given document. It does not modify the memory at DocumentMemory. ExtrGetPhraseListSize should not be called until after ExtrSignalDocumentEnd has been called.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetPhraseByIndex

    Function header declaration:

    int ExtrGetPhraseByIndex(void *DocumentMemory, int PhraseIndex, void **Phrase);
    

    Input and output function arguments:

    DocumentMemory: input
    PhraseIndex:    input
    Phrase:         output
    

    Example of usage:

    void *DocumentMemory;
    void *StopMemory;
    int ErrorCode;
    int BufferLength;
    char DocumentBuffer[300];
    int PhraseIndex;
    char *Phrase;
    
    strcpy(DocumentBuffer, "This is an example of some text.");
    BufferLength = strlen(DocumentBuffer);
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode = ExtrReadDocumentBuffer(DocumentMemory, StopMemory,
                    (void *) DocumentBuffer, BufferLength);
    ErrorCode = ExtrSignalDocumentEnd(DocumentMemory, StopMemory);
    
    PhraseIndex = 3;
    
    ErrorCode = ExtrGetPhraseByIndex(DocumentMemory, PhraseIndex, 
                    (void **) &Phrase);
    

    Description:

    A call to ExtrGetPhraseByIndex returns a pointer to a string. The string is phrase number PhraseIndex. PhraseIndex ranges from zero to PhraseListSize minus one. Phrases are approximately in order of decreasing quality. ExtrSignalDocumentEnd must be called before ExtrGetPhraseByIndex.

    The string Phrase may contain single-byte or double-byte characters (see ExtrSetOutputCode). This is why it is of type void **.

    The memory where Phrase is stored will be cleared when ExtrClearDocumentMemory is called. The application should copy Phrase into a more permanent location.
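    Copying the phrase might look like the sketch below. It assumes a single-byte output code, so that the string ends at the first zero byte; for Unicode UCS2 output the length would have to be measured in double-byte units instead. copy_phrase is a hypothetical helper, not part of the API.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated copy of a NUL-terminated single-byte string, or
   NULL if `phrase` is NULL or memory cannot be allocated. The caller owns
   the copy and must release it with free(). */
char *copy_phrase(const char *phrase)
{
    size_t size;
    char *copy;

    if (phrase == NULL)
        return NULL;
    size = strlen(phrase) + 1;        /* include the terminator */
    copy = (char *) malloc(size);
    if (copy != NULL)
        memcpy(copy, phrase, size);
    return copy;
}
```

    The copy remains valid after ExtrClearDocumentMemory has been called, which is the point of making it.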

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetScoreByIndex

    Function header declaration:

    int ExtrGetScoreByIndex(void *DocumentMemory, int PhraseIndex, double *Score);
    

    Input and output function arguments:

    DocumentMemory: input
    PhraseIndex:    input
    Score:          output
    

    Example of usage:

    void *DocumentMemory;
    void *StopMemory;
    int ErrorCode;
    int BufferLength;
    char DocumentBuffer[300];
    int PhraseIndex;
    double Score;
    
    strcpy(DocumentBuffer, "This is an example of some text.");
    BufferLength = strlen(DocumentBuffer);
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode = ExtrReadDocumentBuffer(DocumentMemory, StopMemory,
                    (void *) DocumentBuffer, BufferLength);
    ErrorCode = ExtrSignalDocumentEnd(DocumentMemory, StopMemory);
    
    PhraseIndex = 3;
    
    ErrorCode = ExtrGetScoreByIndex(DocumentMemory, PhraseIndex, &Score);
    

    Description:

    A call to ExtrGetScoreByIndex copies a number into the location given by the pointer. The number is the score assigned to phrase number PhraseIndex. PhraseIndex ranges from zero to PhraseListSize minus one. The score of a phrase is an estimate of its value as a keyphrase. Keyphrases are ranked in order of descending score. ExtrSignalDocumentEnd must be called before ExtrGetScoreByIndex.

    This function is optional. There is no need to call it unless you are curious about the score that is assigned to a phrase.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetDocumentLanguage

    Function header declaration:

    int ExtrGetDocumentLanguage(void *DocumentMemory, int *LanguageID);
    

    Input and output function arguments:

    DocumentMemory: input
    LanguageID:     output
    

    Example of usage:

    void *DocumentMemory;
    void *StopMemory;
    int ErrorCode;
    int BufferLength;
    char DocumentBuffer[300];
    int LanguageID;
    
    strcpy(DocumentBuffer, "This is an example of some text.");
    BufferLength = strlen(DocumentBuffer);
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode = ExtrReadDocumentBuffer(DocumentMemory, StopMemory,
                    (void *) DocumentBuffer, BufferLength);
    ErrorCode = ExtrSignalDocumentEnd(DocumentMemory, StopMemory);
    
    ErrorCode = ExtrGetDocumentLanguage(DocumentMemory, &LanguageID);
    
    

    Description:

    A call to ExtrGetDocumentLanguage gets the language of the document. If the language was set by a call to ExtrSetDocumentLanguage, then ExtrGetDocumentLanguage returns the same value that was specified with ExtrSetDocumentLanguage. If Extractor was allowed to guess the language, then ExtrGetDocumentLanguage returns the best guess. LanguageID is passed by reference and is modified in the function.

    LanguageID  Language  Description
    0           Unknown   Extractor was not able to guess, or the language is neither English, French, German, nor Spanish.
    1           English   Extractor guessed English, or English was specified by ExtrSetDocumentLanguage.
    2           French    Extractor guessed French, or French was specified by ExtrSetDocumentLanguage.
    3           Japanese  Japanese was specified by ExtrSetDocumentLanguage.
    4           German    Extractor guessed German, or German was specified by ExtrSetDocumentLanguage.
    5           Spanish   Extractor guessed Spanish, or Spanish was specified by ExtrSetDocumentLanguage.
    6           Korean    Korean was specified by ExtrSetDocumentLanguage.

    This function is optional. There is no need to call it unless you wish to know which language Extractor guessed (English, French, German, or Spanish). Note that language guessing is currently not available for Japanese or Korean.
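    For logging or display, the returned LanguageID can be translated back into a name. The two helpers below are hypothetical conveniences built from the table above; they are not part of the Extractor API.

```c
/* Maps a LanguageID from ExtrGetDocumentLanguage to a language name.
   IDs outside the table are reported as "Unknown", like ID 0. */
const char *language_name(int language_id)
{
    switch (language_id) {
    case 1:  return "English";
    case 2:  return "French";
    case 3:  return "Japanese";
    case 4:  return "German";
    case 5:  return "Spanish";
    case 6:  return "Korean";
    default: return "Unknown";
    }
}

/* Returns 1 for the languages Extractor can guess (English, French, German,
   Spanish) and 0 for those that must be set explicitly or are unknown. */
int language_can_be_guessed(int language_id)
{
    return language_id == 1 || language_id == 2 ||
           language_id == 4 || language_id == 5;
}
```

    A result for which language_can_be_guessed returns 0 but language_name is not "Unknown" (Japanese or Korean) can only have come from an earlier call to ExtrSetDocumentLanguage.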

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetHighlightListSize

    Function header declaration:

    int ExtrGetHighlightListSize(void *DocumentMemory, int *HighlightListSize);
    

    Input and output function arguments:

    DocumentMemory:    input
    HighlightListSize: output
    

    Example of usage:

    void *DocumentMemory;
    void *StopMemory;
    int ErrorCode;
    int BufferLength;
    char DocumentBuffer[300];
    int HighlightListSize;
    
    strcpy(DocumentBuffer, "This is an example of some text.");
    BufferLength = strlen(DocumentBuffer);
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode = ExtrActivateHighlights(DocumentMemory);
    ErrorCode = ExtrReadDocumentBuffer(DocumentMemory, StopMemory,
                    (void *) DocumentBuffer, BufferLength);
    ErrorCode = ExtrSignalDocumentEnd(DocumentMemory, StopMemory);
    ErrorCode = ExtrGetHighlightListSize(DocumentMemory, &HighlightListSize);
    

    Description:

    The function ExtrGetHighlightListSize returns an integer value that is the number of highlights that were generated. If there is an error, HighlightListSize will be set to zero.

    The number of highlights will be less than or equal to the number of keyphrases. There are two reasons that the number of highlights might be less than the number of keyphrases. First, when HighlightType is an odd number, Extractor removes any duplicate highlights. Second, there may be keyphrases for which no acceptable highlights were found. Therefore, for all values of HighlightType, it cannot be assumed that the highlight list size equals the keyphrase list size.

    ExtrGetHighlightListSize may be called repeatedly for a given document. It does not modify the memory at DocumentMemory. ExtrGetHighlightListSize should not be called until after ExtrSignalDocumentEnd has been called.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetHighlightByIndex

    Function header declaration:

    int ExtrGetHighlightByIndex(void *DocumentMemory, int HighlightIndex, void **Highlight);
    

    Input and output function arguments:

    DocumentMemory: input
    HighlightIndex: input
    Highlight:      output
    

    Example of usage:

    void *DocumentMemory;
    void *StopMemory;
    int ErrorCode;
    int BufferLength;
    char DocumentBuffer[300];
    int HighlightIndex;
    char *Highlight;
    
    strcpy(DocumentBuffer, "This is an example of some text.");
    BufferLength = strlen(DocumentBuffer);
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode = ExtrActivateHighlights(DocumentMemory);
    ErrorCode = ExtrReadDocumentBuffer(DocumentMemory, StopMemory,
                    (void *) DocumentBuffer, BufferLength);
    ErrorCode = ExtrSignalDocumentEnd(DocumentMemory, StopMemory);
    
    HighlightIndex = 0;
    
    ErrorCode = ExtrGetHighlightByIndex(DocumentMemory, HighlightIndex, 
                    (void **) &Highlight);
    

    Description:

    A call to ExtrGetHighlightByIndex returns a pointer to a string. The string is highlight number HighlightIndex. HighlightIndex ranges from zero to HighlightListSize minus one. ExtrSignalDocumentEnd must be called before ExtrGetHighlightByIndex.

    The string Highlight may contain single-byte or double-byte characters (see ExtrSetOutputCode). This is why it is of type void **.

    The memory where Highlight is stored will be cleared when ExtrClearDocumentMemory is called. The application should copy Highlight into a more permanent location.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetDocumentProperties

    Function header declaration:

    int ExtrGetDocumentProperties(void *DocumentMemory, int PropID, int *PropValue);
    

    Input and output function arguments:

    DocumentMemory: input
    PropID:         input
    PropValue:      output
    

    Example of usage:

    void *DocumentMemory;
    void *StopMemory;
    int ErrorCode;
    int BufferLength;
    char DocumentBuffer[300];
    int PropID;
    int PropValue;
    
    strcpy(DocumentBuffer, "This is an example of some text.");
    BufferLength = strlen(DocumentBuffer);
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode = ExtrReadDocumentBuffer(DocumentMemory, StopMemory,
                    (void *) DocumentBuffer, BufferLength);
    ErrorCode = ExtrSignalDocumentEnd(DocumentMemory, StopMemory);
    
    PropID = 1;
    
    ErrorCode = ExtrGetDocumentProperties(DocumentMemory, PropID, &PropValue);
    
    

    Description:

    A call to ExtrGetDocumentProperties gets various properties of the document. The following properties are currently defined:

    PropID  Description
    1       get the number of words that were read
    2       get the number of non-stop words (content words) that were read
    3       see whether the whole document was read
            (0 = only the beginning of the document was read; 1 = the whole document was read)

    The desired property is specified by setting PropID. The property value is returned in PropValue.

    The values returned for PropID 1 and 2 depend on the language. For example, a word with an apostrophe counts as two words in French (e.g., "j'ai"), but as one word in English (e.g., "don't"). There are no spaces between words in Japanese, so the values returned for PropID 1 and 2 are rough approximations when the document is in Japanese. If ExtrGetDocumentProperties is called before the language has been determined, the values returned for PropID 1 and 2 will be zero.

    If the document is exceptionally long, Extractor will only read as much of the document as it needs to generate a summary. In this case, PropID 3 will return a value of 0 and PropID 1 and 2 will return values that are less than the actual values for the whole document.
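    Once the three property values have been fetched, an application might interpret them as sketched below. The constant and function names are hypothetical; only the PropID numbers and their meanings come from the table above.

```c
/* Hypothetical names for the PropID values; the API takes a plain int. */
enum {
    PROP_WORDS_READ          = 1,
    PROP_CONTENT_WORDS_READ  = 2,
    PROP_WHOLE_DOCUMENT_READ = 3
};

/* Fraction of the words read that were content (non-stop) words.
   Returns 0.0 when no words were counted, for example when the language
   has not yet been determined. */
double content_word_fraction(int words_read, int content_words_read)
{
    if (words_read <= 0)
        return 0.0;
    return (double) content_words_read / (double) words_read;
}

/* Returns 1 if the word counts cover the whole document, or 0 if only the
   beginning was read and the counts are therefore lower bounds. */
int counts_are_complete(int whole_document_read)
{
    return whole_document_read == 1;
}
```

    Checking counts_are_complete first avoids reporting a partial word count as if it described the entire document.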

    This function is optional. There is no need to call it unless you wish to know one or more of the above properties. The function may be called multiple times, in order to get multiple properties.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrGetErrorMessage

    Function header declaration:

    void ExtrGetErrorMessage(int ErrorCode, char **ErrorMessage);
    

    Input and output function arguments:

    ErrorCode:    input
    ErrorMessage: output
    

    Example of usage:

    void *StopMemory;
    int ErrorCode;
    char *ErrorMessage;
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    
    if (ErrorCode != 0) {
      ExtrGetErrorMessage(ErrorCode, &ErrorMessage);
      printf("Error %d = %s \n", ErrorCode, ErrorMessage);
    }
    

    Description:

    A call to ExtrGetErrorMessage returns a pointer to a character string. The string will contain a short description of the problem, such as, "ERROR: Memory allocation error. Out of RAM."


    ExtrClearDocumentMemory

    Function header declaration:

    int ExtrClearDocumentMemory(void *DocumentMemory);
    

    Input and output function arguments:

    DocumentMemory: input
    

    Example of usage:

    void *DocumentMemory;
    int ErrorCode;
    
    ErrorCode = ExtrCreateDocumentMemory(&DocumentMemory);
    ErrorCode = ExtrClearDocumentMemory(DocumentMemory);
    

    Description:

    A call to ExtrClearDocumentMemory will free the memory that was allocated for processing a given document.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.


    ExtrClearStopMemory

    Function header declaration:

    int ExtrClearStopMemory(void *StopMemory);
    

    Input and output function arguments:

    StopMemory: input
    

    Example of usage:

    void *StopMemory;
    int ErrorCode;
    
    ErrorCode = ExtrCreateStopMemory(&StopMemory);
    ErrorCode = ExtrClearStopMemory(StopMemory);
    

    Description:

    A call to ExtrClearStopMemory will free the memory that was allocated for stop words and stop phrases.

    The function returns an error code in ErrorCode. If ErrorCode is zero, there are no problems. Otherwise, a call to ExtrGetErrorMessage will return an explanation for the given error code.