ULAPI
8.0
A worker that computes the stems of words. For example, the French word "couvent" is both a singular noun and the third person plural form of the verb "couver". Thus, a French ULStemmer object would identify both "couvent n.m." and "couver v." as stems for "couvent".
#include <ulstemmer.h>
Public Member Functions | |
ULStemmer () | |
ULStemmer (const ULStemmer &other) | |
virtual | ~ULStemmer () |
ULStemmer & | operator= (const ULStemmer &other) |
void | clear () |
ULDissector * | getDissector () |
void | setDissector (ULDissector *newDissector) |
const ULLanguage & | getLanguage () const |
virtual bool | isServiceAvailable (const ULServiceDescriptor &service) |
virtual void | getAvailableServices (ULList< ULServiceDescriptor > &services) |
virtual void | setCancelOperation (bool shouldCancel) |
ULError | getAllStems (const ULString &surfaceForm, ULList< ULDerivation > &stemList) |
ULError | getStems (const ULString &surfaceForm, ULList< ULDerivation > &stemList) |
ULError | getStems (const ULString &surfaceForm, const ULPartOfSpeechCategory &category, ULList< ULDerivation > &stemList) |
ULError | getFrequencies (const ULString &surfaceForm, ULList< ULFrequency > &frequencyList) |
Public Member Functions inherited from ULWorker | |
ULWorker () | |
virtual | ~ULWorker () |
virtual bool | shouldCancelOperation () const |
ULStemmer::ULStemmer | ( | ) |
Default constructor.
ULStemmer::ULStemmer | ( | const ULStemmer & | other | ) |
Copy constructor.
virtual ULStemmer::~ULStemmer | ( | ) |
Destructor.
void ULStemmer::clear | ( | ) |
Sets this stemmer to its default state.
ULError ULStemmer::getAllStems | ( | const ULString & | surfaceForm, |
ULList< ULDerivation > & | stemList | ||
) |
Computes the full list of stems for the specified surface form. For example, if the surface form is "thought", then the stem list will include (thought, thought, noun), (thought, think, verb, past participle), (thought, think, verb, past tense, first person singular), (thought, think, verb, past tense, second person singular), etc.
This list can get long, since some surface forms will play many roles for the same root word (as "thought" does for the verb "think"). To get the list of distinct root words without the repetition (e.g. only one stem for (thought, think, verb)), use getStems instead of getAllStems.
[in] | surfaceForm | The word whose stems are sought. |
[out] | stemList | The list of stems. |
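The relationship between the full list and the distinct-root list can be sketched outside ULAPI. The following is a minimal illustration, not the library's implementation: `ToyDerivation`, `distinctRoots`, and `sampleThoughtDerivations` are hypothetical names that stand in for ULDerivation and for the collapsing that getStems is described as performing.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ULDerivation; the real class is not modelled here.
struct ToyDerivation {
    std::string surfaceForm;   // e.g. "thought"
    std::string root;          // e.g. "think"
    std::string partOfSpeech;  // e.g. "verb"
    std::string inflection;    // e.g. "past participle" (empty if none)
};

// Keeps the first derivation seen for each (root, part of speech) pair,
// preserving input order. This mimics the "one entry per root word" idea.
std::vector<ToyDerivation> distinctRoots(const std::vector<ToyDerivation>& all) {
    std::set<std::pair<std::string, std::string>> seen;
    std::vector<ToyDerivation> result;
    for (const ToyDerivation& d : all) {
        if (seen.emplace(d.root, d.partOfSpeech).second) {
            result.push_back(d);
        }
    }
    return result;
}

// Sample data mirroring the "thought" example in the text above.
std::vector<ToyDerivation> sampleThoughtDerivations() {
    return {
        {"thought", "thought", "noun", ""},
        {"thought", "think", "verb", "past participle"},
        {"thought", "think", "verb", "past tense, first person singular"},
        {"thought", "think", "verb", "past tense, second person singular"},
    };
}
```

Calling `distinctRoots(sampleThoughtDerivations())` reduces the four sample entries to two, one for the noun "thought" and one for the verb "think".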
virtual void ULStemmer::getAvailableServices | ( | ULList< ULServiceDescriptor > & | services | ) |
ULDissector * ULStemmer::getDissector | ( | ) |
ULError ULStemmer::getFrequencies | ( | const ULString & | surfaceForm, |
ULList< ULFrequency > & | frequencyList | ||
) |
ULLanguageDataSource objects may contain frequency data of the form (word, root, part-of-speech, count). These data come from manually tagged corpora similar to the American National Corpus or the Penn Treebank.
This method returns a list of frequency objects corresponding to the specified word. For example, the word "chairs" might yield ("chairs", "chair", verb, 21), ("chairs", "chair", noun, 623), and ("chairs", "chair", unknown, 2).
The method performs its search in a case-insensitive and accent-insensitive way.
[in] | surfaceForm | the word whose frequencies are sought |
[out] | frequencyList | the corresponding frequencies, sorted in decreasing order of frequency |
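The lookup and ordering described above can be sketched with standard C++ only. This is an illustration under stated assumptions, not the ULAPI implementation: `ToyFrequency`, `toLowerAscii`, `lookUpFrequencies`, and `sampleFrequencyTable` are hypothetical names, the "tables" entry is made-up sample data, and only ASCII case folding is shown (accent folding is left out).

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for ULFrequency: (word, root, part of speech, count).
struct ToyFrequency {
    std::string word;
    std::string root;
    std::string partOfSpeech;
    int count;
};

// ASCII-only lowercasing; real accent-insensitive matching needs Unicode folding.
std::string toLowerAscii(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}

// Returns the entries whose word matches surfaceForm, ignoring ASCII case,
// sorted in decreasing order of count.
std::vector<ToyFrequency> lookUpFrequencies(const std::vector<ToyFrequency>& table,
                                            const std::string& surfaceForm) {
    const std::string key = toLowerAscii(surfaceForm);
    std::vector<ToyFrequency> matches;
    for (const ToyFrequency& entry : table) {
        if (toLowerAscii(entry.word) == key) {
            matches.push_back(entry);
        }
    }
    std::stable_sort(matches.begin(), matches.end(),
                     [](const ToyFrequency& a, const ToyFrequency& b) {
                         return a.count > b.count;
                     });
    return matches;
}

// Sample table: the "chairs" rows come from the example in the text,
// the "tables" row is invented for illustration.
std::vector<ToyFrequency> sampleFrequencyTable() {
    return {
        {"chairs", "chair", "verb", 21},
        {"chairs", "chair", "noun", 623},
        {"chairs", "chair", "unknown", 2},
        {"tables", "table", "noun", 410},
    };
}
```

With this sample table, looking up "CHAIRS" returns the three "chairs" rows with the noun reading first, since it has the highest count.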
const ULLanguage & ULStemmer::getLanguage | ( | ) | const |
ULError ULStemmer::getStems | ( | const ULString & | surfaceForm, |
ULList< ULDerivation > & | stemList | ||
) |
Computes the list of root words for the specified surface form. For example, if the surface form is "thought", then the stem list will consist of (thought, thought, noun) and (thought, think, verb). Note that unlike getAllStems, this method returns exactly one ULDerivation object per root word, and is thus usually more useful and easier to use than getAllStems.
[in] | surfaceForm | The word whose stems are sought. |
[out] | stemList | The list of stems. |
ULError ULStemmer::getStems | ( | const ULString & | surfaceForm, |
const ULPartOfSpeechCategory & | category, | ||
ULList< ULDerivation > & | stemList | ||
) |
Computes the list of root words for the specified surface form, restricted to the specified part of speech category. For example, if the surface form is "thought" and the category is verb, then the stem list will consist of only (thought, think, verb). Note that unlike getAllStems, this method returns exactly one ULDerivation object per root word.
Note that the part of speech category restriction applies to the surface form, not necessarily to the root. For example, the surface form "baker" is a noun, but it can be stemmed to the root "bake", which is a verb. If we call getStems("baker", ULPartOfSpeechCategory::Verb, stemList), we will get no results, since "baker" is not a verb. If we call getStems("baker", ULPartOfSpeechCategory::Noun, stemList), however, we will get the verb "bake" as our stem.
[in] | surfaceForm | The word whose stems are sought. |
[in] | category | The desired part of speech category. |
[out] | stemList | The list of stems. |
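The "baker" case is easy to get backwards, so here is a small sketch of the filtering rule in isolation. It is a toy model, not ULAPI code: `ToyCategory`, `ToyStemEntry`, `stemsForCategory`, and `sampleLexicon` are hypothetical names, and the lexicon is hard-coded for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for ULPartOfSpeechCategory.
enum class ToyCategory { Noun, Verb };

// One toy lexicon entry: the surface form and its category, plus the root
// it stems to and the root's own category.
struct ToyStemEntry {
    std::string surfaceForm;
    ToyCategory surfaceCategory;
    std::string root;
    ToyCategory rootCategory;
};

// Filters on the category of the SURFACE FORM, never on the root's category.
std::vector<ToyStemEntry> stemsForCategory(const std::vector<ToyStemEntry>& lexicon,
                                           const std::string& surfaceForm,
                                           ToyCategory category) {
    std::vector<ToyStemEntry> result;
    for (const ToyStemEntry& entry : lexicon) {
        if (entry.surfaceForm == surfaceForm && entry.surfaceCategory == category) {
            result.push_back(entry);
        }
    }
    return result;
}

// Sample entries mirroring the "baker" and "thought" examples in the text.
std::vector<ToyStemEntry> sampleLexicon() {
    return {
        {"baker", ToyCategory::Noun, "bake", ToyCategory::Verb},
        {"thought", ToyCategory::Noun, "thought", ToyCategory::Noun},
        {"thought", ToyCategory::Verb, "think", ToyCategory::Verb},
    };
}
```

Here, asking for "baker" as a verb yields nothing, while asking for "baker" as a noun yields the root "bake", whose own category is verb.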
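virtual bool ULStemmer::isServiceAvailable | ( | const ULServiceDescriptor & | service | ) |
virtual void ULStemmer::setCancelOperation | ( | bool | shouldCancel | ) |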
Setter for the long-operation cancellation boolean attribute.
[in] | shouldCancel | True if the current long-running operation should be cancelled. |
Reimplemented from ULWorker.
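A cancellation flag of this kind is typically polled from inside a long-running loop. The sketch below shows that pattern with a stand-in class. It is not the ULWorker implementation: `ToyWorker` and `processItems` are hypothetical, and the use of `std::atomic<bool>` and the "check before each item" policy are assumptions made for illustration. Only the method names setCancelOperation and shouldCancelOperation come from the listing above.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for a worker with a cancellation flag.
class ToyWorker {
public:
    virtual ~ToyWorker() = default;

    // Requests (true) or clears (false) cancellation of the long operation.
    virtual void setCancelOperation(bool shouldCancel) {
        cancelRequested_.store(shouldCancel);
    }

    virtual bool shouldCancelOperation() const {
        return cancelRequested_.load();
    }

    // Processes up to itemCount items, checking the flag before each one.
    // onItem is called with the index of the item being processed.
    // Returns the number of items actually processed.
    std::size_t processItems(std::size_t itemCount,
                             const std::function<void(std::size_t)>& onItem) {
        std::size_t processed = 0;
        while (processed < itemCount && !shouldCancelOperation()) {
            onItem(processed);
            ++processed;
        }
        return processed;
    }

private:
    std::atomic<bool> cancelRequested_{false};  // assumption: atomic for cross-thread use
};
```

In this sketch the flag stays set until it is cleared explicitly, so a caller that reuses the worker would call `setCancelOperation(false)` before starting the next operation.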
void ULStemmer::setDissector | ( | ULDissector * | newDissector | ) |
Sets the ULDissector to be used by this stemmer to perform stemming operations. This ULStemmer does not take responsibility for deleting the dissector. That will need to happen elsewhere (typically the ULFactory will take care of it if your application uses ULFactory to instantiate data sources and workers).
[in] | newDissector | A pointer to the desired dissector. |
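Because the stemmer holds a non-owning pointer, the caller has to keep the dissector alive for as long as the stemmer may use it. One way to arrange that without ULFactory is sketched below. `ToyDissector`, `ToyStemmer`, and `ToyStemmingContext` are hypothetical stand-ins, and the null default for an unset dissector is an assumption. Only the setDissector and getDissector method names come from the listing above.

```cpp
#include <memory>

// Hypothetical stand-in for ULDissector.
class ToyDissector {};

// Hypothetical stand-in for ULStemmer: stores the pointer, never deletes it.
class ToyStemmer {
public:
    void setDissector(ToyDissector* newDissector) { dissector_ = newDissector; }
    ToyDissector* getDissector() { return dissector_; }

private:
    ToyDissector* dissector_ = nullptr;  // non-owning; assumed null until set
};

// Owns the dissector and holds the stemmer that borrows it.
// Members are destroyed in reverse declaration order, so the stemmer
// goes away before the dissector it points to.
struct ToyStemmingContext {
    std::unique_ptr<ToyDissector> dissector = std::make_unique<ToyDissector>();
    ToyStemmer stemmer;

    ToyStemmingContext() { stemmer.setDissector(dissector.get()); }
};
```

Declaring the owning `std::unique_ptr` before the stemmer is the design choice that matters here: it guarantees the borrowed pointer never outlives its target within the context object.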