CN102081602B - Method and equipment for determining category of unlisted word

Info

Publication number: CN102081602B
Application number: CN200910252923.5A
Authority: CN
Inventors: 胡长建; 赵凯; 邱立坤
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd; Renesas Electronics China Co Ltd
Priority date: 2009-11-30
Filing date: 2009-11-30
Publication date: 2014-01-01
Anticipated expiration: 2029-11-30
Also published as: JP5216063B2; JP2011118872A; KR20110060806A; CN102081602A; KR101195341B1

Abstract

Embodiments of the invention disclose a method and equipment for determining the category of an unlisted word. The method may comprise the following steps: selecting synonyms of the unlisted word from a dictionary based on a word-building rule; generating a context of the unlisted word from a corpus; and determining the category of the unlisted word according to the context and the synonyms of the unlisted word. The method and the equipment can determine the category of an unlisted word more efficiently and accurately.

Description

Method and apparatus for determining the category of an unregistered word

Technical field

The present invention relates generally to the field of information processing, and in particular to a method and apparatus for determining the category of an unregistered word (unknown word).

Background technology

With the widespread adoption of the Internet and the growing informatization of society, the volume of text information keeps increasing, and so does the demand for processing it. People increasingly wish to communicate with computers in natural language and to process massive amounts of text by automated means. To process text well, a large body of linguistic knowledge bases, such as dictionaries, must be accumulated. Yet dictionaries, one of the most important tools for text processing, are usually compiled by hand, which is time-consuming and inefficient. Moreover, in word segmentation, segmentation errors on unregistered words greatly reduce the overall recall of segmentation and in turn degrade the accuracy of subsequent syntactic and semantic analysis, which makes information processing harder. In other information processing technologies, such as information extraction, if the attributes of an unregistered word are not clear, the extraction results will be ambiguous or even wrong because the unregistered word and its information are incomplete. Determining the category of unregistered words has therefore become a pressing problem.

Chinese patent application publication CN1717679 discloses a part-of-speech tagging method. The method tags a passage as a whole, mainly using a keyword-to-part-of-speech store recorded in advance. If the passage contains a particular keyword, the passage is tagged with the part of speech corresponding to that keyword.

U.S. patent application publication US20060100856 A1 discloses a word-sense guessing method. Its basic idea is to extract usage examples of each new word through a Web search and, based on those examples, to extract candidate word-sense classes according to an existing use-case dictionary. If there is more than one candidate, the candidate word-sense class with the highest co-occurrence rate with the new word in a specific corpus is selected.

Chinese patent application publication CN1369877 discloses a method for guessing the class of a new word. The method first determines a separate probability for each character in the new word. It then combines the character probabilities on a per-part-of-speech basis to form a separate total probability for each part of speech. Each total probability is compared with a threshold, and every part of speech whose probability exceeds the threshold is added as a possible class of the multi-character word.

Xiaofei Lu, in "Hybrid Models for Semantic Classification of Chinese Unknown Words", NAACL HLT 2007, pages 188-195, discloses a hybrid semantic-class guessing method that combines manually created rules, a statistical method, and a context-based method. The rules and the statistical method supply candidate semantic classes to the context-based method.

Chen, H.-H. and C.-C. Lin, in "Sense-tagging Chinese Corpus", 2000, In Proceedings of the 2nd Chinese Language Processing Workshop, pages 7-14, disclose a method that tags word-sense classes through translation between Chinese and English dictionaries. The method comprises four basic steps: 1) given a new word, look up all of its possible English translations in a given Chinese-English dictionary; 2) look up the corresponding sense entries for all the translations in WordNet; 3) consult a mapping table to map the sense entries obtained in step 2 to the sense labels of Cilin; 4) use a word sense disambiguation method to select one of the sense labels obtained in step 3 as the final result.

However, none of the current techniques can effectively determine the category of an unregistered word so as to solve the automatic tagging problem. The prior art generally relies on a pre-compiled dictionary to analyze the class of a new word, so the soundness of the tagging results depends on how the corresponding dictionary or knowledge base was built, and performance is comparatively low.

Therefore, an efficient, well-performing technical solution for determining the category of an unregistered word is needed.

Summary of the invention

In view of the above problems in the prior art, one object of the present invention is to provide a method and apparatus for determining the category of an unregistered word.

According to a first aspect of the invention, a method for determining the category of an unregistered word is provided. The method may comprise: selecting synonyms of the unregistered word from a dictionary based on a word-building rule; generating a context of the unregistered word from a corpus; and determining the category to which the unregistered word belongs according to the context and the synonyms of the unregistered word.

According to a second aspect of the invention, an apparatus for determining the category of an unregistered word is provided. The apparatus may comprise: a synonym selector configured to select synonyms of the unregistered word from a dictionary based on a word-building rule; a context generator configured to generate a context of the unregistered word from a corpus; and a category determiner configured to determine the category to which the unregistered word belongs according to the context and the synonyms of the unregistered word.

Other features and advantages of the present invention will become apparent from the following description of preferred embodiments, taken in conjunction with the accompanying drawings.

Brief description of the drawings

Other objects and effects of the present invention will become clearer and easier to understand from the following description taken in conjunction with the accompanying drawings, and as the invention is more fully understood, wherein:

Fig. 1 is a block diagram of an apparatus for determining the category of an unregistered word according to one embodiment of the present invention;

Fig. 2 is a flowchart of a method for determining the category of an unregistered word according to one embodiment of the present invention;

Fig. 3 is a flowchart of a method for determining the category of an unregistered word according to another embodiment of the present invention;

Fig. 4 is a flowchart of a method for determining the category of an unregistered word according to a further embodiment of the present invention; and

Fig. 5 is a flowchart of a method for determining the category of an unregistered word according to yet another embodiment of the present invention.

In all the above drawings, the same reference numerals denote the same, similar, or corresponding features or functions.

Embodiments

The present invention is explained and illustrated in more detail below with reference to the drawings. It should be understood that the drawings and embodiments are exemplary only and are not intended to limit the scope of the invention.

For clarity, the terms used in the present invention are explained first.

1. Dictionary

A dictionary is a lexicon that contains the core vocabulary of the language to be processed, generally with more than 50,000 entries; examples include Cilin, HowNet, and WordNet. A dictionary may contain one or more words, and for each word, information such as its part of speech, category, sense, and example sentences may be annotated. Table 1 gives an example of a dictionary data structure, showing three words, "Beijing", "health products", and "happy", each with its own part of speech and category.

Table 1

No.	Word	Part of speech	Category
1	Beijing	Noun	City
2	Health products	Noun	Material
3	Happy	Adjective	Emotion
...	...	...	...
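
As a rough illustration of the dictionary structure in Table 1, the entries could be held in a simple keyed record store. This is only a sketch: the class name `Entry`, its field names, and the helper `lookup` are assumptions made for the example and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Entry:
    """One dictionary row; the field names are illustrative assumptions."""
    word: str
    part_of_speech: str
    category: str


# The three rows of Table 1, keyed by word for direct lookup.
DICTIONARY: Dict[str, Entry] = {
    e.word: e
    for e in (
        Entry("Beijing", "noun", "city"),
        Entry("health products", "noun", "material"),
        Entry("happy", "adjective", "emotion"),
    )
}


def lookup(word: str) -> Optional[Entry]:
    """Return the entry for a word, or None if the dictionary does not contain it."""
    return DICTIONARY.get(word)
```

A word for which `lookup` returns `None` is one that this dictionary does not include.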

2. Corpus

A corpus (rendered elsewhere as "collected works") is a collection of free text; the free text may be sentences, fragments, articles, and the like, in any combination.

3. Character, immediate constituent, and word

A character is the smallest unit of text. For example, in Chinese, "sky", "I", and "good" are each written as a single character.

Immediate constituent: the smaller units that make up a larger unit are called constituents of the larger unit; correspondingly, the smaller units that directly make up a larger unit are called its immediate constituents. The immediate constituents of a word may be morphemes or words smaller than that word. For example, the immediate constituents of "Ministry of Science and Technology" are "science", "technology", and "ministry", while the immediate constituents of "ice crystal" are "ice" and "crystal".

A word is a string with a certain meaning made up of one or more characters. For example, in Chinese, "we" is a word of two characters, and "computer" is a word of three characters.

4. Unregistered word

An unregistered word is a word that is not included in the current dictionary.

5. Category

Categories may include semantic classes and superclasses (supersenses), which are broader in scope than semantic classes.

A semantic class is, for example, "city" or "mood". One semantic class may contain multiple words; for example, the words "Beijing" and "Shanghai" may both belong to the semantic class "city". One word may have multiple semantic classes; for example, the word "arm" may have the two semantic classes "body part" and "person".

A superclass is a category broader than a semantic class, such as "place" or "material"; for instance, the superclass "place" is broader in scope than the semantic class "city".

The present invention relates to a method for determining the category of an unregistered word. The method may comprise: selecting synonyms of the unregistered word from a dictionary based on a word-building rule; generating a context of the unregistered word from a corpus; and determining the category to which the unregistered word belongs according to the context and the synonyms of the unregistered word.

According to one embodiment of the present invention, the synonyms of the unregistered word may be selected from the dictionary based on the word-building rule by selecting, from the dictionary, words that share one or more combining forms with the unregistered word as its synonyms. According to another embodiment, the selection may be completed as follows: determining the part of speech of the unregistered word; selecting from the dictionary words that share one or more combining forms with the unregistered word; and selecting, among the selected words, those with the same part of speech as the unregistered word as its synonyms.

According to one embodiment of the present invention, the context of the unregistered word may be generated from the corpus as follows: searching for the unregistered word in the corpus; extracting the characters adjacent to the unregistered word by windowing; segmenting the extracted adjacent characters into words; and determining the weight of each word obtained by segmentation, so that the words and their weights are used as the context of the unregistered word. According to another embodiment, the context may be generated as follows: searching for the unregistered word in the corpus; and analyzing the dependency relations of the unregistered word by means of a dependency tree, so that the dependency relations are used as the context of the unregistered word.

According to one embodiment of the present invention, determining the category to which the unregistered word belongs according to its context and synonyms may comprise: tallying the categories to which the synonyms belong; generating from the corpus the contexts of all words contained in each category, as the context of that category; calculating the similarity between the context of the unregistered word and the context of each category; and determining the category corresponding to the greatest similarity as the category of the unregistered word. According to another embodiment, the determining may comprise: generating the contexts of the synonyms from the corpus; calculating the similarity between the context of the unregistered word and the context of each synonym; extracting a set from the synonyms according to the calculated similarities; summing the similarities corresponding to the synonyms in the extracted set that belong to the same category; and determining the category of the unregistered word according to the summed similarities. According to yet another embodiment, the determining may comprise: generating the contexts of the synonyms from the corpus; calculating the similarity between the context of the unregistered word and the context of each synonym; tallying the categories to which the synonyms belong; receiving predetermined weight factors associated with the synonyms; weighting, with the received predetermined weight factors, the similarities corresponding to the associated synonyms; extracting a set from the synonyms according to the weighted similarities; summing the weighted similarities corresponding to the synonyms in the extracted set that belong to the same category; and determining the category of the unregistered word according to the summed similarities.

Each embodiment of the present invention is described in detail below.

Fig. 1 is a block diagram of an apparatus 100 for determining the category of an unregistered word according to one embodiment of the present invention.

The apparatus 100 may comprise a synonym selector 110, a context generator 120, and a category determiner 130. The synonym selector 110 may select synonyms of the unregistered word from a dictionary based on a word-building rule. The context generator 120 may generate the context of the unregistered word from a corpus. The category determiner 130 may determine the category to which the unregistered word belongs according to the context and the synonyms of the unregistered word.

According to one embodiment of the present invention, the synonym selector 110 may comprise means for selecting, from the dictionary, words that share one or more combining forms with the unregistered word as synonyms of the unregistered word. According to another embodiment, the synonym selector 110 may comprise: means for determining the part of speech of the unregistered word; means for selecting from the dictionary words that share one or more combining forms with the unregistered word; and means for selecting, among the selected words, the words having the same part of speech as the unregistered word as synonyms of the unregistered word.

According to one embodiment of the present invention, the context generator 120 may comprise: means for searching for the unregistered word in the corpus; means for extracting the characters adjacent to the unregistered word by windowing; means for segmenting the extracted adjacent characters into words; and means for determining the weight of each word obtained by segmentation, so that the words and their weights are used as the context of the unregistered word.

According to another embodiment of the present invention, the context generator 120 may comprise: means for searching for the unregistered word in the corpus; and means for analyzing the dependency relations of the unregistered word by means of a dependency tree, so that the dependency relations are used as the context of the unregistered word.

According to one embodiment of the present invention, the context generator 120 may further comprise means for generating the contexts of the synonyms from the corpus.

According to one embodiment of the present invention, the category determiner 130 may comprise: means for tallying the categories to which the synonyms belong; means for generating from the corpus the contexts of all words contained in each category, as the context of that category; means for calculating the similarity between the context of the unregistered word and the context of each category; and means for determining the category corresponding to the greatest similarity as the category of the unregistered word.

According to another embodiment of the present invention, the category determiner 130 may comprise: means for calculating the similarity between the context of the unregistered word and the context of each synonym; means for extracting a set from the synonyms according to the similarities; means for summing the similarities corresponding to the synonyms in the extracted set that belong to the same category; and means for determining the category of the unregistered word according to the summed similarities. In one embodiment, the means for determining the category according to the summed similarities may execute the K-nearest-neighbor algorithm.

According to a further embodiment of the present invention, the category determiner 130 may comprise: means for calculating the similarity between the context of the unregistered word and the context of each synonym; means for tallying the categories to which the synonyms belong; means for receiving predetermined weight factors associated with the synonyms; means for weighting, with the received predetermined weight factors, the similarities corresponding to the associated synonyms; means for extracting a set from the synonyms according to the similarities; means for summing the weighted similarities corresponding to the synonyms in the extracted set that belong to the same category; and means for determining the category of the unregistered word according to the summed similarities. In one embodiment, the predetermined weight factors are assigned according to the following strategy: if the unregistered word and a word in a category share the last character and the penultimate character, the predetermined weight factor associated with the category is set to λ₁; otherwise, if they share the first character and the last character, it is set to λ₂; otherwise, if they share only the first character or only the last character, it is set to λ₃; otherwise, it is set to λ₄, where λ₁ ≥ λ₂ ≥ λ₃ ≥ λ₄. In one embodiment, the means for extracting a set from the synonyms according to the similarities may comprise: means for sorting the similarities by magnitude; and means for extracting into the set the synonyms corresponding to a predetermined number of top-ranked similarities.
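
The weight-factor strategy above can be sketched as a small cascade of position checks. This is a minimal sketch under stated assumptions: "sharing" a character is read here as having the same character at the same position (first, last, or penultimate), the function name `weight_factor` is hypothetical, and the default values for λ₁ to λ₄ are placeholders chosen only to satisfy λ₁ ≥ λ₂ ≥ λ₃ ≥ λ₄; the patent does not give numeric values.

```python
from typing import Tuple


def weight_factor(
    unregistered: str,
    word: str,
    lambdas: Tuple[float, float, float, float] = (1.0, 0.8, 0.6, 0.4),
) -> float:
    """Pick a weight by which positions the two non-empty words have in common.

    Assumption: characters are compared position by position.
    The lambda values are placeholders, not values from the patent.
    """
    l1, l2, l3, l4 = lambdas
    same_first = unregistered[0] == word[0]
    same_last = unregistered[-1] == word[-1]
    same_penultimate = (
        len(unregistered) >= 2 and len(word) >= 2 and unregistered[-2] == word[-2]
    )
    if same_last and same_penultimate:
        return l1
    if same_first and same_last:
        return l2
    if same_first or same_last:
        return l3
    return l4
```

The checks are ordered from the strongest overlap to the weakest, so the first matching rule decides the weight, which mirrors the "otherwise" chain in the text.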

Fig. 2 is a flowchart of a method for determining the category of an unregistered word according to one embodiment of the present invention.

In step 201, synonyms of the unregistered word are selected from the dictionary based on a word-building rule.

According to one embodiment of the present invention, the word-building rule may cover combining forms, constituent attributes, and constituent relations. Combining forms may include the characters and/or immediate constituents that make up a word; constituent attributes may include the label, length, part of speech, and the like of a word; constituent relations may include the relations between the constituents of a word, such as coordination, modification, and restriction.

In one example, words that share one or more characters and/or immediate constituents with the unregistered word may be selected from the dictionary and used as its synonyms. For example, suppose the unregistered word consists of two characters, glossed here as "base" and "people". Suppose the dictionary words containing the character "base" are "basis", "substantially", "founder", and "ground", and the dictionary words containing the character "people" are "people" and "democracy". All of these words are regarded as synonyms of the unregistered word, so the synonym set = {"basis", "substantially", "founder", "ground", "people", "democracy"}. The shared characters are visible in the Chinese original rather than in these English glosses. The embodiment shown in Fig. 3 uses this approach.

In another example, the part of speech of the unregistered word, such as noun, adjective, or verb, may be determined first; then, among the words selected from the dictionary that share one or more characters and/or immediate constituents with the unregistered word, those having the same part of speech as the unregistered word are selected as its synonyms. The embodiments shown in Fig. 4 and Fig. 5 use this approach.
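
The part-of-speech variant can be sketched as a two-stage filter: first keep dictionary words that share a character with the unregistered word, then keep those with the same part of speech. The sketch below makes several assumptions: the Chinese word forms and their tags in `DICTIONARY_POS` are hypothetical sample data, the function name `select_synonyms_by_pos` is invented for the example, and the part of speech of the unregistered word is supplied by the caller because the text does not say how it is determined.

```python
from typing import Dict, List

# Hypothetical sample dictionary: word -> part of speech.
DICTIONARY_POS: Dict[str, str] = {
    "冰刀": "noun",
    "冰冷": "adjective",
    "冰柜": "noun",
    "晶体": "noun",
    "北京": "noun",
}


def select_synonyms_by_pos(
    unregistered: str, unregistered_pos: str, dictionary_pos: Dict[str, str]
) -> List[str]:
    """Keep words that share a character with the unregistered word and match its part of speech."""
    chars = set(unregistered)
    sharing = [w for w in dictionary_pos if chars & set(w)]
    return [w for w in sharing if dictionary_pos[w] == unregistered_pos]


noun_synonyms = select_synonyms_by_pos("冰晶", "noun", DICTIONARY_POS)
```

Here the adjective sharing the first character is dropped by the second stage, and the word sharing no character never reaches it.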

In step 202, the context of the unregistered word is generated from the corpus.

According to one embodiment of the present invention, the context of a word may be generated by windowing, by a dependency tree, or in other ways known to those skilled in the art.

The following example describes how the context of a given word is obtained from the corpus by windowing. Suppose the given word is "we", the corpus contains multiple sentences, one of which means roughly "we must take good hold of the life path of each of us", and the window size is set to 6 characters. The English glosses below only approximate the Chinese characters that the window counts.

First, the word is searched for in the corpus. In this example, the sentence above is found to contain the word "we".

Next, the characters adjacent to the word "we" are extracted by windowing. In a sentence or paragraph of the corpus in which the word occurs, a window of size 6 is delimited so as to cover the word. The window may, for example, be centered on the word, taking the 3 characters immediately before it ("good hold") and the 3 characters immediately after it ("each person"); it may start at the word and take the 6 characters immediately after it ("each person's life"); it may end at the word and take the 6 characters immediately before it ("must take good hold"); or it may take the 1 or 2 characters immediately before the word and the 5 or 4 characters immediately after it; and so on.

After a number of characters equal to the window size has been extracted, the extracted characters adjacent to the word are segmented into words. For example, when the window is centered on the word "we", the two groups of characters obtained are "good hold" and "each person"; segmenting these two groups may yield, for example, the following result: "good", "hold", "each", "person".
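
The windowing and segmentation steps above can be sketched as follows. Everything concrete here is an assumption made for illustration: the Chinese sentence and the two-entry lexicon are hypothetical reconstructions of the glossed example, the function names are invented, and the toy forward-maximum-matching segmenter merely stands in for whatever word segmenter an implementation would actually use.

```python
from typing import List, Set, Tuple


def window_context(text: str, target: str, before: int, after: int) -> Tuple[str, str]:
    """Return the characters just before and just after the first occurrence of target."""
    start = text.find(target)
    if start < 0:
        raise ValueError("target does not occur in text")
    end = start + len(target)
    return text[max(0, start - before):start], text[end:end + after]


def segment(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Toy forward maximum matching; unknown characters become one-character tokens."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical reconstruction of the example sentence and a tiny lexicon.
SENTENCE = "一定要好好把握我们每个人的人生道路"
LEXICON = {"把握", "每个"}

# Window of 6 centered on the word: 3 characters before, 3 after.
left, right = window_context(SENTENCE, "我们", before=3, after=3)
context_words = segment(left, LEXICON) + segment(right, LEXICON)
```

Changing `before` and `after` to 0 and 6, or 6 and 0, gives the other window placements mentioned in the text.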

Then, the weight of each word obtained by segmentation is determined. The segmentation result has a corresponding vector <v₁, v₂, ..., vₙ>, where n is the number of words in the segmentation result (there are 4 in the example above, so n = 4) and vᵢ is the weight of the i-th word (i = 1...n). The weight can be calculated in several ways, for example TF-IDF (term frequency × inverse document frequency), BOOL (whether the word occurs), IDF (inverse document frequency), and PMI (pointwise mutual information). In general, the number of times a context word occurs contributes little to judging the sense of a word, whereas whether it occurs at all is decisive; therefore, in a preferred embodiment of the present invention, IDF may be used to calculate the weights.
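
A minimal sketch of IDF weighting for the context words follows. The text names IDF but does not give a formula, so the smoothed form log((1 + N) / (1 + df)) used here is an assumption, as are the function name `idf_weights` and the tiny sample corpus of already-segmented documents.

```python
import math
from typing import Dict, Iterable, List, Set


def idf_weights(context_words: Iterable[str], documents: List[Set[str]]) -> Dict[str, float]:
    """Weight each context word by a smoothed inverse document frequency.

    Assumed formula: log((1 + N) / (1 + df)), where N is the number of
    documents and df is the number of documents containing the word.
    """
    n_docs = len(documents)
    weights: Dict[str, float] = {}
    for word in context_words:
        df = sum(1 for doc in documents if word in doc)
        weights[word] = math.log((1 + n_docs) / (1 + df))
    return weights


# Hypothetical corpus: each document is the set of words it contains.
DOCUMENTS = [{"好", "把握"}, {"好", "人"}, {"好", "每个"}]
weights = idf_weights(["好", "把握", "每个", "人"], DOCUMENTS)
```

With this weighting, a word that occurs in every document gets weight zero, while a word that occurs in few documents gets a larger weight, regardless of how often it repeats inside one document.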

Through the above process, each word obtained by segmentation and its weight are obtained, and these words and weights can be used as the context of the originally given word.

Alternatively, the unregistered word may be searched for in the corpus and analyzed by means of a dependency tree, and the dependency relations obtained by the analysis are used as the context of the given word.

With the context generation methods described above, the context of the unregistered word can be obtained.

In step 203, the category to which the unregistered word belongs is determined according to the context and the synonyms of the unregistered word.

This determination can be implemented in several ways. The detailed description of Fig. 3 to Fig. 5 below gives several specific implementations of determining the category of the unregistered word according to its context and synonyms.

In the embodiment shown in Fig. 3, the synonyms of the unregistered word are first tallied to determine which categories they belong to. Next, the context of each category is generated; the context of a category is obtained from the contexts, generated from the corpus, of all the words that the category contains. Then, a similarity calculation method known or commonly used in the prior art is used to calculate the similarity between the context of the unregistered word and the context of each category. Finally, the category corresponding to the greatest similarity is determined as the category of the unregistered word.
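
The final two steps of this embodiment, computing similarities and picking the best category, can be sketched as below. The text leaves the similarity measure open, so cosine similarity over sparse weight vectors is an assumption here, and the context vectors, category labels, and function names are invented sample data.

```python
import math
from typing import Dict

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_category(word_context: Vector, category_contexts: Dict[str, Vector]) -> str:
    """Return the category whose context is most similar to the word's context."""
    return max(category_contexts, key=lambda c: cosine(word_context, category_contexts[c]))


# Made-up context vectors, for illustration only.
word_context = {"cold": 1.0, "cloud": 1.0}
category_contexts = {
    "C3": {"mineral": 1.0, "cold": 0.2},
    "C4": {"cold": 1.0, "cloud": 0.5},
}
chosen = best_category(word_context, category_contexts)
```

Any other similarity measure could be swapped in for `cosine` without changing the selection step.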

In the embodiment shown in Fig. 4, the contexts of the synonyms are first generated from the corpus, which can be done in the same way as generating the context of the unregistered word in step 202. Next, the similarity between the context of the unregistered word and the context of each synonym is calculated. According to the calculated similarities, a set is extracted from the synonyms of the unregistered word; this set may contain a predetermined number of synonyms. Then, the similarities corresponding to the synonyms in the extracted set that belong to the same category are summed. Finally, the category of the unregistered word is determined according to the summed similarities. In the embodiment shown in Fig. 4, the K-nearest-neighbor (KNN) algorithm, for example, or other methods known to those skilled in the art may be used.
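
The extraction and summation steps of this embodiment can be sketched as a similarity-weighted nearest-neighbor vote. The similarity scores, the Chinese word forms, the category assignments, and the choice k = 3 are all made-up values for illustration, and choosing the category with the largest sum is an assumption about how the summed similarities are used.

```python
from collections import defaultdict
from typing import Dict, Tuple


def knn_category(
    similarity: Dict[str, float], category_of: Dict[str, str], k: int
) -> Tuple[str, Dict[str, float]]:
    """Keep the k most similar synonyms, sum similarities per category, pick the largest sum."""
    top = sorted(similarity, key=similarity.get, reverse=True)[:k]
    totals: Dict[str, float] = defaultdict(float)
    for synonym in top:
        totals[category_of[synonym]] += similarity[synonym]
    return max(totals, key=totals.get), dict(totals)


# Made-up similarity scores between the unregistered word and each synonym.
SIMILARITY = {
    "晶体": 0.50, "冰雪": 0.45, "冰雨": 0.40, "晶粒": 0.30,
    "水晶": 0.20, "冰柜": 0.15, "冰刀": 0.10,
}
CATEGORY_OF = {
    "冰刀": "C1", "冰柜": "C2", "冰雨": "C4", "冰雪": "C4",
    "晶体": "C3", "晶粒": "C3", "水晶": "C3",
}
best, totals = knn_category(SIMILARITY, CATEGORY_OF, k=3)
```

In this made-up data the single most similar synonym belongs to C3, yet C4 wins because two of the three nearest synonyms fall in it, which is the effect summing per category is meant to capture.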

In the embodiment shown in Fig. 5, the contexts of the synonyms are first generated from the corpus and the similarity between the context of the unregistered word and the context of each synonym is calculated; the calculated similarities are then weighted with weight factors to obtain refined similarities; and the category of the unregistered word is then determined according to the refined similarities. Specifically, the contexts of the synonyms are generated from the corpus; the similarity between the context of the unregistered word and the context of each synonym is calculated; the categories to which the synonyms belong are tallied; predetermined weight factors associated with the synonyms are received; the similarities corresponding to the associated synonyms are weighted with the received predetermined weight factors; a set, which may contain a predetermined number of synonyms, is extracted from the synonyms of the unregistered word according to the weighted similarities; the weighted similarities corresponding to the synonyms in this set that belong to the same category are summed; and the category of the unregistered word is determined according to the summed similarities.
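
The weighted variant differs from the unweighted vote only in that each similarity is multiplied by its weight factor before ranking and summing. The sketch below uses made-up similarities, weights, and category labels, and it assumes that the category with the largest weighted sum is chosen; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict


def weighted_knn_category(
    similarity: Dict[str, float],
    weight: Dict[str, float],
    category_of: Dict[str, str],
    k: int,
) -> str:
    """Weight each similarity, keep the top k, sum per category, return the best category."""
    weighted = {s: similarity[s] * weight.get(s, 1.0) for s in similarity}
    top = sorted(weighted, key=weighted.get, reverse=True)[:k]
    totals: Dict[str, float] = defaultdict(float)
    for synonym in top:
        totals[category_of[synonym]] += weighted[synonym]
    return max(totals, key=totals.get)


# Made-up values: synonym A is most similar but has a low weight factor.
SIMILARITY = {"A": 0.5, "B": 0.4, "C": 0.3}
CATEGORY_OF = {"A": "C1", "B": "C2", "C": "C2"}
WEIGHT = {"A": 0.4, "B": 1.0, "C": 1.0}

with_weights = weighted_knn_category(SIMILARITY, WEIGHT, CATEGORY_OF, k=2)
without_weights = weighted_knn_category(SIMILARITY, {}, CATEGORY_OF, k=2)
```

Comparing the two calls shows how the weight factors can change both which synonyms enter the extracted set and which category wins.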

The embodiments of Fig. 3 to Fig. 5 are described in detail below.

Fig. 3 is a flowchart of a method for determining the category of an unregistered word according to another embodiment of the present invention.

In step 301, an unregistered word is received.

In this embodiment, suppose the received unregistered word is "ice crystal".

In step 302, words that share one or more combining forms with the unregistered word are selected from the dictionary as synonyms of the unregistered word.

As mentioned above, the word-building rule may cover combining forms, constituent attributes, constituent relations, and the like, and combining forms may include the characters and/or immediate constituents that make up a word. Given an unregistered word and a dictionary, every dictionary word that shares one or more combining forms with the unregistered word is identified as a synonym of the unregistered word and placed in the synonym set. This can be regarded as one specific implementation of selecting synonyms of the unregistered word from the dictionary based on the word-building rule.

Sharing the same character is taken as an example below. Suppose the unregistered word is "ice crystal", which consists of two characters, glossed "ice" and "crystal". Suppose the dictionary words containing the character "ice" are "skates", "refrigerator-freezer", "ice rain", and "ice and snow", and the dictionary words containing the character "crystal" are "crystal", "crystal grain", and a second, distinct word that is also glossed "crystal". All of these words are regarded as synonyms of the unregistered word "ice crystal", so the synonym set of the unregistered word = {"skates", "refrigerator-freezer", "ice rain", "ice and snow", "crystal", "crystal grain", "crystal"}.
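
The shared-character selection in this example can be sketched in a few lines. Because the English glosses hide the shared characters, the sketch uses Chinese word forms; these forms are hypothetical reconstructions of the glossed words, "北京" is added only as a non-matching control, and the function name `select_synonyms` is invented for the example.

```python
from typing import List

# Hypothetical Chinese forms for the glossed dictionary words, plus one unrelated word.
DICTIONARY_WORDS: List[str] = ["冰刀", "冰柜", "冰雨", "冰雪", "晶体", "晶粒", "水晶", "北京"]


def select_synonyms(unregistered: str, dictionary_words: List[str]) -> List[str]:
    """Return dictionary words that share at least one character with the unregistered word."""
    chars = set(unregistered)
    return [w for w in dictionary_words if w != unregistered and chars & set(w)]


candidates = select_synonyms("冰晶", DICTIONARY_WORDS)
```

The result is a deliberately loose candidate set; later steps use context similarity to decide which of these candidates actually indicate the category.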

In step 303, the context of the unregistered word is generated from the collected works.

The context of the unregistered word can be generated by windowing, by a dependency tree, or in other ways well known to those skilled in the art. The specific implementation is described in step 202 and is not repeated here.

In step 304, the classifications to which the synonyms belong are collected.

In this step, the classification of each synonym of the unregistered word is obtained, and the results are then tallied to determine all the classifications to which these synonyms belong.

For example, "skates" belongs to classification C1, "refrigerator-freezer" belongs to classification C2, "ice rain" and "ice and snow" belong to classification C4, and "crystal", "crystal grain" and "crystal" belong to classification C3. As mentioned above, each word in the dictionary can be annotated with information such as its part of speech, classification, meaning and example sentences, so the classification of each word can be obtained directly from the dictionary. Alternatively, the classification of a word can be set manually.

In this example, the word belonging to classification C1 is "skates"; the word belonging to classification C2 is "refrigerator-freezer"; the words belonging to classification C3 are "crystal", "crystal grain" and "crystal"; and the words belonging to classification C4 are "ice rain" and "ice and snow".

It follows that the synonyms of the unregistered word "ice crystal" belong to classifications C1, C2, C3 and C4.
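The tallying of step 304 amounts to grouping the synonyms by their dictionary classification. A minimal sketch, assuming the dictionary lookup is available as a plain mapping (the names `category_of` and `group_by_category` are hypothetical):

```python
from collections import defaultdict

# Classification labels as given in the example for step 304.
category_of = {
    "skates": "C1",
    "refrigerator-freezer": "C2",
    "ice rain": "C4",
    "ice and snow": "C4",
    "crystal": "C3",
    "crystal grain": "C3",
}


def group_by_category(synonyms, category_of):
    """Map each classification to the synonyms that belong to it."""
    groups = defaultdict(list)
    for word in synonyms:
        groups[category_of[word]].append(word)
    return dict(groups)


groups = group_by_category(list(category_of), category_of)
print(sorted(groups))  # the classifications covered by the synonyms
```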

In step 305, the contexts of all the words included in each classification are generated from the collected works and used as the context of that classification.

In this step, all the words included in each classification are determined first. For example, suppose that classification C1 includes "hilted broadsword" and "machete" in addition to "skates", denoted C1 = {"skates", "hilted broadsword", "machete"}; that classification C2 includes "refrigerator" in addition to "refrigerator-freezer", denoted C2 = {"refrigerator-freezer", "refrigerator"}; that classification C3 includes only "crystal", "crystal grain" and "crystal", denoted C3 = {"crystal", "crystal grain", "crystal"}; and that classification C4 includes only "ice rain" and "ice and snow", denoted C4 = {"ice rain", "ice and snow"}.

Using the method for generating the context of a word from the collected works described in step 202, the context of each word included in the four classifications C1 to C4 can be generated. The contexts of all the words included in a classification can be regarded as the context of that classification. For example, the context of "skates", the context of "hilted broadsword" and the context of "machete", all of which are included in classification C1, are combined to serve as the context of classification C1, denoted: context of C1 = {context of "skates", context of "hilted broadsword", context of "machete"}.

In step 306, the similarity between the context of the unregistered word and the context of each classification is calculated.

As described above, the context of the unregistered word can be regarded as a vector. Because the context of a classification combines the contexts of all the words it includes, it can also be regarded as a vector. The similarity between the two vectors can therefore be calculated with the vector cosine measure (referred to herein as the cosine distance), as shown in formula (1):

$$\mathrm{CTS}(X, Y) = \frac{\sum_{j=1}^{n} x_j y_j}{\sqrt{\sum_{j=1}^{n} x_j^{2}}\,\sqrt{\sum_{j=1}^{n} y_j^{2}}} \qquad (1)$$

Here X and Y are two vectors, n is the length of the vectors X and Y, and x_j and y_j denote the j-th elements of X and Y respectively.

In the context of the present invention, X can be the context of the unregistered word and Y can be the context of a classification, and x_j and y_j can denote the weights corresponding to the j-th word in the contexts X and Y respectively. When the two contexts X and Y contain different numbers of elements, all the elements of the two vectors can be extracted to reconstruct corresponding new context vectors X′ and Y′. For X′, if an element does not occur in X, its weight is set to zero; Y′ is treated in the same way. The similarity of X and Y is then obtained by calculating the similarity of X′ and Y′ with formula (1). With this cosine distance calculation, the similarities between the context of the unregistered word and the context of each classification are obtained as:

Sim (context (ice crystal), context (C1))=0.71,

Sim (context (ice crystal), context (C2))=0.67,

Sim (context (ice crystal), context (C3))=0.81,

Sim (context (ice crystal), context (C4))=0.65,

Here context (ice crystal) denotes the context of the word "ice crystal", context (C1) denotes the context of classification C1, and Sim (A, B) denotes the similarity of A and B. The similarities between the context of the unregistered word "ice crystal" and the contexts of classifications C1, C2, C3 and C4 are thus 0.71, 0.67, 0.81 and 0.65 respectively.

Other methods well known to those skilled in the art can also be used to calculate the similarity between the two.
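Steps 305 and 306 can be sketched as follows, assuming each context is stored as a mapping from a context word to its weight. The source does not specify how the word contexts of a classification are combined, so summing the weights in `merge_contexts` is an assumption; both function names and the toy weights are hypothetical. Taking the union of the two vocabularies plays the role of the zero-padded vectors X′ and Y′.

```python
import math
from collections import Counter


def merge_contexts(word_contexts):
    """Combine the contexts of a classification's words into one context.
    Summing the weights is an assumption; the source only says 'combined'."""
    merged = Counter()
    for context in word_contexts:
        merged.update(context)  # Counter.update adds the weights
    return dict(merged)


def cosine_similarity(ctx_x, ctx_y):
    """Formula (1) over two contexts given as {word: weight} mappings.
    Words missing from one context contribute a weight of zero."""
    vocabulary = set(ctx_x) | set(ctx_y)
    dot = sum(ctx_x.get(t, 0.0) * ctx_y.get(t, 0.0) for t in vocabulary)
    norm_x = math.sqrt(sum(v * v for v in ctx_x.values()))
    norm_y = math.sqrt(sum(v * v for v in ctx_y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # an empty context is treated as dissimilar to everything
    return dot / (norm_x * norm_y)


# Toy weights, chosen only for illustration.
unregistered_ctx = {"cold": 1.0, "water": 2.0}
category_ctx = merge_contexts([{"water": 1.0}, {"water": 1.0, "snow": 1.0}])
score = cosine_similarity(unregistered_ctx, category_ctx)
```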

In step 307, the classification corresponding to the maximum similarity is determined as the classification to which the unregistered word belongs.

Comparing the similarities calculated in step 306 shows that the similarity between the context of the unregistered word "ice crystal" and the context of classification C3 is the highest, so the classification of the unregistered word "ice crystal" can be determined as classification C3.
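The decision of step 307 is a simple maximum over the per-classification similarities. A minimal sketch using the values listed for step 306 (the variable names are hypothetical):

```python
# Similarities between the context of "ice crystal" and each classification
# context, as listed in the example for step 306.
similarity_to_category = {"C1": 0.71, "C2": 0.67, "C3": 0.81, "C4": 0.65}

# Pick the classification whose context is most similar.
best_category = max(similarity_to_category, key=similarity_to_category.get)
print(best_category)
```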

Fig. 4 is a flowchart of a method for determining the classification of an unregistered word according to another embodiment of the invention.

In step 401, an unregistered word is received.

In this embodiment, as in the embodiment of Fig. 3, suppose that the received unregistered word is "ice crystal".

In step 402, the part of speech of the unregistered word is determined.

The part of speech of the unregistered word can be determined in several ways. For example, various known models can be used to guess the part of speech of the unregistered word, or it can be labelled manually. In the present embodiment, suppose that the part of speech of the unregistered word "ice crystal" is noun.

In step 403, words that share a combining form with the unregistered word are selected from the dictionary.

For example, suppose that the unregistered word is "ice crystal". As in step 302, the set of words sharing a character with the unregistered word "ice crystal" can be determined to be {"skates", "refrigerator-freezer", "ice rain", "ice and snow", "crystal", "crystal grain"}.

Unlike step 302, this set is not used directly as the synonyms of the unregistered word; instead, the part-of-speech filtering of step 404 is performed next.

In step 404, the words among the selected words that have the same part of speech as the unregistered word are selected as the synonyms of the unregistered word.

As mentioned above, the word-building rule may include combining forms, composition attributes, composition relations and the like, and the composition attributes may include, for example, the label, length and part of speech of a word. In the embodiment shown in Fig. 4, the part of speech in the word-building rule is used to select the synonyms of the unregistered word.

In the present embodiment, step 402 determined that the part of speech of the unregistered word "ice crystal" is noun, and the part of speech of each word in the set {"skates", "refrigerator-freezer", "ice rain", "ice and snow", "crystal", "crystal grain"} can be obtained from the dictionary. In step 404, the nouns in this set can therefore be selected as the synonyms of the unregistered word "ice crystal".
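The part-of-speech filtering of step 404 can be sketched as a lookup followed by a filter. The source does not list the dictionary's part-of-speech tags; the tags below are assumptions chosen so that all six candidates survive, consistent with the six similarities computed later in step 407, and the extra verb entry "freeze" is hypothetical and only shows a word being filtered out.

```python
def filter_by_part_of_speech(candidates, pos_of, target_pos):
    """Keep only the candidates whose dictionary part of speech matches."""
    return [word for word in candidates if pos_of.get(word) == target_pos]


# Assumed part-of-speech tags; "freeze" is a hypothetical extra candidate.
pos_of = {
    "skates": "noun",
    "refrigerator-freezer": "noun",
    "ice rain": "noun",
    "ice and snow": "noun",
    "crystal": "noun",
    "crystal grain": "noun",
    "freeze": "verb",
}
candidates = list(pos_of)
synonyms = filter_by_part_of_speech(candidates, pos_of, "noun")
print(synonyms)
```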

In step 405, the context of the unregistered word is generated from the collected works.

In step 406, the contexts of the synonyms are generated from the collected works.

The contexts of the synonyms can be generated by windowing, by a dependency tree, or in other ways well known to those skilled in the art. The specific implementation is described in step 202 and is not repeated here.

In step 407, the similarity between the context of the unregistered word and the context of each synonym is calculated.

The context of the unregistered word can be regarded as a vector, and the context of a synonym can also be regarded as a vector, so the similarity between the two vectors can be calculated with the vector cosine distance of formula (1).

In the context of the present invention, X can be the context of the unregistered word and Y can be the context of one synonym of this unregistered word, and x_j and y_j can denote the weights corresponding to the j-th element in the contexts X and Y respectively. With the cosine distance calculation described above, the similarities between the context of the unregistered word and the contexts of its synonyms are obtained as:

Sim (context (ice crystal), context (skates))=0.30,

Sim (context (ice crystal), context (refrigerator-freezer))=0.67,

Sim (context (ice crystal), context (crystal))=0.81,

Sim (context (ice crystal), context (crystal grain))=0.74,

Sim (context (ice crystal), context (ice rain))=0.69,

Sim (context (ice crystal), context (ice and snow))=0.56,

Here context (ice crystal) denotes the context of the unregistered word "ice crystal", context (skates) denotes the context of "skates", a synonym of the unregistered word "ice crystal", and Sim (A, B) denotes the similarity of A and B. The similarities between the context of the unregistered word "ice crystal" and the contexts of its synonyms "skates", "refrigerator-freezer", "crystal", "crystal grain", "ice rain" and "ice and snow" are thus 0.30, 0.67, 0.81, 0.74, 0.69 and 0.56 respectively.

In step 408, a set is extracted from the synonyms of the unregistered word according to the similarities.

The number of synonyms in the set to be extracted can be set in advance. In one example, the set is configured to contain a predetermined number of synonyms, where the predetermined number can be any number less than or equal to the total number of synonyms of the unregistered word. In the present embodiment, the predetermined number is denoted K and is assumed to be 5, that is, K=5.

First, the similarities obtained in step 407 can be sorted by magnitude.

In the present embodiment, step 407 calculates six similarities in total. Sorting them in descending order gives the sequence 0.81, 0.74, 0.69, 0.67, 0.56, 0.30, and the synonyms corresponding to the similarities in this sequence are, respectively, "crystal", "crystal grain", "ice rain", "refrigerator-freezer", "ice and snow" and "skates".

Then, the synonyms corresponding to the predetermined number of top-ranked similarities are extracted into the set.

In the present embodiment, because the predetermined number is K=5 and the unregistered word has six synonyms in total, the first five similarities in descending order are selected, namely 0.81, 0.74, 0.69, 0.67 and 0.56, and the corresponding synonyms "crystal", "crystal grain", "ice rain", "refrigerator-freezer" and "ice and snow" are extracted and placed in a set as its members.
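The extraction of step 408 is a sort followed by a cut at K. A minimal sketch using the similarities of step 407 (the function name `top_k_synonyms` is hypothetical):

```python
def top_k_synonyms(similarities, k):
    """Return the k (synonym, similarity) pairs with the highest similarity."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Similarities from the example for step 407.
similarities = {
    "skates": 0.30,
    "refrigerator-freezer": 0.67,
    "crystal": 0.81,
    "crystal grain": 0.74,
    "ice rain": 0.69,
    "ice and snow": 0.56,
}
extracted = top_k_synonyms(similarities, k=5)
print([word for word, _ in extracted])
```

With K=5 the lowest-ranked synonym, "skates", is the only one left out of the set.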

In step 409, the similarities corresponding to the synonyms in this set that belong to the same classification are summed.

In this step, the classifications to which the synonyms of the unregistered word belong can be determined first. This can be done in the manner described in step 304 and yields the same result as step 304: the word belonging to classification C2 is "refrigerator-freezer"; the words belonging to classification C3 are "crystal", "crystal grain" and "crystal"; and the words belonging to classification C4 are "ice rain" and "ice and snow". The synonyms contained in the set extracted in step 408 thus belong to classifications C2, C3 and C4.

Then, the similarities between the context of the unregistered word and the contexts of the synonyms belonging to the same classification are summed, giving the similarity between the unregistered word and each classification, for example:

Sim (ice crystal, C2)=Sim (context (ice crystal), context (refrigerator-freezer))=0.67,

Sim (ice crystal, C3)=Sim (context (ice crystal), context (crystal))+Sim (context (ice crystal), context (crystal grain))=1.55,

Sim (ice crystal, C4)=Sim (context (ice crystal), context (ice rain))+Sim (context (ice crystal), context (ice and snow))=1.25.

In step 410, the classification to which the unregistered word belongs is determined according to the summed similarities.

Sorting the similarities between the unregistered word and each classification obtained in step 409 shows that the unregistered word "ice crystal" has the highest similarity with classification C3, so classification C3 can be determined as the classification of the unregistered word.
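Steps 409 and 410 can be sketched as a per-classification sum followed by a maximum. The numbers are those of the extracted set from step 408; the function name is hypothetical.

```python
from collections import defaultdict


def classify_by_summed_similarity(similarities, category_of):
    """Sum the similarities per classification and return the classification
    with the largest total, together with all totals."""
    totals = defaultdict(float)
    for word, score in similarities.items():
        totals[category_of[word]] += score
    best = max(totals, key=totals.get)
    return best, dict(totals)


# The five synonyms extracted in step 408 and their similarities.
extracted = {
    "crystal": 0.81,
    "crystal grain": 0.74,
    "ice rain": 0.69,
    "refrigerator-freezer": 0.67,
    "ice and snow": 0.56,
}
category_of = {
    "crystal": "C3",
    "crystal grain": "C3",
    "ice rain": "C4",
    "refrigerator-freezer": "C2",
    "ice and snow": "C4",
}
best, totals = classify_by_summed_similarity(extracted, category_of)
print(best, totals)
```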

In addition, in some embodiments of the invention, other rules can be used to determine the classification to which the unregistered word belongs according to the summed similarities. For example, instead of choosing the maximum similarity between the unregistered word and the classifications, the classification corresponding to the median of these similarities can be determined as the classification of the unregistered word.

In step 501, an unregistered word is received.

In this embodiment, suppose that the received unregistered word is "electrical machinery plant".

In step 502, words that share one or more combining forms with the unregistered word are selected from the dictionary as synonyms of the unregistered word.

As in step 302, the synonyms selected for this unregistered word in step 502 based on the word-building rule are "energising", "incoming call", "making a phone call", "electrical apparatus factory", "factory director" and "factory owner".

In step 503, the context of the unregistered word is generated from the collected works.

In step 504, the contexts of the synonyms are generated from the collected works.

The contexts of the synonyms of the unregistered word can be generated by windowing, by a dependency tree, or in other ways well known to those skilled in the art. The specific implementation is described in step 202 and is not repeated here.

In step 505, the similarity between the context of the unregistered word and the context of each synonym is calculated.

This step is similar to step 407 and is not described again. The similarities obtained in step 505 between the context of the unregistered word "electrical machinery plant" and the contexts of its synonyms are:

Sim (context (electrical machinery plant), context (energising))=0.10,

Sim (context (electrical machinery plant), context (incoming call))=0.27,

Sim (context (electrical machinery plant), context (making a phone call))=0.45,

Sim (context (electrical machinery plant), context (electrical apparatus factory))=0.30,

Sim (context (electrical machinery plant), context (factory director))=0.30,

Sim (context (electrical machinery plant), context (factory owner))=0.20.

In step 506, the classifications to which the synonyms belong are collected.

This step can be carried out in the manner described in step 304 and gives the following result: the word belonging to classification C1 is "energising"; the words belonging to classification C2 are "incoming call" and "making a phone call"; the word belonging to classification C3 is "electrical apparatus factory"; and the words belonging to classification C4 are "factory director" and "factory owner".

In step 507, the predetermined weight factors associated with the synonyms are received.

The context of a word is very important for judging its classification, and the structural information of the word is equally important. The present invention therefore proposes the concept of hybrid similarity, in which the structural information of words is used to weight the similarity between the context of the unregistered word and the context of a synonym. In the present embodiment, the structural information of a word is represented, for example, by the predetermined weight factor λ(w, w_i). Weighting the similarity between the context of the unregistered word and the context of a synonym with the predetermined weight factor is expressed by the following formula:

$$\mathrm{Sim}(w, w_i) = \lambda(w, w_i) \cdot \mathrm{CTS}(w, w_i) \qquad (2)$$

Here w is the unregistered word, w_i is a synonym of the unregistered word, λ(w, w_i) is the weighting factor based on the structural information of the unregistered word w and its synonym w_i, and CTS(w, w_i) is the similarity between the context of the unregistered word w and the context of the synonym w_i.

The weighting factor can be specified in various ways. In one embodiment, the assignment of the weighting factor follows the strategy below:

If the unregistered word w and the synonym w_i share the last character and also share the penultimate character, the weighting factor λ(w, w_i) is set to λ₁, for example λ(aluminium alloy, ferroalloy) = λ₁;

Otherwise, if the unregistered word w and the synonym w_i share the first character and also share the last character, the weighting factor λ(w, w_i) is set to λ₂, for example λ(electrical machinery plant, electrical apparatus factory) = λ₂;

Otherwise, if the unregistered word w and the synonym w_i share the first character or share the last character, the weighting factor λ(w, w_i) is set to λ₃, for example λ(the base people, citizen) = λ₃;

In all other cases, the weighting factor λ(w, w_i) is set to λ₄.

Here λ₁ ≥ λ₂ ≥ λ₃ ≥ λ₄, and the corresponding values can be obtained by experiment.
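The weighting strategy above can be sketched as a cascade of positional character comparisons. In this sketch "sharing the first (last) character" is read as both words having the same character in first (last) position, which is an interpretation; it is consistent with the later example in which synonyms that contain a character of the unregistered word in a different position receive λ₄. The values 0.382 for λ₄ and 10 for λ₂ appear in the example for step 508, while the values used for λ₁ and λ₃ are placeholders that merely respect the ordering λ₁ ≥ λ₂ ≥ λ₃ ≥ λ₄. Words are assumed to be non-empty strings, and single letters stand in for characters in the usage lines.

```python
def weight_factor(word, synonym, lambdas):
    """Return λ(word, synonym) according to the four-tier strategy.
    `lambdas` is (λ1, λ2, λ3, λ4); both words must be non-empty."""
    lam1, lam2, lam3, lam4 = lambdas
    same_first = word[0] == synonym[0]
    same_last = word[-1] == synonym[-1]
    same_penultimate = (len(word) >= 2 and len(synonym) >= 2
                        and word[-2] == synonym[-2])
    if same_last and same_penultimate:
        return lam1
    if same_first and same_last:
        return lam2
    if same_first or same_last:
        return lam3
    return lam4


# λ2 and λ4 follow the example for step 508; λ1 and λ3 are placeholders.
LAMBDAS = (20.0, 10.0, 1.0, 0.382)

print(weight_factor("ABC", "XBC", LAMBDAS))  # last two characters shared
print(weight_factor("ABC", "ADC", LAMBDAS))  # first and last shared
print(weight_factor("ABC", "AXY", LAMBDAS))  # only the first shared
print(weight_factor("ABC", "CD", LAMBDAS))   # shared character, other position
```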

In step 508, the predetermined weight factors are used to weight the similarities corresponding to the associated synonyms.

In one example, according to step 507, λ(electrical machinery plant, energising), λ(electrical machinery plant, incoming call), λ(electrical machinery plant, making a phone call), λ(electrical machinery plant, factory director) and λ(electrical machinery plant, factory owner) can each be set to λ₄ = 0.382, and λ(electrical machinery plant, electrical apparatus factory) can be set to λ₂ = 10.

The weighting factors obtained according to step 507 and the similarities between the context of the unregistered word and the contexts of the synonyms obtained according to step 505 can be applied to formula (2), giving the weighted similarities as follows:

Sim (electrical machinery plant, energising) = Sim (context (electrical machinery plant), context (energising)) * λ₄ = 0.10 * λ₄ = 0.038,

Sim (electrical machinery plant, incoming call) = Sim (context (electrical machinery plant), context (incoming call)) * λ₄ = 0.27 * λ₄ = 0.103,

Sim (electrical machinery plant, making a phone call) = Sim (context (electrical machinery plant), context (making a phone call)) * λ₄ = 0.45 * λ₄ = 0.172,

Sim (electrical machinery plant, electrical apparatus factory) = Sim (context (electrical machinery plant), context (electrical apparatus factory)) * λ₂ = 0.30 * λ₂ = 3.0,

Sim (electrical machinery plant, factory director) = Sim (context (electrical machinery plant), context (factory director)) * λ₄ = 0.30 * λ₄ = 0.115,

Sim (electrical machinery plant, factory owner) = Sim (context (electrical machinery plant), context (factory owner)) * λ₄ = 0.20 * λ₄ = 0.076.
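The weighting of step 508 applies formula (2) to each synonym. A minimal sketch that reproduces the figures above from the context similarities of step 505 and the weight factors λ₂ = 10 and λ₄ = 0.382; rounding to three decimals is done here only to match the precision printed in the text, and the variable names are hypothetical.

```python
# Context similarities CTS(w, w_i) from step 505.
context_similarity = {
    "energising": 0.10,
    "incoming call": 0.27,
    "making a phone call": 0.45,
    "electrical apparatus factory": 0.30,
    "factory director": 0.30,
    "factory owner": 0.20,
}

LAMBDA_2, LAMBDA_4 = 10.0, 0.382

# Weight factor per synonym, as assigned in the example for step 508.
weight = {word: LAMBDA_4 for word in context_similarity}
weight["electrical apparatus factory"] = LAMBDA_2

# Formula (2): Sim(w, w_i) = λ(w, w_i) * CTS(w, w_i).
weighted_similarity = {
    word: round(weight[word] * cts, 3)
    for word, cts in context_similarity.items()
}
print(weighted_similarity)
```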

In step 509, a set is extracted from the synonyms of the unregistered word according to the similarities.

This step is similar to step 408. First, the weighted similarities obtained in step 508 can be sorted by magnitude. Then, the synonyms corresponding to the predetermined number of top-ranked similarities are extracted into the set.

In the present embodiment, the predetermined number is again assumed to be K=5, so the first five similarities in descending order are selected, namely 3.0, 0.172, 0.115, 0.103 and 0.076, and the corresponding synonyms "electrical apparatus factory", "making a phone call", "factory director", "incoming call" and "factory owner" are extracted and placed in a set as its members.

In step 510, the weighted similarities corresponding to the synonyms in the extracted set that belong to the same classification are summed.

Step 510 is similar to step 409.

First, according to the result of step 506, the classification of "incoming call" and "making a phone call" in the extracted set is C2, the classification of "electrical apparatus factory" is C3, and the classification of "factory director" and "factory owner" is C4. The synonyms contained in the set extracted in step 509 thus belong to classifications C2, C3 and C4, and these classifications are the candidate classifications of the unregistered word.

Sim (electrical machinery plant, C2)=Sim (electrical machinery plant, making a phone call)+Sim (electrical machinery plant, incoming call)=0.275,

Sim (electrical machinery plant, C3)=Sim (electrical machinery plant, electrical apparatus factory)=3.0,

Sim (electrical machinery plant, C4)=Sim (electrical machinery plant, factory director)+Sim (electrical machinery plant, factory owner)=0.191.

In step 511, the classification to which the unregistered word belongs is determined according to the summed similarities.

Sorting the similarities between the unregistered word and each classification obtained in step 510 shows that the unregistered word "electrical machinery plant" has the highest similarity with classification C3, so classification C3 can be determined as the classification of the unregistered word.
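Steps 509 to 511 can be sketched end to end on the weighted similarities of the "electrical machinery plant" example. This is an illustrative sketch under the assumption that the final rule is the maximum summed similarity; the function name `decide_category` is hypothetical.

```python
def decide_category(weighted_similarity, category_of, k):
    """Keep the k synonyms with the highest weighted similarity, sum the
    similarities per classification, and return the best classification."""
    top = sorted(weighted_similarity, key=weighted_similarity.get,
                 reverse=True)[:k]
    totals = {}
    for word in top:
        category = category_of[word]
        totals[category] = totals.get(category, 0.0) + weighted_similarity[word]
    return max(totals, key=totals.get), totals


# Weighted similarities from step 508 and classifications from step 506.
weighted_similarity = {
    "energising": 0.038,
    "incoming call": 0.103,
    "making a phone call": 0.172,
    "electrical apparatus factory": 3.0,
    "factory director": 0.115,
    "factory owner": 0.076,
}
category_of = {
    "energising": "C1",
    "incoming call": "C2",
    "making a phone call": "C2",
    "electrical apparatus factory": "C3",
    "factory director": "C4",
    "factory owner": "C4",
}
best, totals = decide_category(weighted_similarity, category_of, k=5)
print(best, totals)
```

Because "energising" has the lowest weighted similarity, it falls outside the top five and classification C1 never becomes a candidate.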

The present invention selects the synonyms of an unregistered word from a dictionary based on a word-building rule, generates the context of the unregistered word from the collected works, and determines the classification to which the unregistered word belongs according to the context and the synonyms of the unregistered word. The invention addresses the low performance of the prior art; it addresses the problem of selecting classifications with high coverage by automatically selecting synonyms from an existing dictionary based on the word-building rule; and it addresses the problem of how to fuse word structure information with contextual information in order to calculate semantic similarity accurately.

The method of the present invention can be implemented in software, in hardware, or in a combination of software and hardware. The hardware part can be implemented with dedicated logic; the software part can be stored in a memory and executed by a suitable instruction execution system, for example a microprocessor, a personal computer (PC) or a mainframe.

It should be noted that, to make the present invention easier to understand, the description above omits some more specific technical details that are well known to those skilled in the art and that may be essential for implementing the present invention.

The description of the present invention is provided for the purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the disclosed form. Many modifications and changes will be apparent to those of ordinary skill in the art.

The embodiments were therefore chosen and described in order to better explain the principles of the present invention and its practical application, and to enable those of ordinary skill in the art to understand that all modifications and changes made without departing from the essence of the present invention fall within the protection scope of the present invention as defined by the claims.

Claims

1. A method for determining the classification of an unregistered word, comprising:

selecting synonyms of the unregistered word from a dictionary based on a word-building rule;

generating a context of the unregistered word from collected works; and

determining the classification to which the unregistered word belongs according to the context of the unregistered word and the synonyms;

wherein the classification to which the unregistered word belongs is determined according to the context of the unregistered word and the synonyms in any one of the following ways:

collecting the classifications to which the synonyms belong;

generating, from the collected works, the contexts of all the words included in each classification as the context of that classification;

calculating the similarity between the context of the unregistered word and the context of each classification; and

determining the classification corresponding to the maximum similarity as the classification to which the unregistered word belongs;

or

generating contexts of the synonyms from the collected works;

calculating the similarity between the context of the unregistered word and the context of each synonym;

extracting a set from the synonyms according to the similarities;

summing the similarities corresponding to the synonyms in the set that belong to the same classification; and

determining the classification to which the unregistered word belongs according to the summed similarities;

or

generating contexts of the synonyms from the collected works;

collecting the classifications to which the synonyms belong;

receiving predetermined weight factors associated with the synonyms;

weighting, with the received predetermined weight factors, the similarities corresponding to the associated synonyms;

extracting a set from the synonyms according to the similarities;

summing the weighted similarities corresponding to the synonyms in the set that belong to the same classification; and

determining the classification to which the unregistered word belongs according to the summed similarities.

2. The method according to claim 1, wherein the word-building rule comprises combining forms, composition attributes and composition relations.

3. The method according to claim 2, wherein the step of selecting the synonyms of the unregistered word from the dictionary based on the word-building rule comprises:

selecting, from the dictionary, words that share one or more combining forms with the unregistered word as the synonyms of the unregistered word.

4. The method according to claim 2, wherein the step of selecting the synonyms of the unregistered word from the dictionary based on the word-building rule comprises:

determining the part of speech of the unregistered word;

selecting, from the dictionary, words that share one or more combining forms with the unregistered word; and

selecting, among the selected words, the words having the same part of speech as the unregistered word as the synonyms of the unregistered word.

5. The method according to claim 1, wherein the step of generating the context of the unregistered word from the collected works comprises:

searching for the unregistered word in the collected works;

extracting, by windowing, the text adjacent to the unregistered word;

performing word segmentation on the extracted text adjacent to the unregistered word; and

determining the weight of each word obtained by the word segmentation, so that each word obtained by the word segmentation and its weight are used as the context of the unregistered word.

6. The method according to claim 1, wherein the step of generating the context of the unregistered word from the collected works comprises:

searching for the unregistered word in the collected works; and

analysing the dependency relations of the unregistered word by means of a dependency tree, so that the dependency relations are used as the context of the unregistered word.

7. The method according to claim 1, wherein the assignment of the predetermined weight factor follows the strategy below:

if the unregistered word and a word in a classification share the last character and also share the penultimate character, the predetermined weight factor associated with that classification is set to λ₁; otherwise,

if the unregistered word and a word in a classification share the first character and also share the last character, the predetermined weight factor associated with that classification is set to λ₂; otherwise,

if the unregistered word and a word in a classification share only the first character or share only the last character, the predetermined weight factor associated with that classification is set to λ₃; otherwise,

the predetermined weight factor associated with that classification is set to λ₄,

wherein λ₁ ≥ λ₂ ≥ λ₃ ≥ λ₄.

8. The method according to claim 1, wherein the step of extracting a set from the synonyms according to the similarities comprises:

sorting the similarities by magnitude; and

extracting into the set the synonyms corresponding to the predetermined number of top-ranked similarities.

9. An apparatus for determining the classification of an unregistered word, comprising:

a synonym selector configured to select synonyms of the unregistered word from a dictionary based on a word-building rule;

a context generator configured to generate a context of the unregistered word from collected works; and

a classification determiner configured to determine the classification to which the unregistered word belongs according to the context of the unregistered word and the synonyms;

wherein the classification determiner comprises:

means for collecting the classifications to which the synonyms belong;

means for generating, from the collected works, the contexts of all the words included in each classification as the context of that classification;

means for calculating the similarity between the context of the unregistered word and the context of each classification; and

means for determining the classification corresponding to the maximum similarity as the classification to which the unregistered word belongs;

or

wherein the context generator comprises means for generating contexts of the synonyms from the collected works, and the classification determiner comprises:

means for calculating the similarity between the context of the unregistered word and the context of each synonym;

means for extracting a set from the synonyms according to the similarities;

means for summing the similarities corresponding to the synonyms in the set that belong to the same classification; and

means for determining the classification to which the unregistered word belongs according to the summed similarities;

or

means for collecting the classifications to which the synonyms belong;

means for receiving predetermined weight factors associated with the synonyms;

means for weighting, with the received predetermined weight factors, the similarities corresponding to the associated synonyms;

means for summing the weighted similarities corresponding to the synonyms in the set that belong to the same classification; and

means for determining the classification to which the unregistered word belongs according to the summed similarities.

10. The apparatus according to claim 9, wherein the word-building rule comprises combining forms, composition attributes and composition relations.

11. The apparatus according to claim 10, wherein the synonym selector comprises:

means for selecting, from the dictionary, words that share one or more combining forms with the unregistered word as the synonyms of the unregistered word.

12. The apparatus according to claim 10, wherein the synonym selector comprises:

means for determining the part of speech of the unregistered word;

means for selecting, from the dictionary, words that share one or more combining forms with the unregistered word; and

means for selecting, among the selected words, the words having the same part of speech as the unregistered word as the synonyms of the unregistered word.

13. The apparatus according to claim 9, wherein the context generator comprises:

means for searching for the unregistered word in the collected works;

means for extracting, by windowing, the text adjacent to the unregistered word;

means for performing word segmentation on the extracted text adjacent to the unregistered word; and

means for determining the weight of each word obtained by the word segmentation, so that each word obtained by the word segmentation and its weight are used as the context of the unregistered word.

14. The apparatus according to claim 9, wherein the context generator comprises:

means for searching for the unregistered word in the collected works; and

means for analysing the dependency relations of the unregistered word by means of a dependency tree, so that the dependency relations are used as the context of the unregistered word.

15. The apparatus according to claim 9, wherein the assignment of the predetermined weight factor follows the strategy below:

wherein λ₁ ≥ λ₂ ≥ λ₃ ≥ λ₄.

16. The apparatus according to claim 9, wherein the means for extracting a set from the synonyms according to the similarities comprises:

means for sorting the similarities by magnitude; and

means for extracting into the set the synonyms corresponding to the predetermined number of top-ranked similarities.