CN109033307A

CN109033307A - Multi-prototype word vector representation and word sense disambiguation method based on CRP clustering

Info

Publication number: CN109033307A
Application number: CN201810783010.5A
Authority: CN
Inventors: 李国佳; 郭鸿奇; 杨喜亮; 王国卿; 杨振中
Original assignee: North China University of Water Resources and Electric Power
Current assignee: North China University of Water Resources and Electric Power
Priority date: 2018-07-17
Filing date: 2018-07-17
Publication date: 2018-12-18
Anticipated expiration: 2038-07-17
Also published as: CN109033307B

Abstract

The invention discloses a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. Step 1: the texts in a large-scale text corpus are cleaned to obtain plain text; the context-window representations of each target polysemous word in the corpus are clustered with a CRP-based algorithm; the occurrences of the target polysemous word in the corpus are labeled with their cluster categories; and multi-prototype vector representations of the polysemous word are trained on the labeled corpus. Step 2: a target short text is preprocessed to obtain its word sequence; the target polysemous words in the sequence are identified; the similarity between the context-window representation of each target polysemous word and each cluster centroid associated with that word in the corpus is computed; and the word vector of the cluster with the highest similarity is taken as the word vector of the specific sense of the polysemous word in that context, thereby disambiguating the polysemous word. The invention addresses the representation of polysemy in word representations and the identification of ambiguity in word sense representations.

Description

Multi-prototype word vector representation and word sense disambiguation method based on CRP clustering

Technical field

The present invention relates to the field of natural language processing, and in particular to a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering.

Background art

In many natural language processing tasks, the basic problem is how to represent linguistic symbols in an encoding that machines can process. Mapping linguistic symbols such as words, sentences, and texts to continuous low-dimensional vectors yields semantic vector representations of words, sentences, and texts, which are widely used in tasks such as information retrieval, short-text classification, named entity recognition, sentiment analysis, recommendation engines, and automatic text summarization.

Words are the most basic units of language, and vector representations of words are widely used in natural language processing tasks. A simple word vector representation is the one-hot representation. Its drawbacks are that the vector dimension equals the total number of words, which leads to the curse of dimensionality; it cannot capture semantic relations between words; and it cannot reflect the different senses of a polysemous word.

A word vector (word embedding or word representation) is a fixed-length, low-dimensional real-valued vector learned by training on large amounts of text. Each word obtains a unique vector, with the property that similar or related words lie closer together. However, because some words are polysemous, the same word form may express different meanings in different contexts. Most conventional word vector models assign only one vector per word and therefore cannot effectively express the different senses of a polysemous word. Each sense of a polysemous word should correspond to its own vector.

A multi-prototype word vector representation assigns one word vector to each sense of a polysemous word, which can improve the accuracy of word representation. The vectors for the different senses of a word are usually obtained with clustering-based models: word senses are induced by clustering the contexts of a word, either by clustering the contexts in the original text directly or by clustering after a semantic mapping based on cross-lingual knowledge, and a further training pass then yields the word vector for each specific sense of the word in different contexts.

In methods that obtain polysemous word vectors by combining the k-means clustering algorithm with neural network language model training, the parameter k (the number of clusters) must be set to a different value for each polysemous word according to its number of senses. By contrast, training multi-prototype word vectors with CRP clustering does not require the number of clusters to be specified in advance, which matches the fact that different polysemous words have different numbers of senses in context.

High-quality word sense representations capture rich semantic and syntactic information and help word sense disambiguation; conversely, high-quality word sense disambiguation helps learn better word sense representations. There are two main classes of word sense disambiguation methods: methods based on external knowledge bases and corpus-based methods. Knowledge-base methods identify the specific meaning of a polysemous word from the different sense definitions or descriptions given in an external knowledge base (such as WordNet or HowNet), but building such a knowledge base or dictionary requires substantial manpower and resources. Corpus-based methods use a corpus as the knowledge resource and determine the specific sense of a word in a given context through automatic or semi-automatic learning, thereby achieving word sense disambiguation.

For a polysemous word in a sentence, the given word sense disambiguation method uses the multi-prototype word vectors obtained from a text corpus to determine the specific sense of the word in its context, which helps improve the quality of word and sentence representations.

As Internet technology and mobile applications have become part of daily life, people increasingly use mobile terminals to transmit information and communicate, producing massive amounts of data such as news headlines, microblog posts, descriptions of goods or services on shopping platforms, forum comments, intelligent interactive applications, and social chat messages. These data usually consist of text of short length and are a typical form of short text. Such short-text data contain a large amount of high-value information and have high research value. Using machines to process and understand the massive short-text data on the Internet effectively has become an important research challenge and focus in natural language processing and machine learning.

In similarity computation for information retrieval, the multi-prototype word vector representation and word sense disambiguation method can distinguish the specific sense of a polysemous word in a retrieval object, improving the accuracy of word representation and computation. It provides an effective word semantic representation and word sense disambiguation method for short-text retrieval and technical support for semantic computation.

Summary of the invention

The purpose of the present invention is to overcome the above problems of the prior art by providing a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. The multi-prototype representation assigns one word vector to each sense of a polysemous word, which addresses the representation of polysemy in word representations; the word sense disambiguation method based on the multi-prototype representation addresses the identification of ambiguity in word sense representations.

The technical solution of the present invention is a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering, comprising the following steps:

Step S1: perform cleaning preprocessing on the texts in a large-scale text corpus to obtain plain text; cluster the context-window representations of the target polysemous word in the text corpus with a CRP-based algorithm; label the occurrences of the target polysemous word in the corpus with their cluster categories; and train on the labeled corpus to obtain the multi-prototype vector representation of the polysemous word;

Step S2: preprocess the target short text to obtain its word sequence; identify the target polysemous words in the word sequence; compute the similarity between the context-window representation of the target polysemous word and each cluster centroid associated with that word in the text corpus; and take the word vector of the cluster with the highest similarity as the word vector of the specific sense of the polysemous word in its context, thereby disambiguating the polysemous word.

The cleaning preprocessing of the texts in the large-scale text corpus in step S1 comprises: deleting texts whose character count is below a preset threshold; converting traditional Chinese characters to simplified Chinese characters; replacing English abbreviations in the corpus with Chinese words using a Chinese-English abbreviation dictionary; segmenting the texts in the corpus into words; removing stop words; deleting all characters other than Chinese characters and digits; counting word frequencies; capping the frequency of high-frequency words at a preset upper threshold; selecting the words whose frequency in the corpus exceeds a preset lower threshold to build a vocabulary; and building a polysemous-word list from a polysemy dictionary.

The context-window representation of the target polysemous word in step S1 is obtained by averaging the word vectors of the words in the word's context, computed as:

$$veC = \frac{1}{|Context|} \sum_{w_i \in Context} vec(w_i) \qquad (1)$$

where $veC$ is the context-window representation of the word, $Context$ is the set of words in the word's context window, $w_i$ is the $i$-th word of $Context$, and $vec(w_i)$ is the initial word vector of $w_i$.
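
The averaging in formula (1) can be illustrated with a minimal Python sketch. The function name `context_window_vector`, the symmetric window, and the toy two-dimensional vectors are assumptions made for illustration; the source only states that the window size is a positive integer and that the vectors of the context words are averaged.

```python
from typing import Dict, List, Optional, Sequence


def context_window_vector(
    tokens: Sequence[str],
    index: int,
    window: int,
    vectors: Dict[str, List[float]],
) -> Optional[List[float]]:
    """Average the vectors of the words within `window` positions of tokens[index].

    The target word itself is excluded, and context words without a vector
    are skipped (an assumption; the source does not say how they are handled here).
    """
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    context = [tokens[j] for j in range(lo, hi) if j != index and tokens[j] in vectors]
    if not context:
        return None
    dim = len(vectors[context[0]])
    total = [0.0] * dim
    for word in context:
        for d, value in enumerate(vectors[word]):
            total[d] += value
    return [value / len(context) for value in total]


# Toy vectors, invented for the example.
toy_vectors = {"eat": [1.0, 0.0], "sweet": [0.0, 1.0], "red": [1.0, 1.0]}
vec_c = context_window_vector(["eat", "sweet", "apple", "red"], 2, 2, toy_vectors)
```

Here `vec_c` is the mean of the vectors of "eat", "sweet", and "red", the three context words of "apple".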

Clustering the context-window representations of the target polysemous word in the text corpus with the CRP-based algorithm in step S1 comprises the following steps:

Step S101: obtain the context-window representations of all samples of the polysemous word in the text corpus;

Step S102: obtain the initial cluster centroid for the CRP clustering algorithm, either by taking a random sample as the initial cluster centroid, or by first clustering the context-window representations of the polysemous word with the k-means algorithm and taking the centroid of the cluster containing the most samples as the initial cluster centroid;

Step S103: for the context-window representation of every sample of the polysemous word, compute the similarity between the sample and each existing cluster centroid, and obtain the maximum similarity Smax, attained between the i-th sample and the t-th cluster centroid; if Smax is greater than the preset threshold α, assign the i-th sample to the t-th cluster, increase the sample count of cluster t by 1, and recompute the centroid of the t-th cluster; otherwise, create a new cluster, increase the total number of clusters K by 1, set the sample count of the new cluster to 1, and set its centroid to sample i;

Step S104: obtain the samples in each cluster, the cluster centroids, and the total number of clusters.
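
Steps S101 to S104 can be sketched in Python as a single pass over the samples. This is an illustrative sketch, not the patented implementation: the function name `crp_style_cluster` is hypothetical, cosine similarity is assumed as the similarity measure (the source does not name one for this step), and the first sample is used as the initial centroid, which is the simplification the description allows.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def crp_style_cluster(
    samples: Sequence[Sequence[float]], alpha: float
) -> Tuple[List[List[int]], List[List[float]]]:
    """Threshold clustering following steps S101-S104.

    Returns (members, centroids): members[t] lists the sample indices in
    cluster t, centroids[t] is the mean of those samples.
    """
    centroids = [list(samples[0])]  # S102: first sample seeds the first cluster
    members = [[0]]
    for i in range(1, len(samples)):
        sample = samples[i]
        sims = [cosine(sample, c) for c in centroids]  # S103: compare with every centroid
        t = max(range(len(sims)), key=sims.__getitem__)
        if sims[t] > alpha:
            members[t].append(i)
            n = len(members[t])
            # Update the running mean of cluster t with the new sample.
            centroids[t] = [(c * (n - 1) + x) / n for c, x in zip(centroids[t], sample)]
        else:
            centroids.append(list(sample))  # open a new cluster, K increases by 1
            members.append([i])
    return members, centroids  # S104


demo_members, demo_centroids = crp_style_cluster(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], alpha=0.8
)
```

With these toy samples the first two vectors fall into one cluster and the last two into another, so the number of clusters emerges from the threshold α rather than being fixed in advance.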

Training on the labeled text corpus in step S1 to obtain the multi-prototype vector representation of the polysemous word comprises the following steps:

Step S201: label all samples of the target polysemous word in the text corpus according to the cluster to which each belongs, where different clusters represent different senses of the target word;

Step S202: run neural-network-language-model-based word vector training on the cluster-labeled corpus to obtain the multi-prototype vector representation, in which the word has a vector for each specific sense it expresses in different contexts.
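
The labeling in step S201 can be pictured as rewriting each occurrence of the target word into a sense-specific token before training. The sketch below is an assumption-laden illustration: the function name `label_senses`, the `(sentence index, token index)` keys, and the `word_k` token format are all hypothetical. The training of step S202 is not shown; any CBOW-style trainer run on the labeled sentences would then learn one vector per labeled token.

```python
from typing import Dict, List, Sequence, Tuple


def label_senses(
    sentences: Sequence[Sequence[str]],
    target: str,
    assignments: Dict[Tuple[int, int], int],
) -> List[List[str]]:
    """Replace each clustered occurrence of `target` by a sense-labeled token.

    `assignments` maps (sentence index, token index) to the cluster id that
    the occurrence's context window was assigned to.
    """
    labeled = []
    for si, tokens in enumerate(sentences):
        row = []
        for ti, token in enumerate(tokens):
            if token == target and (si, ti) in assignments:
                row.append(f"{token}_{assignments[(si, ti)]}")
            else:
                row.append(token)
        labeled.append(row)
    return labeled


labeled_corpus = label_senses(
    [["apple", "phone"], ["eat", "apple"]],
    "apple",
    {(0, 0): 1, (1, 1): 2},
)
```

After labeling, "apple_1" and "apple_2" are distinct vocabulary items, so an ordinary word vector trainer yields one vector per sense.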

The word sense disambiguation of the polysemous word in step S2 comprises the following steps:

Step S301: preprocess the target short text to obtain its word sequence, and identify the polysemous words in the word sequence according to the multi-prototype word vector representation;

Step S302: disambiguate each polysemous word by computing the similarity between its context-window representation in the short-text word sequence and each cluster centroid associated with that word in the text corpus, and taking the word vector of the cluster with the highest similarity as the word vector of the specific sense the polysemous word expresses in its context.

The preprocessing of the target short text into a word sequence in step S2 comprises: removing stop words; converting traditional Chinese characters to simplified Chinese characters; replacing English abbreviations in the target short text with Chinese words using a Chinese-English abbreviation dictionary; segmenting the short text into words; and replacing all characters other than Chinese characters and digits with a special symbol.

Beneficial effects of the present invention: the embodiments of the present invention provide a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. A CRP-based clustering algorithm clusters the context-window representations of the target word in the text corpus, and the word vectors of polysemous words are trained on the cluster-labeled corpus, which improves the accuracy of polysemous word vectors and addresses the representation of polysemy in word representations. For a polysemous word in a sentence, the method uses the word's multi-prototype vectors: it computes the similarity between the word's context-window representation and the centroids of that word's clusters in the training samples, and takes the word vector of the cluster with the highest similarity as the word vector of the word's specific sense in context, eliminating the ambiguity of the polysemous word.

The multi-prototype word vector representation method proposed by the present invention uses a CRP-based clustering algorithm to cluster the context representations of all samples of a target polysemous word, with each resulting cluster representing one sense of the target word, and trains on the cluster-labeled corpus to obtain the multi-prototype word vectors. These vectors represent the different senses of a polysemous word separately, addressing the representation of polysemy.

The present invention clusters the context-window representations of all samples of a target polysemous word with a CRP-based clustering algorithm. CRP clustering does not require the number of clusters to be specified in advance, and the number of clusters obtained effectively reflects the number of senses of the polysemous word, which handles the practical problem that different polysemous words have different numbers of senses. Using the similarity of context-window representations as the criterion for whether samples belong to the same cluster keeps the computation simple.

The word sense disambiguation method based on multi-prototype word vectors proposed by the present invention can identify the polysemous words in a sentence and obtain the word vector of the specific sense of each in its context, eliminating the ambiguity of polysemous words across contexts. It computes the similarity between the context-window representation of the target polysemous word and each cluster centroid associated with that word in the text corpus, and takes the word vector of the cluster with the highest similarity as the word vector of the word's specific sense in context, thereby disambiguating the polysemous word.

Brief description of the drawings

Fig. 1 is the overall flowchart of the multi-prototype word vector representation and word sense disambiguation based on CRP clustering;

Fig. 2 is the flowchart of CRP-based clustering of the context-window representations of a polysemous word;

Fig. 3 shows the training process of the multi-prototype word vector representation based on CRP clustering;

Fig. 4 is the flowchart of word sense disambiguation based on the multi-prototype word vector representation;

Fig. 5 shows the noun word sense disambiguation results based on the multi-prototype word vector representation;

Fig. 6 shows the verb word sense disambiguation results based on the multi-prototype word vector representation.

Detailed description of the embodiments

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the protection scope of the present invention is not limited by the specific embodiments.

The invention discloses a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. As shown in Fig. 1, the basic idea is to construct multi-prototype vector representations of words on the basis of CRP clustering of word context representations, identify the polysemous words in a sentence or short text, and eliminate their ambiguity by obtaining the word vector of the specific sense of each word in its context. The multi-prototype representation expresses the different meanings of a word in context more precisely. The specific steps of the present invention are as follows:

In step S1, the texts in the large-scale text corpus are cleaned to obtain plain text. For a public or collected text corpus: texts whose character count is below a preset threshold are deleted; traditional Chinese characters in the corpus are converted to simplified Chinese characters; English abbreviations in the corpus are replaced with Chinese words using a custom dictionary; the text is then segmented with a word segmentation system; all characters other than Chinese characters and digits are removed by regular-expression matching; stop words are removed and word frequencies are counted; the frequency of high-frequency words is capped at a preset upper threshold; and finally a vocabulary is built from the words whose frequency in the corpus exceeds a preset lower threshold.
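
The cleaning pipeline described above can be sketched with the Python standard library. This is a hedged illustration: the function names `purify` and `build_vocabulary` are hypothetical, the `segment` parameter stands in for a real word segmentation system (whitespace splitting is used only so the demo runs), traditional-to-simplified conversion is omitted because it needs an external tool, and the Unicode range `\u4e00-\u9fff` is an assumed definition of "Chinese character".

```python
import re
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set

# Anything that is neither a CJK unified ideograph nor an ASCII digit.
NON_CJK_DIGIT = re.compile(r"[^\u4e00-\u9fff0-9]")


def purify(
    texts: Iterable[str],
    min_chars: int,
    abbreviations: Dict[str, str],
    stop_words: Set[str],
    segment: Callable[[str], List[str]] = str.split,
) -> List[List[str]]:
    """Drop short texts, expand abbreviations, segment, strip symbols, remove stop words."""
    cleaned = []
    for text in texts:
        if len(text) < min_chars:
            continue
        for abbr, full in abbreviations.items():
            text = text.replace(abbr, full)
        tokens = []
        for token in segment(text):
            token = NON_CJK_DIGIT.sub("", token)
            if token and token not in stop_words:
                tokens.append(token)
        cleaned.append(tokens)
    return cleaned


def build_vocabulary(corpus: Iterable[Iterable[str]], lower: int, upper: int) -> Dict[str, int]:
    """Keep words seen more than `lower` times; cap their counts at `upper`."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return {word: min(count, upper) for word, count in counts.items() if count > lower}


demo_corpus = purify(
    ["我 喜欢 吃 苹果 !", "苹果 发布 了 新 手机", "好"],
    min_chars=3,
    abbreviations={"IT": "信息技术"},
    stop_words={"了"},
)
demo_vocab = build_vocabulary(demo_corpus, lower=1, upper=100)
```

In the demo, the one-character text is dropped, the punctuation token and the stop word are removed, and only "苹果", which occurs twice, passes the lower frequency threshold.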

In step S1, the context-window representations of all samples of the polysemous word in the text corpus are obtained. The window size is set to a positive integer, and each context-window representation is computed as a weighted combination of the word vectors of the words in the window.

In step S1, as shown in Fig. 2, the context-window representations of all samples of the target polysemous word are clustered with the CRP-based algorithm, specifically:

1. Perform initial clustering of the context-window representations of the polysemous word with the k-means algorithm to obtain the clusters and their centroids;

2. Take the centroid of the cluster containing the most samples as the initial cluster centroid of the CRP clustering algorithm;

3. For the context-window representation of every sample of the polysemous word, compute the similarity between the sample and each cluster centroid, and obtain the maximum similarity Smax, attained between the i-th sample and the t-th cluster centroid;

4. If Smax is greater than the preset threshold α, assign the i-th sample to the t-th cluster, increase the sample count of cluster t by 1, and recompute the centroid of the t-th cluster. Otherwise, create a new cluster, increase the total number of clusters K by 1, set the sample count of the new cluster to 1, and set its centroid to sample i;

5. Obtain the samples in each cluster, the cluster centroids, and the total number of clusters.

Steps 1 and 2 may be simplified by taking the first sample, or a random sample, as the initial cluster centroid for CRP clustering.

In step S1, as shown in Fig. 3, the multi-prototype vector representations of words are obtained by training, specifically:

1. Obtain all context-window representations of the target polysemous word in the text corpus;

2. Cluster the context-window representations of the polysemous word with the CRP-based algorithm to obtain the clusters of the word's context representations;

3. For the target polysemous word, locate each occurrence in the original text corpus from the target word and its context, and label it in the target text corpus with the category of the cluster to which the sample belongs; different clusters represent different meanings of the target word;

4. For each polysemous word, perform steps 1, 2, and 3 so that the cluster category labels are written into the target text corpus;

5. Train on the labeled text corpus with the CBOW model to obtain the multi-prototype vector representations of the words.

In step S2, as shown in Fig. 4, polysemous word identification and word sense disambiguation based on the multi-prototype word vector representation proceed as follows:

1. Preprocess the target short text, specifically: remove stop words; convert traditional Chinese characters to simplified Chinese characters; replace English abbreviations in the target sentence with Chinese words using a Chinese-English abbreviation dictionary; segment the short text into words; and replace all characters other than Chinese characters and digits with a special symbol, obtaining the word sequence of the short text.

2. Identify the polysemous words in the sentence. Polysemous words in the word sequence are identified from the multi-prototype word vector representation: a polysemous word has two or more word vectors.

3. Compute the context-window representation of the polysemous word. The context-window representation is the weighted average of the word vectors of the context words. For a polysemous word that itself appears among the context words, the word vector of the cluster in which that word occurs most frequently in the text corpus is used in the computation; an unrecognized word is represented by the average of the word vectors of the words in the context window.

4. Disambiguate the polysemous words in order of their number of senses, from fewest to most.

5. Compute the similarity between the context-window representation of the polysemous word in the short-text sequence and the cluster centroids of the training samples, and take the word vector of the cluster with the highest similarity as the word vector of the polysemous word.

Following the idea that the meaning of a word is determined by its context, the specific sense of a polysemous word in context is obtained by computing the similarity between the word's context-window representation and the text-corpus cluster centroid corresponding to each of the word's vectors; the word vector with the highest similarity is taken as the word vector of the word's specific sense in that context. The computation is:

$$vec(w) = \{\, vec_k(w) \mid k,\ Sim(veC, vec_k(w)) = \max_j Sim(veC, vec_j(w)) \,\} \qquad (2)$$

where $vec(w)$ is the word vector of the specific sense of the polysemous word $w$ in the context window, $vec_j(w)$ is the vector of the text-corpus cluster centroid corresponding to the $j$-th sense of $w$, and $\max_j Sim(veC, vec_j(w))$ is the maximum similarity between the context-window representation $veC$ of $w$ and each $vec_j(w)$. The $k$-th word vector, which attains this maximum, is taken as the word vector of the specific sense of $w$.
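
The selection rule of formula (2), together with the few-senses-first ordering of the disambiguation steps, can be sketched as follows. The names `disambiguate` and `disambiguation_order`, the dictionary layout keyed by sense id, and the use of cosine similarity for Sim are assumptions for illustration; all vectors are toy values.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def disambiguate(
    context_vec: Sequence[float],
    centroids: Dict[int, List[float]],
    sense_vectors: Dict[int, List[float]],
) -> Tuple[int, List[float]]:
    """Pick the sense k whose cluster centroid is most similar to the context vector."""
    best = max(centroids, key=lambda k: cosine(context_vec, centroids[k]))
    return best, sense_vectors[best]


def disambiguation_order(words: Sequence[str], centroids_by_word: Dict[str, Dict]) -> List[str]:
    """Order polysemous words from fewest senses to most."""
    return sorted(words, key=lambda w: len(centroids_by_word[w]))


# Toy data: sense 1 and sense 2 of one word.
toy_centroids = {1: [1.0, 0.0], 2: [0.0, 1.0]}
toy_sense_vectors = {1: [0.2, 0.1], 2: [0.7, 0.9]}
chosen_sense, chosen_vector = disambiguate([0.1, 0.9], toy_centroids, toy_sense_vectors)
order = disambiguation_order(
    ["out", "apple"],
    {"out": {1: [], 2: [], 3: []}, "apple": {1: [], 2: []}},
)
```

The context vector lies closest to the centroid of sense 2, so the vector trained for sense 2 is returned; "apple", with two senses, is handled before "out", with three.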

Explanation of terms: CRP is the abbreviation of Chinese Restaurant Process. It is a typical Dirichlet process mixture model whose advantage is that the number of mixture components need not be specified manually, which makes it suitable for clustering problems in natural language processing.

The multi-prototype word vector representation covers the multiple senses a polysemous word takes in different contexts. In Table 1, an unlabeled word corresponds to the ordinary single word vector; for example, "apple" is the word without its senses distinguished. A word labeled with a specific sense corresponds to a multi-prototype vector; for example, "apple 2" denotes the 2nd sense of the word "apple", the agricultural product. The word vector of "apple 1" corresponds to its sense as an IT company, while "apple 2" represents its sense as a kind of fruit. The multi-prototype word vector representation can capture the semantic information that distinguishes these senses.

Table 1. Nearest words to each word or word sense under the CRP-based method

Embodiment of the word sense disambiguation method based on the multi-prototype word vector representation.

The polysemous-word disambiguation test data set is the Chinese corpus of SemEval-2007 Task 5. The test set contains 40 polysemous words, divided into verbs and nouns, each with at least two senses. The number of senses varies across the polysemous words in the test set; most have 2 to 4 senses, and the word with the most senses, "out", has 9. For example, the word "Chinese medicine" has two senses, "practitioner of Chinese medicine" and "traditional Chinese medical science", and each sense has a different number of text instances.

In the word sense disambiguation test, for each polysemous word given in the test set, the method based on multi-prototype word vectors extracts the polysemous word and its context representation from each text instance, computes the similarity between that context representation and the text-corpus cluster centroid corresponding to each of the word's multi-prototype vectors, and obtains the multi-prototype word vector for the polysemous word together with its cluster category. The sense category expressed by that vector is compared with the gold-standard label of the test set to judge whether the disambiguation result is correct.

The noun word sense disambiguation results based on the multi-prototype word vector representation are shown in Fig. 5. The verb word sense disambiguation results based on the multi-prototype word vector representation are shown in Fig. 6.

In information retrieval, the multi-prototype word vector representation and word sense disambiguation method can identify the polysemous words in a retrieval object and improve the accuracy with which a word is represented in its specific contextual sense, making the computation more reasonable and the retrieval results more accurate.

In information retrieval applications, in order to recall results that are more similar to the query word sequence or keywords, similarity (sentence similarity or word similarity) can be used to identify similar word sequences or keywords. Word similarity can be measured by the cosine of the angle between the word vectors of two words.

For example, the multi-prototype representation of the polysemous word "do accounts" consists of "do accounts 1" and "do accounts 2". The sense of "do accounts 1" is to reckon accounts or calculate profit and loss; the sense of "do accounts 2" is to settle scores later, that is, to get even with someone after suffering a loss or failure. The similarities of "do accounts 1" to the words "clearing" and "revenge" are 0.66 and 0.11 respectively, and those of "do accounts 2" are 0.14 and 0.72. The similarity between "do accounts 1" and "do accounts 2" is 0.25, showing that the different senses of "do accounts" differ considerably.
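
The cosine measure described above can be sketched as follows. The three-dimensional vectors are invented for the example and are not the trained vectors behind the reported values 0.66, 0.11, 0.14, and 0.72; the token names and the helper `closest_sense` are likewise hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up vectors, for illustration only.
toy = {
    "do_accounts_1": [0.9, 0.1, 0.0],
    "do_accounts_2": [0.1, 0.9, 0.1],
    "clearing": [1.0, 0.2, 0.0],
    "revenge": [0.0, 1.0, 0.2],
}


def closest_sense(word: str, senses: Sequence[str], vectors: Dict[str, List[float]]) -> str:
    """Return the sense token whose vector has the highest cosine with `word`."""
    return max(senses, key=lambda s: cosine(vectors[word], vectors[s]))


senses = ["do_accounts_1", "do_accounts_2"]
nearest_to_clearing = closest_sense("clearing", senses, toy)
nearest_to_revenge = closest_sense("revenge", senses, toy)
```

With these toy vectors, "clearing" is nearest to the first sense and "revenge" to the second, mirroring the qualitative pattern in the example above.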

In information retrieval, when the retrieval object is a sentence, sentence similarity can be used to measure the similarity between the retrieval object and the retrieval target. The retrieval target is preprocessed to obtain its word sequence, whose number of words is denoted m; the polysemous words in the word sequence are identified, and the word vectors of all words in the sequence are obtained and denoted as the set D. The sentence of the retrieval object is preprocessed in the same way to obtain its word sequence, whose number of words is denoted n; the polysemous words in the sequence are identified, and the word vectors of all words are obtained and denoted as the set S.

The similarity sim(D_i, S_j) between each word in set D and each word in set S is computed, and the m most similar word pairs are extracted. The similarity Sim(D, S) between the retrieval object S and the retrieval target D is then obtained from the sentence similarity formula (formula 3):

where the summation term in the formula denotes the sum of the similarities of the m most similar word pairs, m is the number of words in set D, and n is the number of words in set S.
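
Only the summation term of the sentence similarity is sketched here; how the sum is combined with m and n into Sim(D, S) is left out. The sketch assumes one reading of "the m most similar word pairs", namely that each of the m words in D is paired with its best match in S. The function name `best_pair_sum` and the toy similarity matrix are hypothetical.

```python
from typing import List, Sequence


def best_pair_sum(sim_matrix: Sequence[Sequence[float]]) -> float:
    """Sum, over each word D_i, of its highest similarity to any word S_j.

    sim_matrix[i][j] holds sim(D_i, S_j); rows index D, columns index S.
    """
    return sum(max(row) for row in sim_matrix if row)


# Toy similarities between a 2-word target and a 2-word object.
toy_matrix: List[List[float]] = [[0.9, 0.1], [0.2, 0.4]]
pair_sum = best_pair_sum(toy_matrix)
```

For the toy matrix the best matches are 0.9 and 0.4, so the summation term is their sum.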

For example, searched targets include ambiguity word " doing accounts ", searched targets are sentence { him is looked for do accounts }, and retrieval object is sentence Sub 1 { telling a lie, other can look for you to do accounts }, sentence 2 { allowing them that cognition is gone to endanger in doing accounts }, after being pre-processed respectively To sequence of terms set D={ looking for him to do accounts }, S1={ telling a lie, other can look for you to do accounts }, S2={ allows them in doing accounts Cognition harm }.It identifies the ambiguity word in sequence of terms set D, S1, S2 and obtains the term vector expression of each word.Retrieval The similarity of each word is as shown in table 2 between target D and retrieval object S1, S2.

Table 2: Similarities between each word of the retrieval target and the retrieval objects

From formula 3, Sim(D, S1) = 0.62 and Sim(D, S2) = 0.39. The retrieval target D therefore matches sentence S1 better; this agrees more closely with the actual context, so the retrieval result is more accurate.

The above discloses only several specific embodiments of the invention. The embodiments of the invention are not limited thereto, and any variation that a person skilled in the art can conceive shall fall within the protection scope of the invention.

Claims

1. A multi-prototype word vector representation and word sense disambiguation method based on CRP clustering, characterized by comprising the following steps:

Step S1: performing purification preprocessing on the text in a massive text corpus to obtain plain text; clustering, based on the CRP algorithm, the context window representations of the target polysemous word in the text corpus; labeling the target polysemous word in the text corpus according to its cluster category; and training on the labeled text corpus to obtain the multi-prototype vector representation of the polysemous word;

Step S2: preprocessing a target short text to obtain the word sequence of the short text; identifying the target polysemous word in the word sequence; calculating the similarity between the context window representation of the target polysemous word and each cluster centroid corresponding to that word in the text corpus; and taking the word vector representation corresponding to the cluster category with the maximum similarity as the word vector representation of the specific sense of the polysemous word in its context, thereby performing word sense disambiguation on the polysemous word.

2. The multi-prototype word vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that performing purification preprocessing on the text in the massive text corpus to obtain plain text in step S1 comprises: deleting texts whose number of words is less than a preset threshold; uniformly converting traditional Chinese characters into simplified Chinese characters; replacing the English abbreviations in the text corpus with Chinese words by means of a Chinese-English abbreviation dictionary; segmenting the text in the text corpus into words; removing stop words; deleting characters other than Chinese characters and digits; counting word frequencies; setting the word frequency of high-frequency words to a preset upper threshold; selecting the words whose number of occurrences in the text corpus is greater than a preset lower threshold to build a word list; and building a polysemous word list based on a polysemy dictionary.
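The frequency-related steps of this preprocessing can be sketched as follows. This is a minimal sketch, assuming the texts have already been converted to simplified characters and segmented into word lists; the function name `build_vocabulary` and its parameter names are hypothetical, and the length filter counts segmented words, which is one possible reading of the threshold.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def build_vocabulary(
    docs: Iterable[List[str]],
    stop_words: Set[str],
    min_tokens: int,
    min_count: int,
    max_count: int,
) -> Tuple[List[List[str]], Dict[str, int]]:
    """Filter short texts, drop stop words, and build a capped word list."""
    # Delete texts shorter than the preset threshold, then remove stop words.
    kept = [
        [w for w in doc if w not in stop_words]
        for doc in docs
        if len(doc) >= min_tokens
    ]
    # Count word frequencies over the remaining texts.
    freq = Counter(w for doc in kept for w in doc)
    # Cap the frequency of very frequent words at the upper threshold.
    capped = {w: min(c, max_count) for w, c in freq.items()}
    # Keep only words occurring more often than the lower threshold.
    vocab = {w: c for w, c in capped.items() if c > min_count}
    return kept, vocab


toy_docs = [["a", "b", "a", "the"], ["a"], ["b", "c", "a", "the"]]
kept_docs, vocab = build_vocabulary(
    toy_docs, stop_words={"the"}, min_tokens=2, min_count=1, max_count=2
)
print(kept_docs, vocab)
```

Script conversion, abbreviation replacement and word segmentation depend on external dictionaries and tools, so they are left out of the sketch.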

3. The multi-prototype word vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that the context window representation of the target polysemous word in step S1 is obtained by averaging the word vectors of the words in the context of the word, calculated as:

$$veC = \frac{1}{|Context|} \sum_{w_i \in Context} vec(w_i)$$

where veC is the context window representation of the word, w_i is the i-th word in the context window word set Context of the word, and vec(w_i) is the initial word vector of the word w_i.
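The averaging described in this claim can be illustrated with a short sketch. The function name `context_window_vector`, the toy vectors, and the choice to skip context words that have no initial vector are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def context_window_vector(
    context_words: Sequence[str], vectors: Dict[str, List[float]]
) -> List[float]:
    """Return veC, the element-wise mean of the context words' vectors."""
    # Assumption: context words without an initial vector are skipped.
    known = [vectors[w] for w in context_words if w in vectors]
    if not known:
        raise ValueError("no context word has an initial vector")
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


# Hypothetical two-dimensional initial word vectors.
toy_vectors = {"look": [1.0, 0.0], "him": [0.0, 1.0]}
vec_c = context_window_vector(["look", "him"], toy_vectors)
print(vec_c)  # [0.5, 0.5]
```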

4. The multi-prototype word vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that clustering, based on the CRP algorithm, the context window representations of the target polysemous word in the text corpus in step S1 comprises the following steps:

Step S102: obtaining the initial cluster centroid of the CRP clustering algorithm, either by taking a random sample as the initial cluster centroid of the CRP clustering, or by performing an initial clustering of the context window representations of the polysemous word with the k-means algorithm and taking the centroid of the cluster containing the largest number of samples as the initial cluster centroid;

Step S103: for the context window representations of all samples of the polysemous word and for all clusters, calculating the similarity between each sample and each cluster centroid to obtain the maximum similarity Smax of the i-th sample, attained at the t-th cluster centroid; if Smax is greater than a preset threshold α, assigning the i-th sample to the t-th cluster, increasing the number of samples in cluster t by 1, and recalculating the centroid of the t-th cluster; otherwise, generating a new cluster, increasing the total number of clusters K by 1, setting the number of samples in the new cluster to 1, and taking sample i as the centroid of the new cluster;
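The assignment rule of steps S102 and S103 can be sketched as a single pass over the samples. This is a minimal sketch under stated assumptions: cosine similarity is used as the similarity measure, the first sample serves as the initial centroid (the k-means seeding alternative is not shown), and the centroid is updated as a running mean. The names `cosine` and `crp_threshold_cluster` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def crp_threshold_cluster(
    samples: List[List[float]], alpha: float
) -> Tuple[List[int], List[List[float]]]:
    """Assign each sample to its most similar cluster or open a new one."""
    if not samples:
        return [], []
    # Assumed S102 choice: the first sample is the initial centroid.
    centroids = [list(samples[0])]
    counts = [1]
    labels = [0]
    for x in samples[1:]:
        sims = [cosine(x, c) for c in centroids]
        t = max(range(len(sims)), key=sims.__getitem__)
        if sims[t] > alpha:
            # Smax exceeds alpha: join cluster t and update its mean.
            counts[t] += 1
            centroids[t] = [
                c + (xi - c) / counts[t] for c, xi in zip(centroids[t], x)
            ]
            labels.append(t)
        else:
            # Otherwise open a new cluster whose centroid is the sample.
            centroids.append(list(x))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids


toy_samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = crp_threshold_cluster(toy_samples, alpha=0.8)
print(labels, len(centroids))
```

With the threshold rule, the number of clusters K is not fixed in advance; a lower α merges more contexts into existing clusters, while a higher α yields more, finer-grained senses.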

5. The multi-prototype word vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that training on the labeled text corpus to obtain the multi-prototype vector representation of the polysemous word in step S1 comprises the following steps:

Step S201: labeling all samples of the target polysemous word in the text corpus according to the cluster to which each belongs, different clusters representing different senses of the target word;

Step S202: executing, on the labeled clusters, the training process of word vector representation based on a neural network language model, to obtain the multi-prototype vector representation with which the word expresses its specific sense in different contexts.

6. The multi-prototype word vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that performing word sense disambiguation on the polysemous word in step S2 comprises the following steps:

Step S301: preprocessing the target short text to obtain the word sequence of the short text, and identifying the polysemous words in the word sequence according to the multi-prototype vector representations of words;

Step S302: performing word sense disambiguation on the polysemous word by calculating the similarity between the context window representation of the word in the short text word sequence and each cluster centroid corresponding to the word in the text corpus, and extracting the word vector representation corresponding to the cluster category with the maximum similarity as the word vector representation of the specific sense that the polysemous word expresses in its context.
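The selection in step S302 amounts to a nearest-centroid lookup, sketched below. Cosine similarity, the function name `disambiguate`, the sense labels, and all vector values are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def disambiguate(
    context_vec: Sequence[float],
    sense_centroids: Dict[str, List[float]],
    sense_vectors: Dict[str, List[float]],
) -> Tuple[str, List[float]]:
    """Pick the sense whose cluster centroid best matches the context."""
    best = max(
        sense_centroids,
        key=lambda s: cosine(context_vec, sense_centroids[s]),
    )
    return best, sense_vectors[best]


# Hypothetical centroids and trained sense vectors for one polysemous word.
toy_centroids = {"do_accounts_1": [1.0, 0.0], "do_accounts_2": [0.0, 1.0]}
toy_sense_vectors = {
    "do_accounts_1": [0.3, 0.7, 0.1],
    "do_accounts_2": [0.8, 0.1, 0.4],
}
sense, vector = disambiguate([0.2, 0.9], toy_centroids, toy_sense_vectors)
print(sense, vector)
```

The cluster centroids live in the space of context window representations, while the returned sense vector comes from the separately trained multi-prototype vectors, so the two need not share a dimension.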

7. The multi-prototype word vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that preprocessing the target short text to obtain the word sequence of the short text in step S2 comprises: removing stop words; converting traditional Chinese characters into simplified Chinese characters; replacing the English abbreviations in the target short text with Chinese words by means of a Chinese-English abbreviation dictionary; segmenting the short text into words; and replacing characters other than Chinese characters and digits with a special character.