CN109033307A - Multi-prototype word vector representation and word sense disambiguation method based on CRP clustering - Google Patents

Publication number
CN109033307A
Authority
CN
China
Prior art keywords
word
text
cluster
representation
ambiguity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810783010.5A
Other languages
Chinese (zh)
Other versions
CN109033307B (en)
Inventor
李国佳
郭鸿奇
杨喜亮
王国卿
杨振中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Water Resources and Electric Power
Original Assignee
North China University of Water Resources and Electric Power
Application filed by North China University of Water Resources and Electric Power
Priority to CN201810783010.5A
Publication of CN109033307A
Application granted
Publication of CN109033307B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. Step 1: clean and preprocess the texts in a large-scale text corpus to obtain plain text; cluster the context window representations of a target ambiguous word in the corpus with a CRP-based algorithm; label the occurrences of the target ambiguous word in the corpus with their cluster categories; and train on the labeled corpus to obtain the multi-prototype vector representation of the ambiguous word. Step 2: preprocess a target short text to obtain its word sequence; identify the target ambiguous words in the word sequence; compute the similarity between the context window representation of each target ambiguous word and the centroid of each cluster that corresponds to that word in the corpus; and take the word vector of the cluster with the maximum similarity as the word vector of the specific sense of the ambiguous word in its context, thereby disambiguating the polysemous word. The invention addresses both the representation of polysemy in word representations and the identification of ambiguity in word sense representations.

Description

Multi-prototype word vector representation and word sense disambiguation method based on CRP clustering
Technical field
The present invention relates to the field of natural language processing, and in particular to a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering.
Background art
Among the many tasks of natural language processing, a basic problem is how to encode linguistic symbols in a form that machines can process. Mapping linguistic symbols such as words, sentences and texts to continuous low-dimensional vectors yields semantic vector representations of words, sentences and texts, which are widely used in tasks such as information retrieval, short text classification, named entity recognition, sentiment analysis, recommendation engines and automatic text summarization.
The word is the most basic unit of language, and vector representations of words are widely used in natural language processing tasks. A simple word vector representation is the one-hot representation. Its drawbacks are that the vector dimension equals the size of the vocabulary, which leads to the curse of dimensionality; that it cannot capture semantic relations between words; and that it cannot reflect the different meanings of an ambiguous word.
The word vector representation (word embedding or word representation) is a fixed-length, low-dimensional real-valued vector learned by training on large amounts of text. Each word obtains a unique vector, and similar or related words lie closer to each other in the vector space. However, because of ambiguous words, the same word form may carry different meanings in different contexts. Most traditional word vector representations assign a single vector to each word and therefore cannot effectively express the different senses of an ambiguous word. Each sense of an ambiguous word should correspond to its own vector representation.
A multi-prototype word vector representation assigns one word vector to each sense of an ambiguous word, which improves the accuracy of word representation. Vectors for the different senses of a word are usually obtained with clustering-based models: the senses are induced by clustering the contexts of the word, either directly on the contexts in the original text or after a semantic mapping that uses cross-lingual knowledge, and training then yields the word vector for the specific sense of the word in each context.
In methods that obtain word vectors for polysemous words by training with the k-means clustering algorithm and a neural network language model, the parameter k (the number of clusters) has to be chosen separately according to the number of senses of each polysemous word. By contrast, training a multi-prototype word vector representation based on CRP clustering does not require the number of clusters to be specified in advance, which matches the fact that different ambiguous words have different numbers of senses in context.
High-quality word sense representations capture rich semantic and syntactic information and help word sense disambiguation; high-quality word sense disambiguation in turn helps to learn better sense representations. There are two main classes of word sense disambiguation methods: methods based on external knowledge bases and methods based on corpora. Knowledge-base methods use the semantic explanations or descriptions of a word in an external knowledge base (WordNet or HowNet) to identify the specific meaning of an ambiguous word, but building such a knowledge base or dictionary requires a great deal of manpower and material resources. Corpus-based methods use a corpus as the knowledge resource and determine the specific sense of a word in a given context through automatic or semi-automatic learning, thereby achieving word sense disambiguation.
For an ambiguous word in a sentence, the word sense disambiguation method given here uses a text corpus and the acquired multi-prototype word vector representation to obtain the specific sense of the word in its context, which helps to improve the efficiency of word and sentence representation.
As Internet technology and mobile applications have become part of daily life, people increasingly use mobile terminals to transmit information and communicate, producing massive amounts of data such as news headlines, microblog posts, descriptions of goods or services on shopping platforms, forum comments, intelligent interactive applications and social chat messages. These data usually consist of short pieces of text, a typical short-text form. Such short-text data contains a large amount of valuable information and has high research value. Using machines to process and understand the massive short-text data on the Internet effectively has become an important research challenge and hot topic in natural language processing and machine learning.
In similarity calculation for information retrieval, the multi-prototype word vector representation and word sense disambiguation method can distinguish the specific senses of ambiguous words in the retrieved objects and improve the accuracy of word representation and calculation. It provides an effective word semantic representation and word sense disambiguation method for short-text retrieval, and technical support for semantic computation.
Summary of the invention
The purpose of the present invention is to overcome the above problems of the prior art by providing a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. The multi-prototype word vector representation assigns one word vector to each sense of an ambiguous word, which solves the problem of representing polysemy in word representations; the word sense disambiguation method based on this representation solves the problem of identifying ambiguity in word sense representations.
The technical solution of the present invention is a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering, comprising the following steps:
Step S1: clean and preprocess the texts in a large-scale text corpus to obtain plain text; cluster the context window representations of the target ambiguous word in the text corpus with a CRP-based algorithm; label the occurrences of the target ambiguous word in the corpus according to their cluster categories; and train on the labeled corpus to obtain the multi-prototype vector representation of the ambiguous word.
Step S2: preprocess the target short text to obtain its word sequence; identify the target ambiguous words in the word sequence; compute the similarity between the context window representation of each target ambiguous word and the centroid of each cluster corresponding to that word in the text corpus; and take the word vector of the cluster with the maximum similarity as the word vector of the specific sense of the ambiguous word in its context, thereby disambiguating the polysemous word.
The cleaning and preprocessing of the texts in the large-scale text corpus in step S1 comprises: deleting texts whose character count is below a preset threshold; converting traditional Chinese characters to simplified Chinese characters; replacing the Chinese and English abbreviations in the corpus with Chinese words, using a Chinese-English abbreviation dictionary; segmenting the texts in the corpus into words; removing stop words; deleting all characters other than Chinese characters and digits; counting word frequencies; capping the frequency of high-frequency words at a preset upper threshold; building a word list from the words whose frequency in the corpus exceeds a preset lower threshold; and building a list of polysemous words based on a polysemy dictionary.
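The cleaning steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the stop-word set, the whitespace tokenizer standing in for a real Chinese word segmenter, and the names `clean_corpus`, `min_chars`, `lower_threshold` and `upper_threshold` are all hypothetical, and the traditional-to-simplified conversion and abbreviation replacement are omitted because they need external dictionaries.

```python
import re
from collections import Counter

# Hypothetical stand-ins for illustration only.
STOP_WORDS = {"的", "了"}
# Matches anything except Chinese characters, digits and whitespace.
OTHER_CHARS = re.compile(r"[^\u4e00-\u9fff0-9\s]")

def clean_corpus(texts, tokenize=str.split, min_chars=3,
                 lower_threshold=1, upper_threshold=1000):
    """Return (tokenized texts, word list) after the cleaning steps."""
    docs = []
    for text in texts:
        # Delete texts whose character count is below the threshold.
        if len("".join(text.split())) < min_chars:
            continue
        # Delete characters other than Chinese characters and digits.
        text = OTHER_CHARS.sub("", text)
        # Segment (placeholder tokenizer) and remove stop words.
        docs.append([t for t in tokenize(text) if t not in STOP_WORDS])
    # Count word frequencies and cap them at the upper threshold.
    freq = Counter(t for tokens in docs for t in tokens)
    capped = {w: min(c, upper_threshold) for w, c in freq.items()}
    # Keep words whose frequency exceeds the lower threshold.
    vocab = {w for w, c in capped.items() if c > lower_threshold}
    return docs, vocab
```

In practice the `tokenize` argument would be replaced by a proper word segmentation system, as the description requires.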
The context window representation of the target ambiguous word in step S1 is obtained by averaging the word vectors of the words in the context of the word. The calculation formula is:
$$\mathrm{veC} = \frac{1}{|\mathrm{Context}|} \sum_{w_i \in \mathrm{Context}} \mathrm{vec}(w_i) \qquad (1)$$
where veC is the context window representation of the word, w_i is the i-th word in the set Context of words in the context window of the word, and vec(w_i) is the initial word vector of the word w_i.
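The averaging described above can be illustrated as follows. This is a sketch under assumptions: the function name `context_vector`, the dictionary `vectors` of initial word vectors and the symmetric window of `window` words on each side are illustrative choices, and context words without a vector are simply skipped.

```python
def context_vector(tokens, index, vectors, window=2):
    """Average the initial vectors of the words around tokens[index]."""
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    # Collect the vectors of the context words, excluding the target itself.
    context = [vectors[t] for i, t in enumerate(tokens[lo:hi], start=lo)
               if i != index and t in vectors]
    if not context:
        return None  # no usable context word
    dim = len(context[0])
    return [sum(v[d] for v in context) / len(context) for d in range(dim)]
```

For example, with two-dimensional toy vectors for the two neighbours of a target word, the result is their component-wise mean.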
The clustering of the context window representations of the target ambiguous word in the text corpus with the CRP algorithm in step S1 comprises the following steps:
Step S101: obtain the context window representations of all samples of the ambiguous word in the text corpus.
Step S102: obtain the initial cluster centroid for the CRP clustering algorithm: either take a random sample as the initial cluster centroid, or perform an initial clustering of the context window representations of the ambiguous word with the k-means algorithm and use the centroid of the cluster containing the most samples as the initial cluster centroid.
Step S103: for the context window representation of every sample of the ambiguous word, compute the similarity between the sample and each cluster centroid over all clusters, and obtain the maximum similarity Smax, attained between the i-th sample and the t-th cluster centroid. If Smax is greater than a preset threshold α, assign the i-th sample to the t-th cluster, increase the number of samples in cluster t by 1, and recompute the centroid of the t-th cluster. Otherwise, create a new cluster: the total number of clusters K increases by 1, the new cluster contains 1 sample, and its centroid is sample i.
Step S104: obtain the samples in each cluster, the cluster centroids and the total number of clusters.
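Steps S101 to S104 can be sketched as a single pass over the samples. The sketch makes several assumptions that the text does not fix: cosine similarity is used as the similarity measure, the first sample seeds the first cluster, and the names `cosine` and `crp_style_cluster` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def crp_style_cluster(samples, alpha=0.8):
    """Assign each sample to its most similar cluster, or open a new one."""
    centroids = [list(samples[0])]  # S102: seed with the first sample
    members = [[0]]                 # sample indices per cluster
    for i, s in enumerate(samples[1:], start=1):
        # S103: similarity to every centroid, keep the best cluster t.
        sims = [cosine(s, c) for c in centroids]
        t = max(range(len(sims)), key=sims.__getitem__)
        if sims[t] > alpha:
            members[t].append(i)
            n = len(members[t])
            # Recompute the centroid as the running mean of its samples.
            centroids[t] = [c + (x - c) / n for c, x in zip(centroids[t], s)]
        else:
            centroids.append(list(s))  # new cluster with sample i as centroid
            members.append([i])
    return members, centroids          # S104
```

The number of clusters is not passed in; it grows whenever no existing centroid is similar enough, which is the property the description relies on.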
The training on the labeled text corpus to obtain the multi-prototype vector representation of the ambiguous word in step S1 comprises the following steps:
Step S201: label every sample of the target ambiguous word in the text corpus with the cluster it belongs to; different clusters represent different senses of the target word.
Step S202: run the word vector training process based on a neural network language model on the cluster-labeled corpus to obtain the multi-prototype vector representation that expresses the specific sense of the word in different contexts.
The word sense disambiguation of polysemous words in step S2 comprises the following steps:
Step S301: preprocess the target short text to obtain its word sequence, and identify the ambiguous words in the word sequence according to the multi-prototype word vector representation.
Step S302: disambiguate each ambiguous word: compute the similarity between the context window representation of the word in the short-text word sequence and the centroid of each cluster corresponding to the word in the text corpus, and take the word vector of the cluster with the maximum similarity as the word vector of the specific sense that the ambiguous word expresses in its context.
The preprocessing of the target short text in step S2 to obtain its word sequence comprises: removing stop words; converting traditional Chinese characters to simplified Chinese characters; replacing the English abbreviations in the target short text with Chinese words, using a Chinese-English abbreviation dictionary; segmenting the short text into words; and replacing all characters other than Chinese characters and digits with a special symbol.
Beneficial effects of the present invention: the embodiments of the present invention provide a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. The context window representations of the target word in the text corpus are clustered with a CRP-based clustering algorithm, and the word vectors of the ambiguous word are trained on the cluster-labeled corpus, which improves the accuracy of the vector representation of ambiguous words and solves the problem of representing polysemy in word representations. For an ambiguous word in a sentence, the multi-prototype vector representation of the word is used: the similarity between the context window representation of the ambiguous word and the centroids of that word's clusters in the training samples is computed, and the word vector of the cluster with the maximum similarity is taken as the word vector of the specific meaning of the ambiguous word in its context, which removes the ambiguity of the word.
The multi-prototype word vector representation method based on CRP clustering proposed by the present invention clusters the context representations of all samples of the target ambiguous word with a CRP-based clustering algorithm; each resulting cluster represents one meaning of the target word, and training on the cluster-labeled corpus yields the multi-prototype word vector representation. The multi-prototype vector representation of a word can represent the different senses of an ambiguous word separately, which solves the problem of representing polysemy.
The present invention clusters the context window representations of all samples of the target ambiguous word with a CRP-based clustering algorithm. CRP clustering does not require the number of clusters to be specified in advance, and the number of clusters obtained effectively reflects the number of different senses of the ambiguous word, which solves the practical problem that different polysemous words have different numbers of senses. The context window representation of a word serves as the criterion for deciding whether occurrences belong to the same cluster, and the calculation is simple.
The word sense disambiguation method based on the multi-prototype word vector representation proposed by the present invention can identify the ambiguous words in a sentence and obtain the word vector of the specific sense of each word in its context, removing the ambiguity of the word across different contexts. The similarity between the context window representation of the target ambiguous word and the centroid of each cluster corresponding to the word in the text corpus is computed, and the word vector of the cluster with the maximum similarity is taken as the word vector of the specific sense of the ambiguous word in its context, which disambiguates the word.
Brief description of the drawings
Fig. 1 is the overall flow chart of the multi-prototype word vector representation and word sense disambiguation based on CRP clustering;
Fig. 2 is the flow chart of clustering the context window representations of an ambiguous word based on CRP;
Fig. 3 is the training process of the multi-prototype word vector representation based on CRP clustering;
Fig. 4 is the flow chart of word sense disambiguation based on the multi-prototype word vector representation;
Fig. 5 shows the noun word sense disambiguation results based on the multi-prototype word vector representation;
Fig. 6 shows the verb word sense disambiguation results based on the multi-prototype word vector representation.
Detailed description of the embodiments
The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings, but it should be understood that the scope of protection of the present invention is not limited by the specific embodiments.
The invention discloses a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. As shown in Fig. 1, the basic idea of the invention is to construct the multi-prototype vector representation of a word on the basis of CRP clustering of its context representations, to identify the ambiguous words in a sentence or short text, and to remove their ambiguity by obtaining the word vector of the specific sense of each word in its context. The multi-prototype word vector representation expresses the different meanings of a word in context more precisely. The specific steps of the present invention are as follows:
In step S1, the texts in the large-scale text corpus are cleaned and preprocessed to obtain plain text. For a public or collected text corpus: delete texts whose character count is below a preset threshold; convert the traditional Chinese characters in the corpus to simplified Chinese characters; replace the English abbreviations in the corpus with Chinese words, using a custom dictionary; segment the texts with a word segmentation system; remove all characters other than Chinese characters and digits by regular-expression matching; remove stop words and count word frequencies; cap the frequency of high-frequency words at a preset upper threshold; and finally build the word list from the words whose frequency in the corpus exceeds a preset lower threshold.
In step S1, the context window representations of all samples of the ambiguous word in the text corpus are obtained. The window size is set to a positive integer, and each context window representation is computed as a weighted combination of the word vectors of the words in the window.
In step S1, as shown in Fig. 2, the context window representations of all samples of the target ambiguous word are clustered with the CRP algorithm, specifically:
1. Perform an initial clustering of the context window representations of the ambiguous word with the k-means algorithm to obtain the clusters and their centroids;
2. Use the centroid of the cluster containing the most samples as the initial cluster centroid of the CRP clustering algorithm;
3. For the context window representation of every sample of the ambiguous word, compute the similarity between the sample and each cluster centroid over all clusters, and obtain the maximum similarity Smax, attained between the i-th sample and the t-th cluster centroid;
4. If Smax is greater than the preset threshold α, assign the i-th sample to the t-th cluster, increase the number of samples in cluster t by 1, and recompute the centroid of the t-th cluster. Otherwise, create a new cluster: the total number of clusters K increases by 1, the new cluster contains 1 sample, and its centroid is sample i;
5. Obtain the samples in each cluster, the cluster centroids and the total number of clusters.
Steps 1 and 2 can be simplified by using the first sample, or a random sample, as the initial cluster centroid for CRP clustering.
In step S1, as shown in Fig. 3, the multi-prototype vector representation of words is obtained by training, specifically:
1. Obtain the context window representations of all occurrences of the target ambiguous word in the text corpus;
2. Cluster the context window representations of the ambiguous word with the CRP algorithm to obtain the clusters of the word's context representations;
3. For the target ambiguous word, locate each sample in the original text corpus by the target word and its context, and label it in the target text corpus with the category of the cluster the sample belongs to; different clusters represent different meanings of the target word;
4. For each ambiguous word, perform steps 1, 2 and 3 so that the cluster category labels are written into the target text corpus;
5. Train the CBOW model on the labeled text corpus to obtain the multi-prototype vector representation of words.
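The labeling in steps 3 and 4 amounts to rewriting each occurrence of the target word as a sense-specific token. A minimal sketch, assuming a hypothetical `assignments` mapping from (document index, token index) to cluster number and a label format that simply appends the number to the word:

```python
def label_corpus(docs, target, assignments):
    """Replace each clustered occurrence of `target` by `target` + cluster id.

    docs        -- list of token lists (the segmented corpus)
    assignments -- {(doc index, token index): cluster id} for the target word
    """
    labeled = []
    for d, tokens in enumerate(docs):
        labeled.append([
            f"{tok}{assignments[(d, i)]}"
            if tok == target and (d, i) in assignments else tok
            for i, tok in enumerate(tokens)
        ])
    return labeled
```

The labeled token lists would then be handed to a CBOW trainer (step 5), so that each sense-specific token receives its own vector; the training call itself is not shown here.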
In step S2, as shown in Fig. 4, ambiguous word identification and word sense disambiguation based on the multi-prototype word vector representation are performed, specifically:
1. Preprocess the target short text: remove stop words; convert traditional Chinese characters to simplified Chinese characters; replace the English abbreviations in the target sentence with Chinese words, using a Chinese-English abbreviation dictionary; segment the short text into words; and replace all characters other than Chinese characters and digits with a special symbol. This yields the word sequence of the short text.
2. Identify the ambiguous words in the sentence. The ambiguous words in the word sequence are identified according to the multi-prototype word vector representation: an ambiguous word has two or more word vector representations.
3. Compute the context window representation of each ambiguous word. The context window representation is the weighted average of the word vectors of the context words. An ambiguous word that occurs among the context words contributes the word vector of the cluster in which that word occurs most frequently in the text corpus; an unrecognized word is represented by the average of the word vectors of the words in the context window.
4. Disambiguate the ambiguous words one by one in order of their number of senses, from fewest to most.
5. Compute the similarity between the context window representation of the ambiguous word in the short-text word sequence and the centroids of the clusters of the training samples, and take the word vector of the cluster with the maximum similarity as the word vector representation of the ambiguous word.
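Two details of the list above, the processing order in step 4 and the choice of vector for an ambiguous context word in step 3, can be sketched as follows. The data layout is an assumption made for illustration: `sense_vectors[word]` maps sense numbers to vectors, `sense_counts[word]` maps sense numbers to corpus frequencies, and `single_vectors` holds the vectors of unambiguous words. Handling of unrecognized words is left out.

```python
def disambiguation_order(ambiguous_words, sense_vectors):
    """Step 4: words with fewer senses are disambiguated first."""
    return sorted(ambiguous_words, key=lambda w: len(sense_vectors[w]))

def lookup_vector(word, single_vectors, sense_vectors, sense_counts):
    """Step 3: vector that a context word contributes to the window average."""
    if word in sense_vectors:
        # Ambiguous context word: use the vector of its most frequent sense.
        best = max(sense_counts[word], key=sense_counts[word].get)
        return sense_vectors[word][best]
    # Unambiguous word: its single vector, or None if the word is unknown.
    return single_vectors.get(word)
```

Because `sorted` is stable, words with the same number of senses keep their order of appearance in the sentence.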
Following the idea that the meaning of a word is determined by its context, the specific meaning of an ambiguous word in a context is obtained by computing the similarity between the context window representation of the ambiguous word and the text corpus cluster centroid corresponding to each of its word vectors; the word vector with the maximum similarity is taken as the word vector of the specific meaning of the ambiguous word in that context. It is calculated as follows:
$$\mathrm{vec}(w) = \{\, \mathrm{vec}_k(w) \mid k,\ \mathrm{Sim}(\mathrm{veC}, \mathrm{vec}_k(w)) = \max_j \mathrm{Sim}(\mathrm{veC}, \mathrm{vec}_j(w)) \,\} \qquad (2)$$
where vec(w) is the word vector of the specific meaning of the ambiguous word w in the context window; vec_j(w) is the word vector representation of the text corpus cluster centroid corresponding to the j-th meaning of w; and max_j Sim(veC, vec_j(w)) is the maximum similarity between the context window representation veC of w and each vec_j(w). The k-th word vector, which attains this maximum, is taken as the word vector of the specific meaning of w.
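Formula (2) is an argmax over the senses of the word. A minimal sketch, assuming cosine similarity for Sim and two hypothetical dictionaries keyed by sense number: `centroids` for the cluster centroids and `sense_vectors` for the trained sense vectors.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(context_vec, centroids, sense_vectors):
    """Return (k, sense vector k) for the centroid most similar to the context."""
    k = max(centroids, key=lambda j: cosine(context_vec, centroids[j]))
    return k, sense_vectors[k]
```

The similarity is computed against the cluster centroids, while the value returned is the trained word vector of the winning sense, as the definitions under formula (2) describe.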
Explanation of terms: CRP is the abbreviation of Chinese Restaurant Process, a typical representation of the Dirichlet process used in mixture models. Its advantage is that the number of mixture components does not have to be specified manually, which suits clustering problems in natural language processing.
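For background, the textbook Chinese Restaurant Process seats a new customer at an existing table with probability proportional to the number of customers already there, and at a new table with probability proportional to a concentration parameter. The sketch below shows only that standard rule; the function name and the parameter name `concentration` are illustrative, and the method described here decides on a new cluster with the similarity threshold α instead of sampling from these probabilities.

```python
def crp_probabilities(table_sizes, concentration=1.0):
    """Seating probabilities for the next customer in a CRP.

    Returns (probabilities of joining each existing table,
             probability of opening a new table).
    """
    denom = sum(table_sizes) + concentration
    existing = [size / denom for size in table_sizes]
    return existing, concentration / denom
```

With tables of sizes 3 and 1 and a concentration of 1, the next customer joins the first table with probability 3/5, the second with 1/5, and opens a new table with 1/5.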
Multi-prototype word vector representation of the multiple senses of ambiguous words in different contexts. In Table 1, an unlabeled word corresponds to an ordinary word vector representation; for example, "apple" is the word without sense distinction. A specific sense of a word corresponds to a multi-prototype vector representation; for example, "apple 2" denotes the 2nd sense of the word "apple", the agricultural product. The word vector of "apple 1" corresponds to its meaning as an IT company, while "apple 2" corresponds to its meaning as a fruit. The multi-prototype word vector representation captures the semantic information that distinguishes the senses of a word.
Table 1. Nearest words for words and word senses, based on the CRP method
Embodiment of the word sense disambiguation method based on the multi-prototype word vector representation.
The test data set for polysemous word sense disambiguation is the Chinese corpus of SemEval-2007 Task 5. It contains 40 ambiguous words, divided into verbs and nouns, and each word has at least two senses. The number of senses differs among the polysemous words in the test set: most have 2 to 4 senses, and the word with the most senses, "out", has 9. For example, the word "Chinese medicine" has two senses, "practitioner of Chinese medicine" (a doctor of traditional Chinese medicine) and "traditional Chinese medical science" (traditional Chinese medicine as a discipline), and each sense has a different number of text examples.
In the word sense disambiguation test, for each polysemous word given in the test set, the method extracts the ambiguous word and its context representation from each text example, computes the similarity between the context representation and the text corpus cluster centroid corresponding to each word vector of the multi-prototype representation, and obtains the multi-prototype word vector and cluster category assigned to the polysemous word. The sense category expressed by that word vector is then compared with the gold standard of the test set to judge whether the disambiguation result is correct.
The noun word sense disambiguation results based on the multi-prototype word vector representation are shown in Fig. 5. The verb word sense disambiguation results based on the multi-prototype word vector representation are shown in Fig. 6.
In information retrieval, the multi-prototype word vector representation and word sense disambiguation method can identify the specific meaning in context of ambiguous words in the retrieved objects, which improves the accuracy of word representation, makes the calculation more reasonable and the search results more accurate.
In information retrieval applications, in order to recall more results similar to the query word sequence or keywords, similarity (sentence similarity or word similarity) can be used to identify similar word sequences or keywords. Word similarity can be measured by the cosine of the angle between the word vectors of two words.
For example, the multi-prototype word vectors of the ambiguous word "doing accounts" are "do accounts 1" and "do accounts 2". "Do accounts 1" means "to work out accounts" or "to calculate profit and loss"; "do accounts 2" means "to settle scores after the autumn harvest", that is, "to contend with someone again after suffering a loss or defeat". The similarities of "do accounts 1" to the words "clearing" and "revenge" are 0.66 and 0.11 respectively, while those of "do accounts 2" are 0.14 and 0.72. The similarity between "do accounts 1" and "do accounts 2" is 0.25, so the different senses of the word "doing accounts" differ considerably from one another.
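The cosine measure used in this example can be illustrated as follows. The 2-dimensional vectors below are hypothetical toy values chosen only to show the calculation; they are not the trained vectors behind the similarities quoted above, and the helper name `nearest` is an assumption.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors; real values would come from the trained model.
vectors = {
    "do_accounts_1": np.array([0.9, 0.1]),
    "do_accounts_2": np.array([0.1, 0.9]),
    "clearing": np.array([1.0, 0.2]),
    "revenge": np.array([0.2, 1.0]),
}

def nearest(sense, candidates):
    """Return the candidate word whose vector is most similar to the sense vector."""
    return max(candidates, key=lambda w: cosine(vectors[sense], vectors[w]))

for sense in ("do_accounts_1", "do_accounts_2"):
    print(sense, "->", nearest(sense, ["clearing", "revenge"]))
```

Because each sense has its own vector, the two senses of the same surface word can have different nearest neighbours, which a single vector per word cannot express.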
In information retrieval, when the retrieval object is a sentence, sentence similarity can be used to measure the similarity between the retrieval object and the retrieval target. The retrieval target is preprocessed to obtain its word sequence, whose word count is denoted m; the ambiguous words in the word sequence are identified, and the word vector representation of each word in the sequence is obtained, denoted as set D. The sentence of the retrieval object is preprocessed in the same way to obtain its word sequence, whose word count is denoted n; the ambiguous words in the word sequence are identified, and the word vector representation of each word in the sequence is obtained, denoted as set S.
The similarity sim(D_i, S_j) between each word in set D and each word in set S is calculated, and the m most similar word pairs are extracted. The similarity Sim(D, S) between the retrieval object S and the target D is then obtained by the sentence similarity formula:
where the summation term denotes the sum of the similarities of the m most similar word pairs, m is the number of words in set D, and n is the number of words in set S.
For example, suppose the retrieval target contains the ambiguous word "doing accounts". The retrieval target is the sentence {look for him to do accounts}, and the retrieval objects are sentence 1 {tell a lie, and others will look for you to do accounts} and sentence 2 {let them recognize the harm while doing accounts}. After preprocessing, the word sequence sets are D = {look for him to do accounts}, S1 = {tell a lie, others will look for you to do accounts}, and S2 = {let them recognize the harm while doing accounts}. The ambiguous words in the word sequence sets D, S1, and S2 are identified and the word vector representation of each word is obtained. The similarity between each word of the retrieval target D and of the retrieval objects S1 and S2 is shown in Table 2.
Table 2. Similarity between each word of the retrieval target and each word of the retrieval objects
From formula 3, Sim(D, S1) = 0.62 and Sim(D, S2) = 0.39. The retrieval target D matches sentence S1 better, which is closer to the actual context, so the search result is more accurate.
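The word-pair matching step of the sentence similarity calculation can be sketched as below. This is an illustration under stated assumptions: "the m most similar word pairs" is read here as the best match in S for each of the m words of D, cosine similarity is used for sim(D_i, S_j), and the function names and toy vectors are hypothetical. The sketch stops at the sum of the matched similarities and deliberately leaves out the final normalization by m and n, since that part of the formula is not available in this text.

```python
import numpy as np

def similarity_matrix(D, S):
    """Cosine similarity sim(D_i, S_j) for every row of D against every row of S."""
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    return Dn @ Sn.T  # shape (m, n)

def best_match_sum(D, S):
    """Sum, over the m words of D, of the similarity to the closest word in S."""
    return float(similarity_matrix(D, S).max(axis=1).sum())

# Hypothetical 2-d word vectors: D has m = 2 words, S has n = 3 words.
D = np.array([[1.0, 0.0], [0.0, 1.0]])
S = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, -1.0]])

total = best_match_sum(D, S)
print(total)
```

For an ambiguous word, the row placed in D or S would be the sense vector chosen by disambiguation, so the same surface word can contribute different similarities in different sentences.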
The above discloses only several specific embodiments of the present invention. The embodiments of the present invention are not limited thereto, however, and any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present invention.

Claims (7)

1. A word multi-prototype vector representation and word sense disambiguation method based on CRP clustering, characterized by comprising the following steps:
Step S1: perform cleaning preprocessing on the text in a large-scale text corpus to obtain plain text; cluster, based on the CRP algorithm, the context window representations of a target ambiguous word in the text corpus; label the target ambiguous word in the text corpus according to cluster category; and train on the labelled text corpus to obtain the multi-prototype vector representation of the ambiguous word;
Step S2: preprocess a target short text to obtain the word sequence of the short text; identify the target polysemous word in the word sequence; calculate the similarity between the context window representation of the target ambiguous word and the centroid of each cluster corresponding to that word in the text corpus; and take the word vector representation corresponding to the cluster category with the maximum similarity as the word vector representation of the specific sense of the ambiguous word in context, thereby performing word sense disambiguation on the polysemous word.
2. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that the cleaning preprocessing of the text in the large-scale text corpus in step S1 to obtain plain text comprises: deleting texts whose character count is below a preset threshold; uniformly converting traditional Chinese characters into simplified Chinese characters; replacing English abbreviations in the text corpus with Chinese words using a Chinese-English abbreviation dictionary; segmenting the text in the text corpus into words; removing stop words; deleting characters other than Chinese characters and digits; counting word frequencies; setting the word frequency of high-frequency words to a preset upper threshold; selecting the words whose number of occurrences in the text corpus exceeds a preset lower threshold to build a word list; and building a polysemous word list based on a polysemous word dictionary.
3. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that the context window representation of the target ambiguous word in step S1 is obtained by averaging the word vectors of the words in the context of the word, with the specific calculation formula:
$$\mathrm{vec}_C = \frac{1}{|Context|}\sum_{w_i \in Context}\mathrm{vec}(w_i)$$
where $\mathrm{vec}_C$ is the context window representation of the word, $w_i$ is the i-th word in the context window word set Context of the word, and $\mathrm{vec}(w_i)$ is the initial word vector of word $w_i$.
4. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that clustering, based on the CRP algorithm, the context window representations of the target ambiguous word in the text corpus in step S1 comprises the following steps:
Step S101: obtain the context window representations of all samples of the ambiguous word in the text corpus;
Step S102: obtain the initial cluster centroid for the CRP clustering algorithm, either by taking a random sample as the initial cluster centroid of the CRP clustering, or by performing an initial clustering of the context window representations of the ambiguous word with the k-means algorithm and taking the centroid of the cluster containing the most samples as the initial cluster centroid;
Step S103: for the context window representations of all samples of the ambiguous word, calculate the similarity between each sample and the centroid of every cluster, and obtain the maximum similarity Smax, attained between the i-th sample and the centroid of the t-th cluster; if Smax is greater than a preset threshold α, assign the i-th sample to the t-th cluster, increase the sample count of cluster t by 1, and recalculate the centroid of the t-th cluster; otherwise, generate a new cluster, increase the total number of clusters K by 1, set the sample count of the new cluster to 1, and set the centroid of the new cluster to sample i;
Step S104: obtain the samples in each cluster, the centroid of each cluster, and the total number of clusters.
5. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that training on the labelled text corpus in step S1 to obtain the multi-prototype vector representation of the ambiguous word comprises the following steps:
Step S201: label all samples of the target ambiguous word in the text corpus according to the cluster to which each sample belongs, where different clusters represent different senses of the target word;
Step S202: run the word vector training process based on a neural network language model on the cluster-labelled text corpus to obtain the multi-prototype vector representations with which the word expresses its specific sense in different contexts.
6. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that performing word sense disambiguation on the polysemous word in step S2 comprises the following steps:
Step S301: preprocess the target short text to obtain the word sequence of the short text, and identify the ambiguous words in the word sequence according to the multi-prototype vector representations of words;
Step S302: perform word sense disambiguation on each ambiguous word: calculate the similarity between the context window representation of the word in the short-text word sequence and the centroid of each cluster corresponding to the word in the text corpus, and extract the word vector representation corresponding to the cluster category with the maximum similarity as the word vector representation of the specific sense that the ambiguous word expresses in context.
7. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that preprocessing the target short text in step S2 to obtain the word sequence of the short text comprises: removing stop words; converting traditional Chinese characters into simplified Chinese characters; replacing the English abbreviations in the target short text with Chinese words using a Chinese-English abbreviation dictionary; performing word segmentation on the short text; and replacing characters other than Chinese characters and digits with special symbols.
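The clustering rule of steps S101 to S104 in claim 4 can be sketched as follows. This is a minimal illustration under assumptions, not the patented implementation: the function name `threshold_cluster` is hypothetical, cosine similarity is assumed as the similarity measure, the first sample stands in for the random or k-means seed of step S102 (the samples are assumed to be shuffled beforehand), and the centroid is updated as a running mean.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def threshold_cluster(samples, alpha):
    """Single pass over the samples; returns (labels, centroids)."""
    centroids = [samples[0].astype(float)]  # S102: seed cluster 0 (assumption)
    counts = [1]
    labels = [0]
    for x in samples[1:]:
        # S103: similarity of the sample to every existing centroid
        sims = [cosine(x, c) for c in centroids]
        t = int(np.argmax(sims))
        if sims[t] > alpha:
            # Join cluster t and recompute its centroid as a running mean
            counts[t] += 1
            centroids[t] += (x - centroids[t]) / counts[t]
            labels.append(t)
        else:
            # Open a new cluster whose centroid is the sample itself
            centroids.append(x.astype(float))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids  # S104: assignments, centroids, K = len(centroids)

# Hypothetical context window representations of four samples of one word.
samples = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels, centroids = threshold_cluster(samples, alpha=0.8)
print(labels, len(centroids))
```

The number of clusters is not fixed in advance: a lower threshold α merges more samples into existing clusters, and a higher one opens more new clusters, that is, more senses.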
CN201810783010.5A 2018-07-17 2018-07-17 CRP clustering-based word multi-prototype vector representation and word sense disambiguation method Active CN109033307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810783010.5A CN109033307B (en) 2018-07-17 2018-07-17 CRP clustering-based word multi-prototype vector representation and word sense disambiguation method

Publications (2)

Publication Number Publication Date
CN109033307A true CN109033307A (en) 2018-12-18
CN109033307B CN109033307B (en) 2021-08-31

Family

ID=64643470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810783010.5A Active CN109033307B (en) 2018-07-17 2018-07-17 CRP clustering-based word multi-prototype vector representation and word sense disambiguation method

Country Status (1)

Country Link
CN (1) CN109033307B (en)

Also Published As

Publication number Publication date
CN109033307B (en) 2021-08-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant