CN109033307A - Multi-prototype word vector representation and word sense disambiguation method based on CRP clustering
- Publication number
- CN109033307A (CN201810783010.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- cluster
- indicates
- ambiguity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. Step 1: the texts in a large-scale text corpus are cleaned and preprocessed to obtain plain text; the context-window representations of each target ambiguous word in the text corpus are clustered with the CRP algorithm; each occurrence of the target ambiguous word in the corpus is labeled with its cluster category; and multi-prototype vector representations of the ambiguous word are trained on the labeled corpus. Step 2: a target short text is preprocessed to obtain its word sequence; the target ambiguous words in the sequence are identified; the similarity between the context-window representation of each target ambiguous word and the centroid of each cluster of that word in the text corpus is computed; and the word vector of the cluster with the highest similarity is taken as the vector representation of the word's specific sense in that context, thereby disambiguating the polysemous word. The invention addresses the representation of polysemy in word representations and the identification of ambiguity in word-sense representations.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering.
Background art
In many natural language processing tasks, the basic problem is how to encode linguistic symbols in a form that a machine can process. Mapping linguistic symbols such as words, sentences, and texts to continuous low-dimensional vectors yields semantic vector representations of those units. Such representations are widely used in information retrieval, short text classification, named entity recognition, sentiment analysis, recommendation engines, automatic text summarization, and similar tasks.
The word is the most basic unit of language, and vector representations of words are widely used in natural language processing tasks. A simple word vector representation is the one-hot representation. Its drawbacks are that the vector dimension equals the size of the vocabulary, which leads to the curse of dimensionality; that it cannot capture semantic relations between words; and that it cannot reflect the different senses of an ambiguous word.
A word vector (word embedding or word representation) is a fixed-length, low-dimensional real-valued vector learned by training on a large amount of text. Each word receives a single vector, and similar or related words lie close to one another in the vector space. However, because some words are ambiguous, the same word form may express different senses in different contexts. Most traditional word vectors assign exactly one vector per word and therefore cannot effectively express the different senses of an ambiguous word. Each sense of an ambiguous word should correspond to its own vector representation.
A multi-prototype word vector representation assigns one word vector to each sense of an ambiguous word, which improves the accuracy of word representation. Vectors for the different senses of a word are usually obtained with clustering-based models: senses are induced by clustering the contexts of the word, either directly on the contexts in the original text or after a semantic mapping that uses cross-lingual knowledge, and a further round of training then yields the word vector for the specific sense of the word in each context.
In methods that obtain vectors for polysemous words by combining the k-means clustering algorithm with neural network language model training, the parameter k (the number of clusters) must be set to a different value for each polysemous word according to its number of senses. By contrast, training multi-prototype word vectors based on CRP clustering does not require the number of clusters to be specified in advance, which matches the fact that different ambiguous words have different numbers of senses in context.
High-quality word-sense representations capture rich semantic and syntactic information and help word sense disambiguation; conversely, high-quality disambiguation helps learn better sense representations. There are two main classes of word sense disambiguation methods: methods based on external knowledge bases and methods based on corpora. Knowledge-base methods use the sense definitions or descriptions in an external knowledge base (WordNet or HowNet) to determine the specific sense of an ambiguous word, but building such knowledge bases or dictionaries requires a great deal of manpower and material resources. Corpus-based methods use the corpus as the knowledge resource and determine the specific sense of a word in a given context through automatic or semi-automatic learning, thereby achieving disambiguation.
For the ambiguous words in a sentence, applying the given word sense disambiguation method to the multi-prototype word vectors learned from a text corpus yields the specific sense of each word in its context, which improves the representation of both words and sentences.
As Internet technology and mobile applications have become part of daily life, people increasingly use mobile terminals to exchange information and communicate, producing massive amounts of data: news headlines, microblog posts, product or service descriptions on shopping platforms, forum comments, intelligent interactive applications, social chat messages, and so on. These data usually consist of text of short length, a typical short-text form. Short-text data contain a large amount of valuable information and have high research value. Using machines to process and understand the massive short-text data on the Internet effectively has become an important research challenge and focus in natural language processing and machine learning.
In similarity computation for information retrieval, the multi-prototype word vector representation and word sense disambiguation method can distinguish the specific senses of ambiguous words in the retrieval object, improving the accuracy of word representation and of the computation. For short-text retrieval, the method provides an effective way to represent word semantics and disambiguate word senses, and thus technical support for semantic computation.
Summary of the invention
The purpose of the present invention is to overcome the above problems in the prior art by providing a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. The multi-prototype representation assigns one word vector to each sense of an ambiguous word, which addresses the representation of polysemy in word representations. The word sense disambiguation method built on the multi-prototype representation addresses the identification of ambiguity in word-sense representations.
The technical solution of the present invention is a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering, comprising the following steps:
Step S1: clean and preprocess the texts in a large-scale text corpus to obtain plain text; cluster, with the CRP algorithm, the context-window representations of each target ambiguous word in the text corpus; label each occurrence of the target ambiguous word in the corpus with its cluster category; and train on the labeled text corpus to obtain the multi-prototype vector representations of the ambiguous word;
Step S2: preprocess a target short text to obtain its word sequence; identify the target ambiguous words in the word sequence; compute the similarity between the context-window representation of each target ambiguous word and the centroid of each cluster of that word in the text corpus; and take the word vector of the cluster with the highest similarity as the vector representation of the word's specific sense in that context, thereby disambiguating the polysemous word.
The cleaning and preprocessing of the texts in the large-scale text corpus in step S1 comprises: deleting texts whose character count is below a preset threshold; converting traditional Chinese characters to simplified Chinese characters; replacing English abbreviations in the corpus with Chinese words by means of a Chinese-English abbreviation dictionary; segmenting the texts into words; removing stop words; deleting all characters other than Chinese characters and digits; counting word frequencies; capping the frequency of high-frequency words at a preset upper threshold; building a word list from the words whose frequency in the corpus exceeds a preset lower threshold; and building a list of polysemous words from a polysemy dictionary.
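The frequency-based part of this preprocessing can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the texts are already segmented into tokens, measures text length in tokens rather than characters, and reads "capping the frequency of high-frequency words" as clamping the stored count. The function name build_vocabulary and all parameter names are hypothetical.

```python
from collections import Counter


def build_vocabulary(segmented_texts, stop_words, min_text_len=2,
                     lower_threshold=1, upper_threshold=1000):
    """Filter short texts, drop stop words, and build a frequency-filtered word list.

    segmented_texts: list of token lists (segmentation is assumed done upstream).
    Returns (kept_texts, vocab) where vocab maps word -> (capped) frequency.
    """
    # Delete texts that are shorter than the preset threshold, then remove stop words.
    kept = [[w for w in text if w not in stop_words]
            for text in segmented_texts if len(text) >= min_text_len]
    # Count word frequencies over the remaining corpus.
    counts = Counter(w for text in kept for w in text)
    # Keep words above the lower threshold; cap very frequent words at the upper one.
    vocab = {w: min(c, upper_threshold)
             for w, c in counts.items() if c > lower_threshold}
    return kept, vocab
```

Traditional-to-simplified conversion, abbreviation replacement, and word segmentation depend on external dictionaries and tools and are deliberately left out of the sketch.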
The context-window representation of the target ambiguous word in step S1 is obtained by averaging the word vectors of the words in the word's context. The formula is:
$$veC = \frac{1}{|Context|} \sum_{w_i \in Context} vec(w_i) \qquad (1)$$
where veC is the context-window representation of the word, w_i is the i-th word in the set Context of words in the word's context window, and vec(w_i) is the initial word vector of w_i.
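The averaging described above can be sketched in a few lines of plain Python. The function name context_window_vector, the window parameter, and the choice to skip the target word itself and any word without an initial vector are assumptions made for illustration.

```python
def context_window_vector(tokens, index, vectors, window=2):
    """Average the initial vectors of the words around tokens[index].

    tokens:  the segmented sentence.
    index:   position of the target ambiguous word.
    vectors: dict mapping word -> initial word vector (list of floats).
    Returns None when no context word has a vector.
    """
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    # Context = words inside the window, excluding the target word itself.
    context = [tokens[i] for i in range(lo, hi)
               if i != index and tokens[i] in vectors]
    if not context:
        return None
    dim = len(vectors[context[0]])
    # Component-wise mean of the context word vectors.
    return [sum(vectors[w][d] for w in context) / len(context)
            for d in range(dim)]
```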
The clustering, with the CRP algorithm, of the context-window representations of the target ambiguous word in the text corpus in step S1 comprises the following steps:
Step S101: obtain the context-window representations of all samples of the ambiguous word in the text corpus;
Step S102: obtain the initial cluster centroid for the CRP clustering algorithm, either by taking a random sample as the initial centroid, or by running an initial k-means clustering on the context-window representations of the ambiguous word and using the centroid of the cluster that contains the most samples as the initial centroid;
Step S103: for the context-window representations of all samples of the ambiguous word, and over all clusters, compute the similarity between each sample and each cluster centroid, and obtain the maximum similarity Smax between the i-th sample and the t-th cluster centroid. If Smax is greater than a preset threshold α, assign the i-th sample to the t-th cluster, increase the number of samples in cluster t by 1, and recompute the centroid of cluster t. Otherwise, create a new cluster, increase the total number of clusters K by 1, set the number of samples in the new cluster to 1, and use sample i as the centroid of the new cluster;
Step S104: obtain the samples in each cluster, the cluster centroids, and the total number of clusters.
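Steps S101 to S104 can be sketched as a single pass over the samples. This is an illustrative reading of the steps, not the patent's code: it assumes cosine similarity as the similarity measure, updates the centroid as a running mean, and treats an externally supplied initial centroid only as a seed that is replaced by the first sample assigned to it. The names cosine and crp_threshold_cluster are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def crp_threshold_cluster(samples, alpha, initial_centroid=None):
    """Assign each sample to its most similar cluster, or open a new cluster.

    samples: list of context-window vectors for one ambiguous word (S101).
    alpha:   similarity threshold from step S103.
    Returns (labels, centroids); len(centroids) is the cluster count K (S104).
    """
    if not samples:
        return [], []
    # S102: seed with the given centroid, or simply with the first sample.
    seed = initial_centroid if initial_centroid is not None else samples[0]
    centroids, counts, labels = [list(seed)], [0], []
    for x in samples:
        sims = [cosine(x, c) for c in centroids]
        t = max(range(len(sims)), key=sims.__getitem__)
        if sims[t] > alpha:
            # S103: join cluster t and update its centroid as a running mean.
            counts[t] += 1
            n = counts[t]
            centroids[t] = [c + (a - c) / n for c, a in zip(centroids[t], x)]
        else:
            # S103: open a new cluster whose centroid is the sample itself.
            centroids.append(list(x))
            counts.append(1)
            t = len(centroids) - 1
        labels.append(t)
    return labels, centroids
```

The number of clusters is never passed in; it grows whenever no existing centroid is similar enough, which is the property the method relies on.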
The training on the labeled text corpus in step S1 to obtain the multi-prototype vector representations of the ambiguous word comprises the following steps:
Step S201: label all samples of the target ambiguous word in the text corpus according to the cluster each belongs to, where different clusters represent different senses of the target word;
Step S202: run word vector training based on a neural network language model on the cluster-labeled corpus to obtain the multi-prototype vectors that express the specific sense of the word in different contexts.
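Step S201 amounts to rewriting each occurrence of the target word as a sense-tagged token so that the subsequent word vector training learns one vector per tag. The sketch below assumes the cluster labels are given in corpus order, one per occurrence, and uses a word_k naming scheme with 1-based k; both the scheme and the function name label_senses are illustrative choices, not taken from the source.

```python
def label_senses(sentences, target, labels):
    """Replace each occurrence of `target` with `target_<k>`.

    sentences: list of token lists (the text corpus).
    target:    the ambiguous word to label.
    labels:    0-based cluster ids, one per occurrence of `target`, in corpus order.
    """
    next_label = iter(labels)
    labeled = []
    for sent in sentences:
        # next() is only evaluated for tokens equal to the target word.
        labeled.append([f"{tok}_{next(next_label) + 1}" if tok == target else tok
                        for tok in sent])
    return labeled
```

The labeled sentences can then be fed to any word vector trainer; each sense tag is treated as an ordinary vocabulary item.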
The word sense disambiguation of the polysemous word in step S2 comprises the following steps:
Step S301: preprocess the target short text to obtain its word sequence, and identify the ambiguous words in the word sequence according to the multi-prototype word vector representation;
Step S302: disambiguate each ambiguous word by computing the similarity between the word's context-window representation in the short-text word sequence and the centroid of each cluster of that word in the text corpus, and take the word vector of the cluster with the highest similarity as the vector representation of the specific sense the ambiguous word expresses in that context.
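Step S302 reduces to a nearest-centroid lookup. The sketch below assumes cosine similarity and assumes that centroids and sense vectors are stored in parallel lists, one entry per cluster of the word; the names cosine and disambiguate are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def disambiguate(context_vec, centroids, sense_vectors):
    """Return (k, vector) for the sense whose cluster centroid best matches the context.

    context_vec:   context-window representation of the ambiguous word.
    centroids:     cluster centroids learned for this word from the corpus.
    sense_vectors: trained word vectors, sense_vectors[k] belongs to centroids[k].
    """
    sims = [cosine(context_vec, c) for c in centroids]
    k = max(range(len(sims)), key=sims.__getitem__)
    return k, sense_vectors[k]
```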
The preprocessing of the target short text in step S2 to obtain its word sequence comprises: removing stop words; converting traditional Chinese characters to simplified Chinese characters; replacing English abbreviations in the target short text with Chinese words by means of a Chinese-English abbreviation dictionary; segmenting the short text into words; and replacing all characters other than Chinese characters and digits with a special symbol.
Beneficial effects of the present invention: the embodiments provide a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. A CRP-based clustering algorithm clusters the context-window representations of the target word in the text corpus, and word vectors for the ambiguous word are trained on the cluster-labeled corpus, which improves the accuracy of the vector representation of ambiguous words and addresses the representation of polysemy. For an ambiguous word in a sentence, the method uses the word's multi-prototype vectors: it computes the similarity between the word's context-window representation and the centroids of that word's clusters in the training samples, and takes the word vector of the cluster with the highest similarity as the vector for the word's specific sense in context, thereby removing the ambiguity.
The multi-prototype word vector representation method proposed by the present invention uses a CRP-based clustering algorithm to cluster the context representations of all samples of a target ambiguous word. Each resulting cluster represents one sense of the target word, and the multi-prototype word vectors are trained on the cluster-labeled corpus. The multi-prototype vectors represent the different senses of an ambiguous word separately, which addresses the representation of polysemy.
The present invention clusters the context-window representations of all samples of a target ambiguous word with a CRP-based clustering algorithm. CRP clustering does not require the number of clusters to be specified in advance, and the number of clusters obtained effectively reflects the number of senses of the ambiguous word, which handles the practical problem that different polysemous words have different numbers of senses. Using the word's context-window representation as the criterion for cluster membership also keeps the computation simple.
The word sense disambiguation method proposed by the present invention, based on the multi-prototype word vector representation, can identify the ambiguous words in a sentence and obtain the word vector for each word's specific sense in context, removing the ambiguity of the word across different contexts. It computes the similarity between the context-window representation of the target ambiguous word and the centroid of each cluster of that word in the text corpus, and takes the word vector of the cluster with the highest similarity as the vector for the word's specific sense in context.
Brief description of the drawings
Fig. 1 is the overall flowchart of the multi-prototype word vector representation and word sense disambiguation based on CRP clustering;
Fig. 2 is the flowchart of clustering the context-window representations of an ambiguous word with CRP;
Fig. 3 shows the training process of the multi-prototype word vector representation based on CRP clustering;
Fig. 4 is the flowchart of word sense disambiguation based on the multi-prototype word vector representation;
Fig. 5 shows the noun word sense disambiguation results based on the multi-prototype word vector representation;
Fig. 6 shows the verb word sense disambiguation results based on the multi-prototype word vector representation.
Detailed description of the embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the scope of protection of the present invention is not limited by these specific embodiments.
The invention discloses a multi-prototype word vector representation and word sense disambiguation method based on CRP clustering. As shown in Fig. 1, the basic idea is to build multi-prototype word vectors on top of CRP clustering of word context representations, identify the ambiguous words in a sentence or short text, remove their ambiguity, and obtain the word vector for each word's specific sense in context. The multi-prototype representation expresses the different senses of a word in context more precisely. The specific steps of the present invention are as follows:
In step S1, the texts in the large-scale text corpus are cleaned and preprocessed to obtain plain text. For a public or collected text corpus: delete texts whose character count is below a preset threshold; convert traditional Chinese characters in the corpus to simplified Chinese characters; replace English abbreviations in the corpus with Chinese words using a custom dictionary; segment the text with a word segmentation system; remove all characters other than Chinese characters and digits by regular-expression matching; remove stop words and count word frequencies; cap the frequency of high-frequency words at a preset upper threshold; and finally build a word list from the words whose frequency in the corpus exceeds a preset lower threshold.
In step S1, the context-window representations of all samples of the ambiguous word in the text corpus are obtained. The window size is set to a positive integer, and each context-window representation is computed as a weighted combination of the word vectors of the words in the window.
In step S1, as shown in Fig. 2, the context-window representations of all samples of the target ambiguous word are clustered with the CRP algorithm, specifically:
1. Run an initial k-means clustering on the context-window representations of the ambiguous word to obtain the clusters and their centroids;
2. Use the centroid of the cluster containing the most samples as the initial cluster centroid of the CRP clustering algorithm;
3. For the context-window representations of all samples of the ambiguous word, and over all clusters, compute the similarity between each sample and each cluster centroid, and obtain the maximum similarity Smax between the i-th sample and the t-th cluster centroid;
4. If Smax is greater than the preset threshold α, assign the i-th sample to the t-th cluster, increase the number of samples in cluster t by 1, and recompute the centroid of cluster t. Otherwise, create a new cluster, increase the total number of clusters K by 1, set the number of samples in the new cluster to 1, and use sample i as the centroid of the new cluster;
5. Obtain the samples in each cluster, the cluster centroids, and the total number of clusters.
Steps 1 and 2 can be simplified by using the first sample, or a random sample, as the initial cluster centroid for CRP clustering.
In step S1, as shown in Fig. 3, the multi-prototype word vectors are trained, specifically:
1. Obtain the representations of all context windows of the target ambiguous word in the text corpus;
2. Cluster the context-window representations of the ambiguous word with the CRP algorithm to obtain the clusters of the word's context representations;
3. For the target ambiguous word, locate each occurrence in the original text corpus from the target word and its context, and label it in the target text corpus with the category of the cluster to which the sample belongs; different clusters represent different senses of the target word;
4. Repeat steps 1, 2, and 3 for every ambiguous word, writing the cluster category labels into the target text corpus;
5. Train a CBOW model on the labeled text corpus to obtain the multi-prototype word vectors.
In step S2, as shown in Fig. 4, ambiguous words are identified and disambiguated on the basis of the multi-prototype word vector representation, specifically:
1. Preprocess the target short text: remove stop words; convert traditional Chinese characters to simplified Chinese characters; replace English abbreviations in the target sentence with Chinese words using the Chinese-English abbreviation dictionary; segment the short text into words; and replace all characters other than Chinese characters and digits with a special symbol. This yields the word sequence of the short text.
2. Identify the ambiguous words in the sentence. Ambiguous words in the word sequence are identified from the multi-prototype word vector representation: an ambiguous word has two or more word vectors.
3. Compute the context-window representation of each ambiguous word. The representation is the weighted average of the word vectors of the context words. For an ambiguous word that appears among the context words, use the word vector of the cluster in which that word occurs most frequently in the text corpus. For an unrecognized word, use the average of the word vectors of the words in the context window.
4. Order the ambiguous words by their number of senses and disambiguate them in turn, from the fewest senses to the most.
5. Compute the similarity between the context-window representation of the ambiguous word in the short-text sequence and the centroids of the training-sample clusters, and take the word vector of the cluster with the highest similarity as the word vector of the ambiguous word.
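Items 3 and 4 of the list above involve two small decisions that a sketch makes concrete: which vector a context word contributes, and in which order the ambiguous words are processed. The data layout below (a sense inventory mapping each word to its list of sense vectors, and per-sense occurrence counts) and the function names are assumptions for illustration only.

```python
def disambiguation_order(tokens, sense_inventory):
    """Positions of ambiguous tokens, ordered from fewest senses to most.

    sense_inventory: dict mapping word -> list of sense vectors;
    a word is ambiguous when it has two or more vectors.
    """
    positions = [i for i, t in enumerate(tokens)
                 if len(sense_inventory.get(t, [])) >= 2]
    # sorted() is stable, so ties keep their order in the sentence.
    return sorted(positions, key=lambda i: len(sense_inventory[tokens[i]]))


def context_word_vector(token, sense_inventory, sense_counts, window_mean):
    """Vector that a context word contributes to the window representation.

    sense_counts: dict mapping word -> occurrence count of each sense in the corpus.
    window_mean:  average vector of the known words in the window,
                  used as the fallback for unrecognized words.
    """
    senses = sense_inventory.get(token)
    if not senses:
        return window_mean          # unrecognized word
    if len(senses) == 1:
        return senses[0]            # unambiguous word
    # Ambiguous context word: use its most frequent sense in the corpus.
    best = max(range(len(senses)), key=lambda k: sense_counts[token][k])
    return senses[best]
```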
Following the idea that the meaning of a word is determined by its context, the specific sense of an ambiguous word in context is obtained by computing the similarity between the word's context-window representation and the centroid of the text-corpus cluster that corresponds to each of the word's vectors. The word vector with the highest similarity is taken as the word vector of the ambiguous word's specific sense in that context. The calculation is:
$$vec(w) = \{\, vec_k(w) \mid k,\ Sim(veC, vec_k(w)) = \max_j Sim(veC, vec_j(w)) \,\} \qquad (2)$$
where vec(w) is the word vector of the specific sense of the ambiguous word w in the context window, vec_j(w) is the vector representation of the text-corpus cluster centroid for the j-th sense of w, and Max(Sim(veC, vec_j(w))) is the maximum similarity between the context-window representation veC of w and each vec_j(w). The k-th word vector, the one that attains this maximum, is taken as the word vector of the specific sense of w.
Explanation of terms: CRP is the abbreviation of Chinese Restaurant Process, a typical Dirichlet process mixture model. Its advantage is that the number of mixture components does not have to be specified manually, which suits clustering problems in natural language processing.
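For reference, the standard Chinese Restaurant Process seats the n-th customer at an existing table k with probability proportional to the number of customers n_k already seated there, and at a new table with probability proportional to a concentration parameter. The parameter is written θ below to avoid confusion with the similarity threshold α used in the clustering steps of this document. This is the textbook formulation; the clustering steps described in this document appear to replace the random seating with a deterministic similarity-threshold test.

```latex
P(z_n = k \mid z_1, \dots, z_{n-1}) =
\begin{cases}
\dfrac{n_k}{n - 1 + \theta}, & k \text{ is an existing table } (1 \le k \le K), \\[2ex]
\dfrac{\theta}{n - 1 + \theta}, & k = K + 1 \text{ (a new table)}.
\end{cases}
```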
Multi-prototype word vectors of the multiple senses of ambiguous words in different contexts. In Table 1, a word without a label, such as "apple", corresponds to the ordinary word vector that does not distinguish senses. A specific sense of a word corresponds to a multi-prototype vector; for example, "apple 2" denotes the 2nd sense of the word "apple", the agricultural product. The word vector of "apple 1" corresponds to the IT company, while "apple 2" expresses the meaning of a kind of fruit. The multi-prototype word vector representation captures the semantic information that distinguishes these senses.
Table 1: Nearest words for each word or word sense, based on the CRP method
Embodiment of the word sense disambiguation method based on the multi-prototype word vector representation.
The test dataset for polysemous word sense disambiguation comes from the Chinese corpus of SemEval-2007 Task 5. The test set contains 40 ambiguous words, divided into verbs and nouns, each with at least two senses. The number of senses varies across the polysemous words in the test set: most have 2 to 4 senses, and the word with the most senses, "out", has 9. For example, the word "Chinese medicine" has two senses, "practitioner of Chinese medicine" and "traditional Chinese medical science", that is, a doctor of traditional Chinese medicine and the discipline of Chinese medicine. Each sense comes with a different number of concrete text examples.
In the word sense disambiguation test, for each polysemous word given in the test set, the method extracts the ambiguous word and its context representation from each text example and computes the similarity between that context representation and the text-corpus cluster centroid corresponding to each of the word's multi-prototype vectors. This yields the multi-prototype word vector for the polysemous word and its cluster category. The sense category expressed by that vector is then compared with the gold-standard label in the test set to judge whether the disambiguation result is correct.
The noun word sense disambiguation results based on the multi-prototype word vector representation are shown in Fig. 5. The verb word sense disambiguation results based on the multi-prototype word vector representation are shown in Fig. 6.
In information retrieval, the multi-prototype word vector representation and word sense disambiguation method can identify the specific sense of each ambiguous word in the retrieval object within its context, which improves the accuracy of word representation, makes the computation more reasonable, and makes the retrieval results more accurate.
In information retrieval applications, to recall more results that are similar to the query word sequence or keywords, similarity measures (sentence similarity and word similarity) can be used to identify similar word sequences or keywords. Word similarity can be measured by the cosine of the angle between the word vectors of two words.
For example, the multi-prototype word vectors of the ambiguous word "do accounts" are "do accounts 1" and "do accounts 2". "Do accounts 1" means to do the accounts or to calculate profit and loss; "do accounts 2" means to settle scores afterwards ("squaring accounts after the autumn harvest"), that is, to have it out with someone after suffering a loss or defeat. The similarities of "do accounts 1" to the words "clearing" and "revenge" are 0.66 and 0.11 respectively, and those of "do accounts 2" to "clearing" and "revenge" are 0.14 and 0.72. The similarity between "do accounts 1" and "do accounts 2" is 0.25, so the different senses of the word "do accounts" differ considerably.
In information retrieval, when the retrieval object is a sentence, sentence similarity can be used to measure the similarity between the retrieval object and the retrieval target. The retrieval target is preprocessed to obtain its word sequence, whose number of words is denoted m; the ambiguous words in the sequence are identified, and the word vectors of all words in the sequence are obtained and denoted as set D. The sentence of the retrieval object is preprocessed in the same way to obtain its word sequence, whose number of words is denoted n; the ambiguous words in the sequence are identified, and the word vectors of all words in the sequence are obtained and denoted as set S.
The similarity $sim(D_i, S_j)$ between each word in set D and each word in set S is computed, and the m most similar word pairs are extracted. The similarity $Sim(D, S)$ between the retrieval object S and the target D is then obtained from the sentence similarity formula (formula 3), in which the summation term is the sum of the similarities of the m most similar word pairs, m is the number of words in set D, and n is the number of words in set S.
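The pairing step described above can be sketched as follows. Two points are assumptions made only for illustration: the "m most similar word pairs" are taken to be the best match in S for each of the m words in D, and the sum is normalised by m. The patent's own normalisation (which also involves n) is not shown here, so `sentence_similarity` should not be read as formula 3 itself. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_pair_sum(D, S):
    """Sum, over the words of D, of each word's highest similarity to a word of S."""
    return sum(max(cosine(d, s) for s in S) for d in D)

def sentence_similarity(D, S):
    """Illustrative score only: ASSUMES normalisation by m = len(D)."""
    if not D or not S:
        return 0.0
    return best_pair_sum(D, S) / len(D)

# Hypothetical 2-dimensional word vectors for a target D and an object S.
D = [[1.0, 0.0], [0.0, 1.0]]
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(sentence_similarity(D, S))
```

Before pairing, each ambiguous word would be represented by the sense vector selected for its context, so that the pairwise similarities reflect the disambiguated meaning.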
For example, the retrieval target contains the ambiguous word "doing accounts". The retrieval target is the sentence {look for him to do accounts}, and the retrieval objects are sentence 1 {if you tell lies, others will look for you to do accounts} and sentence 2 {let them recognize the harm while doing accounts}. After preprocessing each of them, the word sequence sets D = {look for him to do accounts}, S1 = {if you tell lies, others will look for you to do accounts} and S2 = {let them recognize the harm while doing accounts} are obtained. The ambiguous words in the word sequence sets D, S1 and S2 are identified, and the word vector representation of each word is obtained. The similarity between each word of the retrieval target D and each word of the retrieval objects S1 and S2 is shown in Table 2.
Table 2: Similarity between each word of the retrieval target and of the retrieval objects
By formula 3, Sim(D, S1) = 0.62 and Sim(D, S2) = 0.39. The retrieval target D therefore matches sentence S1 better, which is closer to the true context, and the search result is more accurate.
The above discloses only several specific embodiments of the present invention. The embodiments of the present invention are not limited thereto, however, and any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present invention.
Claims (7)
1. A word multi-prototype vector representation and word sense disambiguation method based on CRP clustering, characterized by comprising the following steps:
Step S1: perform cleaning preprocessing on the texts in a massive text corpus to obtain plain text; cluster the context window representations of a target ambiguous word in the text corpus using the CRP algorithm; label the occurrences of the target ambiguous word in the text corpus according to cluster category; and train on the labeled text corpus to obtain the multi-prototype vector representation of the ambiguous word;
Step S2: preprocess a target short text to obtain the word sequence of the short text; identify the target ambiguous word in the word sequence; compute the similarity between the context window representation of the target ambiguous word and the centroid of each cluster corresponding to that word in the text corpus; and take the word vector representation corresponding to the cluster category with the maximum similarity as the word vector representation of the specific sense of the ambiguous word in its context, thereby performing word sense disambiguation on the polysemous word.
2. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that the cleaning preprocessing performed in step S1 on the texts in the massive text corpus to obtain plain text comprises: deleting texts whose character count is below a preset threshold; converting traditional Chinese characters uniformly to simplified Chinese characters; replacing English abbreviations in the text corpus with Chinese words using a Chinese-English abbreviation dictionary; segmenting the texts in the text corpus into words; removing stop words; deleting all characters other than Chinese characters and digits; counting word frequencies; capping the word frequency of high-frequency words at a preset upper threshold; selecting the words whose number of occurrences in the text corpus exceeds a preset lower threshold to build a word list; and building a polysemous word list based on a polysemy dictionary.
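The frequency-related steps of this preprocessing can be sketched as follows. The sketch assumes the texts have already been segmented into word lists; the function name `build_vocabulary` and the parameter names `min_chars`, `lower` and `upper` are hypothetical, and the threshold values are arbitrary.

```python
from collections import Counter

def build_vocabulary(tokenized_texts, min_chars, lower, upper):
    """Drop short texts, count word frequencies, cap them, and keep frequent words."""
    # Delete texts whose total character count is below the preset threshold.
    kept = [t for t in tokenized_texts if sum(len(w) for w in t) >= min_chars]
    # Count word frequencies over the remaining texts.
    counts = Counter(w for t in kept for w in t)
    # Cap the frequency of high-frequency words at the upper threshold.
    capped = {w: min(c, upper) for w, c in counts.items()}
    # Keep only words occurring more often than the lower threshold.
    vocab = {w: c for w, c in capped.items() if c > lower}
    return kept, vocab

texts = [["a", "b", "a"], ["a"], ["a", "b", "c"]]
kept, vocab = build_vocabulary(texts, min_chars=2, lower=1, upper=2)
print(vocab)
```

Character conversion, abbreviation replacement, word segmentation and stop-word removal depend on external dictionaries and tools and are therefore left out of this sketch.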
3. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that the context window representation of the target ambiguous word in step S1 is obtained by averaging the word vectors of the words in the context of the word, the specific calculation formula being:
$$\mathrm{veC} = \frac{1}{|\mathrm{Context}|} \sum_{w_i \in \mathrm{Context}} \mathrm{vec}(w_i)$$
where veC is the context window representation of the word, $w_i$ is the i-th word in the context window word set Context of the word, and $\mathrm{vec}(w_i)$ is the initial word vector of word $w_i$.
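The averaging in claim 3 can be sketched as follows. The symmetric window of `window` words on each side, the skipping of words without an initial vector, and the function name are assumptions made for illustration; the claim itself only specifies that the context word vectors are averaged.

```python
def context_window_representation(tokens, position, vectors, window=2):
    """Average the initial word vectors of the words around tokens[position]."""
    start = max(0, position - window)
    # Context = words before and after the target, excluding the target itself.
    context = tokens[start:position] + tokens[position + 1:position + 1 + window]
    known = [vectors[w] for w in context if w in vectors]
    if not known:
        return None  # no context word has an initial vector
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]

# Hypothetical tokens and 2-dimensional initial vectors.
tokens = ["a", "b", "c"]
vectors = {"a": [1.0, 0.0], "c": [0.0, 1.0]}
print(context_window_representation(tokens, 1, vectors))
```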
4. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that clustering the context window representations of the target ambiguous word in the text corpus using the CRP algorithm in step S1 comprises the following steps:
Step S101: obtain the context window representations of all samples of the ambiguous word in the text corpus;
Step S102: obtain the initial cluster centroid for the CRP clustering algorithm, either by taking a random sample as the initial cluster centroid of the CRP clustering, or by performing an initial clustering of the context window representations of the ambiguous word with the k-means algorithm and taking the centroid of the cluster containing the largest number of samples as the initial cluster centroid;
Step S103: for the context window representations of all samples of the ambiguous word and for all clusters, compute the similarity between each sample and each cluster centroid, and obtain the maximum similarity Smax of the i-th sample, which is attained at the centroid of the t-th cluster; if Smax is greater than a preset threshold α, assign the i-th sample to the t-th cluster, increase the number of samples in cluster t by 1, and recompute the centroid of the t-th cluster; otherwise, create a new cluster, increase the total number of clusters K by 1, set the number of samples in the new cluster to 1, and set the centroid of the new cluster to sample i;
Step S104: obtain the samples in each cluster, the centroid of each cluster, and the total number of clusters.
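Steps S102 to S104 can be sketched as a single pass over the samples. The sketch assumes cosine similarity as the similarity measure and, for reproducibility, seeds the first cluster with the first sample rather than a random one or a k-means centroid; the function name `threshold_cluster` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def threshold_cluster(samples, alpha):
    """Assign each sample to its most similar cluster, or open a new one."""
    if not samples:
        return [], []
    centroids = [list(samples[0])]  # S102 (assumption: first sample as seed)
    members = [[0]]                 # sample indices per cluster
    for i in range(1, len(samples)):
        x = samples[i]
        sims = [cosine(x, c) for c in centroids]
        t = max(range(len(sims)), key=sims.__getitem__)
        if sims[t] > alpha:
            # S103: join cluster t and update its centroid as a running mean.
            members[t].append(i)
            n = len(members[t])
            centroids[t] = [(c * (n - 1) + xi) / n for c, xi in zip(centroids[t], x)]
        else:
            # S103: similarity too low, so the sample starts a new cluster.
            centroids.append(list(x))
            members.append([i])
    return members, centroids       # S104: clusters, centroids; K = len(centroids)

samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
members, centroids = threshold_cluster(samples, alpha=0.8)
print(members, len(centroids))
```

Because the number of clusters K grows only when a sample is dissimilar to every existing centroid, the number of senses does not have to be fixed in advance; the threshold α controls how readily new senses are created.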
5. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that training on the labeled text corpus in step S1 to obtain the multi-prototype vector representation of the ambiguous word comprises the following steps:
Step S201: label all samples of the target ambiguous word in the text corpus according to the cluster to which each belongs, different clusters representing different senses of the target word;
Step S202: run the word vector training process based on a neural network language model on the cluster-labeled text corpus, obtaining multi-prototype vector representations that express the specific senses of the word in different contexts.
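The labeling in step S201 can be sketched as a rewrite of each occurrence of the target word into a sense-tagged token. The `word_k` token format, the function name `label_senses` and the English example words are hypothetical; the training in step S202 would then be run on the relabeled corpus with a word embedding trainer, which is not shown here.

```python
def label_senses(tokenized_texts, target, cluster_ids):
    """Replace each occurrence of `target` with `target_<cluster id>`.

    cluster_ids holds one cluster id per occurrence, in corpus order.
    """
    ids = iter(cluster_ids)
    labeled = []
    for tokens in tokenized_texts:
        row = []
        for w in tokens:
            if w == target:
                row.append(f"{w}_{next(ids)}")  # sense-tagged token
            else:
                row.append(w)
        labeled.append(row)
    return labeled

corpus = [["bank", "river"], ["bank", "loan"]]
print(label_senses(corpus, "bank", [1, 2]))
```

After relabeling, each sense-tagged token is an ordinary vocabulary entry for the trainer, so each sense receives its own vector.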
6. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that performing word sense disambiguation on the polysemous word in step S2 comprises the following steps:
Step S301: preprocess the target short text to obtain the word sequence of the short text, and identify the ambiguous words in the word sequence according to the multi-prototype vector representations of words;
Step S302: perform word sense disambiguation on each ambiguous word by computing the similarity between the context window representation of the word in the short text word sequence and the centroid of each cluster corresponding to that word in the text corpus, and taking the word vector representation corresponding to the cluster category with the maximum similarity as the word vector representation of the specific sense that the ambiguous word expresses in its context.
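The selection in step S302 can be sketched as an argmax over cluster centroids. Cosine similarity is assumed as the similarity measure, and the function name `disambiguate` and the example vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def disambiguate(context_vec, centroids, sense_vectors):
    """Return (cluster index, sense vector, similarity) of the best-matching sense."""
    sims = [cosine(context_vec, c) for c in centroids]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sense_vectors[best], sims[best]

# Hypothetical centroids and trained sense vectors for one ambiguous word.
centroids = [[1.0, 0.0], [0.0, 1.0]]
sense_vectors = [[0.5, 0.5], [0.2, 0.9]]
print(disambiguate([0.1, 0.9], centroids, sense_vectors))
```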
7. The word multi-prototype vector representation and word sense disambiguation method based on CRP clustering according to claim 1, characterized in that preprocessing the target short text in step S2 to obtain the word sequence of the short text comprises: removing stop words; converting traditional Chinese characters to simplified Chinese characters; replacing English abbreviations in the target short text with Chinese words using a Chinese-English abbreviation dictionary; performing word segmentation on the short text; and replacing all characters other than Chinese characters and digits with a special symbol.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810783010.5A CN109033307B (en) | 2018-07-17 | 2018-07-17 | CRP clustering-based word multi-prototype vector representation and word sense disambiguation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810783010.5A CN109033307B (en) | 2018-07-17 | 2018-07-17 | CRP clustering-based word multi-prototype vector representation and word sense disambiguation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109033307A (en) | 2018-12-18 |
CN109033307B (en) | 2021-08-31 |
Family
ID=64643470
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810783010.5A Active CN109033307B (en) | 2018-07-17 | 2018-07-17 | CRP clustering-based word multi-prototype vector representation and word sense disambiguation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033307B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080065623A1 (en) * | 2006-09-08 | 2008-03-13 | Microsoft Corporation | Person disambiguation using name entity extraction-based clustering |
US20140214840A1 (en) * | 2010-11-29 | 2014-07-31 | Google Inc. | Name Disambiguation Using Context Terms |
CN104778186A (en) * | 2014-01-15 | 2015-07-15 | 阿里巴巴集团控股有限公司 | Method and system for hanging commodity object to standard product unit (SPU) |
CN103970729A (en) * | 2014-04-29 | 2014-08-06 | 河海大学 | Multi-subject extracting method based on semantic categories |
CN104008090A (en) * | 2014-04-29 | 2014-08-27 | 河海大学 | Multi-subject extraction method based on concept vector model |
US20160292149A1 (en) * | 2014-08-02 | 2016-10-06 | Google Inc. | Word sense disambiguation using hypernyms |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN104731771A (en) * | 2015-03-27 | 2015-06-24 | 大连理工大学 | Term vector-based abbreviation ambiguity elimination system and method |
CN106598947A (en) * | 2016-12-15 | 2017-04-26 | 山西大学 | Bayesian word sense disambiguation method based on synonym expansion |
CN107861939A (en) * | 2017-09-30 | 2018-03-30 | 昆明理工大学 | A kind of domain entities disambiguation method for merging term vector and topic model |
Non-Patent Citations (4)
Title |
---|
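BASILI, R. et al.: "Contextual word sense tuning and disambiguation", Applied Artificial Intelligence * |
ZHANG Han: "Research on person name disambiguation and person relation extraction incorporating sentence-meaning features", China Master's Theses Full-text Database (electronic journal) * |
WANG Ruiqin et al.: "Research on unsupervised word sense disambiguation", Journal of Software * |
GUO Hongqi et al.: "A sentence similarity calculation method based on multi-prototype word vector representation", Intelligent Computer and Applications * |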
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109783806A (en) * | 2018-12-21 | 2019-05-21 | 众安信息技术服务有限公司 | A kind of text matching technique using semantic analytic structure |
CN109783806B (en) * | 2018-12-21 | 2023-05-02 | 众安信息技术服务有限公司 | Text matching method utilizing semantic parsing structure |
CN109740162A (en) * | 2019-01-09 | 2019-05-10 | 安徽省泰岳祥升软件有限公司 | Document representation method, device and medium |
CN109740162B (en) * | 2019-01-09 | 2023-07-11 | 安徽省泰岳祥升软件有限公司 | Text representation method, device and medium |
CN109960799A (en) * | 2019-03-12 | 2019-07-02 | 中南大学 | A kind of Optimum Classification method towards short text |
CN110532395B (en) * | 2019-05-13 | 2021-09-28 | 南京大学 | Semantic embedding-based word vector improvement model establishing method |
CN110532395A (en) * | 2019-05-13 | 2019-12-03 | 南京大学 | A kind of method for building up of the term vector improved model based on semantic embedding |
CN110309515B (en) * | 2019-07-10 | 2023-08-11 | 北京奇艺世纪科技有限公司 | Entity identification method and device |
CN110309515A (en) * | 2019-07-10 | 2019-10-08 | 北京奇艺世纪科技有限公司 | Entity recognition method and device |
CN110705274B (en) * | 2019-09-06 | 2023-03-24 | 电子科技大学 | Fusion type word meaning embedding method based on real-time learning |
CN110705274A (en) * | 2019-09-06 | 2020-01-17 | 电子科技大学 | Fusion type word meaning embedding method based on real-time learning |
CN112579769A (en) * | 2019-09-30 | 2021-03-30 | 北京国双科技有限公司 | Keyword clustering method and device, storage medium and electronic equipment |
CN110717015A (en) * | 2019-10-10 | 2020-01-21 | 大连理工大学 | Neural network-based polysemous word recognition method |
CN110765781B (en) * | 2019-12-11 | 2023-07-14 | 沈阳航空航天大学 | Man-machine collaborative construction method for domain term semantic knowledge base |
CN110765781A (en) * | 2019-12-11 | 2020-02-07 | 沈阳航空航天大学 | Man-machine collaborative construction method for domain term semantic knowledge base |
CN111159337A (en) * | 2019-12-20 | 2020-05-15 | 中国建设银行股份有限公司 | Chemical expression extraction method, device and equipment |
CN111310475B (en) * | 2020-02-04 | 2023-03-10 | 支付宝(杭州)信息技术有限公司 | Training method and device of word sense disambiguation model |
CN111310475A (en) * | 2020-02-04 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Training method and device of word sense disambiguation model |
CN111414523A (en) * | 2020-03-11 | 2020-07-14 | 中国建设银行股份有限公司 | Data acquisition method and device |
CN111507098A (en) * | 2020-04-17 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN111507098B (en) * | 2020-04-17 | 2023-03-21 | 腾讯科技(深圳)有限公司 | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN111523312A (en) * | 2020-04-22 | 2020-08-11 | 南京贝湾信息科技有限公司 | Paraphrase disambiguation-based query display method and device and computing equipment |
CN111523312B (en) * | 2020-04-22 | 2023-06-16 | 南京贝湾信息科技有限公司 | Word searching display method and device based on paraphrasing disambiguation and computing equipment |
CN111783418A (en) * | 2020-06-09 | 2020-10-16 | 北京北大软件工程股份有限公司 | Chinese meaning representation learning method and device |
CN111783418B (en) * | 2020-06-09 | 2024-04-05 | 北京北大软件工程股份有限公司 | Chinese word meaning representation learning method and device |
CN111914569A (en) * | 2020-08-10 | 2020-11-10 | 哈尔滨安天科技集团股份有限公司 | Prediction method and device based on fusion map, electronic equipment and storage medium |
CN113761196A (en) * | 2021-07-28 | 2021-12-07 | 北京中科模识科技有限公司 | Text clustering method and system, electronic device and storage medium |
CN113761196B (en) * | 2021-07-28 | 2024-02-20 | 北京中科模识科技有限公司 | Text clustering method and system, electronic equipment and storage medium |
CN113723116A (en) * | 2021-08-25 | 2021-11-30 | 科大讯飞股份有限公司 | Text translation method and related device, electronic equipment and storage medium |
CN113723116B (en) * | 2021-08-25 | 2024-02-13 | 中国科学技术大学 | Text translation method and related device, electronic equipment and storage medium |
CN113723101A (en) * | 2021-09-09 | 2021-11-30 | 国网电子商务有限公司 | Word sense disambiguation method and device applied to intention recognition |
CN114943235A (en) * | 2022-07-12 | 2022-08-26 | 长安大学 | Named entity recognition method based on multi-class language model |
Also Published As
Publication number | Publication date |
---|---|
CN109033307B (en) | 2021-08-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||