CN109033307B - CRP clustering-based word multi-prototype vector representation and word sense disambiguation method - Google Patents
- Publication number
- CN109033307B (CN201810783010.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- words
- polysemous
- clustering
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a CRP clustering-based word multi-prototype vector representation and word sense disambiguation method, which comprises the following steps. Step one: cleaning and preprocessing the texts in a massive text corpus to obtain plain text, clustering the context window representations of target polysemous words in the text corpus based on a CRP algorithm, labeling the target polysemous words in the text corpus according to cluster category, and training on the labeled text corpus to obtain multi-prototype vector representations of the polysemous words. Step two: preprocessing a target short text to obtain a short-text word sequence, identifying target polysemous words in the word sequence, calculating the similarity between the context window representation of each target polysemous word and the centroid of each cluster corresponding to that word in the text corpus, taking the word vector corresponding to the cluster category with the maximum similarity as the word vector representation of the specific sense of the polysemous word in the context, and thereby disambiguating the polysemous word. The invention solves the problem of representing polysemy in word representation and the problem of ambiguity recognition in word sense representation.
Description
Technical Field
The invention relates to the field of natural language processing, in particular to a CRP clustering-based word multi-prototype vector representation and word sense disambiguation method.
Background
Among the many tasks in natural language processing, the fundamental problem is how to represent linguistic symbols in an encoded form that a machine can process. Mapping language symbols so that words, sentences, and texts are expressed as continuous low-dimensional vectors yields a semantic vector representation of words, sentences, and texts, which is widely applied in tasks such as information retrieval, short text classification, named entity recognition, sentiment analysis, recommendation engines, and automatic text summarization.
Words are the most basic constituent units of language, and vector representations of words are widely used in natural language processing tasks. The simplest word vector representation is the one-hot representation. Its disadvantages are that the vector dimension equals the number of all words, which leads to the curse of dimensionality; that it cannot describe semantic relations between words; and that it cannot reflect the different senses of a polysemous word.
A word vector representation (word embedding or word representation) is a fixed-length, low-dimensional real-valued vector. Training on massive text yields a unique vector for each word, with the property that similar or related words lie closer together. However, because of polysemy, the same word symbol may express different senses in different contexts, and most conventional word vector representations assign each word a single vector, which cannot effectively express the different senses of a polysemous word. Each sense of a polysemous word should correspond to its own vector representation.
A word multi-prototype vector representation assigns one word vector to each sense of a polysemous word, which improves the accuracy of word representation. To obtain vector representations of the different senses of a word, clustering-based models perform word sense induction by clustering word contexts: the contexts of the original word are clustered either directly or after a semantic mapping that uses cross-language knowledge, and training then yields the word vector representation corresponding to the specific sense of the word in each context.
Methods that obtain word vector representations of polysemous words by combining the k-means clustering algorithm with neural network language model training require the parameter k (the number of clusters) to be set to a different value for each polysemous word according to its number of senses. Training a word multi-prototype vector representation based on CRP clustering does not require the number of clusters to be specified in advance, which matches the fact that different polysemous words have different numbers of senses in context.
A high-quality word sense representation captures rich semantic and syntactic information and helps word sense disambiguation; conversely, high-quality word sense disambiguation allows word sense representations to be learned better. There are two main types of word sense disambiguation methods: methods based on an external knowledge base and methods based on a corpus. Methods based on an external knowledge base distinguish and recognize the specific sense of a polysemous word by relying on the explanations or descriptions of its different senses in an external knowledge base (WordNet or HowNet), but constructing an external knowledge base or dictionary consumes a large amount of manpower and material resources. Corpus-based methods take the corpus as the knowledge resource and determine the specific sense of a word in a given context through automatic or semi-automatic learning, thereby achieving word sense disambiguation.
For polysemous words in a sentence, the text corpus is used to obtain multi-prototype vector representations of words, and the given word sense disambiguation method then determines the specific sense of each word in its context, which improves the representation efficiency of words and sentences.
As Internet technology and mobile applications have become part of daily life, people use mobile terminals to transmit information and communicate, generating massive data such as news headlines, microblog posts, product or service descriptions on shopping platforms, forum comments, intelligent interactive applications, and social conversation messages. Using machines to effectively process and understand massive short text data on the Internet has become an important research difficulty and hotspot in the fields of natural language processing and machine learning.
In similarity calculation of information retrieval, the word multi-prototype vector representation and word sense disambiguation method can distinguish the specific word sense of the polysemous word in a retrieval object, and the accuracy of word representation and calculation is improved. An effective word semantic representation and word sense disambiguation method is provided for short text retrieval in the field of information retrieval, and technical support is provided for semantic calculation.
Disclosure of Invention
The present invention aims to overcome the problems in the prior art and provides a word multi-prototype vector representation and word sense disambiguation method based on CRP clustering. The word multi-prototype vector representation assigns one word vector to each sense of a polysemous word, thereby solving the problem of representing polysemy in word representation, and the word sense disambiguation method based on the word multi-prototype vector representation solves the problem of ambiguity recognition in word sense representation.
The technical scheme of the invention is as follows: the CRP clustering-based word multi-prototype vector representation and word sense disambiguation method comprises the following steps:
step S1, cleaning and preprocessing the texts in the massive text corpus to obtain plain text, clustering the context window representations of the target polysemous words in the text corpus based on a CRP algorithm, labeling the target polysemous words in the text corpus according to cluster category, and training on the labeled text corpus to obtain multi-prototype vector representations of the polysemous words;
step S2, preprocessing the target short text to obtain a word sequence of the short text, identifying the target polysemous words in the word sequence, calculating the similarity between the context window representation of each target polysemous word and the centroid of each cluster corresponding to that word in the text corpus, taking the word vector corresponding to the cluster category with the maximum similarity as the word vector representation of the specific sense of the polysemous word in the context, and thereby disambiguating the polysemous word.
In step S1, cleaning and preprocessing the texts in the massive text corpus to obtain plain text includes: deleting texts whose word count is below a preset threshold; converting traditional Chinese characters to simplified characters; replacing English abbreviations in the text corpus with Chinese words using a Chinese-English abbreviation dictionary; segmenting the texts in the text corpus into words; removing stop words; deleting all characters other than Chinese characters and numerals; counting word frequencies; capping the word frequency of high-frequency words at a preset upper threshold; selecting words whose number of occurrences in the text corpus exceeds a preset lower threshold to build a vocabulary; and building a polysemous word list based on a polysemous word dictionary.
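Part of this preprocessing can be sketched in Python. The sketch below is illustrative only: it assumes the texts are already segmented into tokens and already converted to simplified characters (both steps normally rely on external tools that are not shown), it reads "Chinese characters and numerals" as CJK unified ideographs plus ASCII digits, and it reads the upper threshold as a cap on the stored count. The names `clean_tokens` and `build_vocab` are hypothetical.

```python
import re
from collections import Counter

# Assumed reading of "Chinese characters and numerals":
# CJK unified ideographs plus ASCII digits; everything else is deleted.
DISALLOWED = re.compile(r"[^\u4e00-\u9fff0-9]")

def clean_tokens(tokens, stop_words):
    """Strip disallowed characters from each token, then drop empties and stop words."""
    cleaned = []
    for tok in tokens:
        tok = DISALLOWED.sub("", tok)
        if tok and tok not in stop_words:
            cleaned.append(tok)
    return cleaned

def build_vocab(docs, min_doc_len, lower, upper):
    """Drop texts shorter than min_doc_len, count word frequencies,
    keep words occurring more than `lower` times, and cap counts at `upper`."""
    kept = [d for d in docs if len(d) >= min_doc_len]
    freq = Counter(tok for d in kept for tok in d)
    return {w: min(c, upper) for w, c in freq.items() if c > lower}
```

The thresholds `min_doc_len`, `lower`, and `upper` correspond to the preset thresholds in the text; their values are left to the caller.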
The context window representation of the target polysemous word in step S1 is obtained by averaging the word vectors of the words in the word's context, and the specific calculation formula is as follows:
$$\mathrm{veC} = \frac{1}{|\mathrm{Context}|} \sum_{w_i \in \mathrm{Context}} \mathrm{vec}(w_i) \tag{1}$$
where veC is the context window representation of the word, $w_i$ is the i-th word in the word's context window word set Context, and $\mathrm{vec}(w_i)$ is the initial word vector of the word $w_i$.
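The averaging of context word vectors can be illustrated with a short Python sketch. It assumes a symmetric window of `window` tokens on each side of the target, excludes the target word itself, and skips context words that have no initial vector; these choices and the function name `context_window_vector` are assumptions, not details given in the text.

```python
def context_window_vector(tokens, position, window, vectors):
    """Mean of the initial vectors of the words within `window` positions
    of `position`, excluding the target word itself.
    Returns None when no context word has a vector."""
    lo = max(0, position - window)
    hi = min(len(tokens), position + window + 1)
    context = [
        vectors[tok]
        for i, tok in enumerate(tokens[lo:hi], start=lo)
        if i != position and tok in vectors
    ]
    if not context:
        return None
    dim = len(context[0])
    return [sum(v[d] for v in context) / len(context) for d in range(dim)]
```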
Clustering the context window representations of the target polysemous word in the text corpus based on the CRP algorithm in step S1 includes the following steps:
step S101, obtaining the context window representations of all samples of the polysemous word in the text corpus;
step S102, obtaining the initial cluster centroid of the CRP clustering algorithm: either taking a random sample as the initial cluster centroid, or performing an initial clustering of the context window representations of the polysemous word with the k-means algorithm and taking the centroid of the cluster containing the most samples as the initial cluster centroid;
step S103, for the context window representation of each sample of the polysemous word, calculating the similarity between the sample and the centroid of every cluster, and obtaining the maximum similarity Smax, attained between the i-th sample and the centroid of the t-th cluster; if Smax is larger than a preset threshold alpha, assigning the i-th sample to the t-th cluster, adding 1 to the number of samples in the t-th cluster, and recalculating the centroid of the t-th cluster; otherwise, generating a new cluster, increasing the total number of clusters K by 1, setting the number of samples in the new cluster to 1, and taking sample i as the centroid of the new cluster;
step S104, obtaining the samples in each cluster, the centroid of each cluster, and the total number of clusters.
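Steps S101 to S104 can be sketched as a single pass over the samples. The sketch below is a minimal illustration under stated assumptions: cosine similarity is used as the similarity measure, the first sample seeds the first cluster, and the centroid is the running mean of its members. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def crp_threshold_cluster(samples, alpha):
    """Assign each sample to its most similar cluster if the similarity
    exceeds alpha, otherwise open a new cluster.
    Returns (labels, centroids, counts)."""
    centroids = [list(samples[0])]  # assumption: first sample seeds cluster 0
    counts = [1]
    labels = [0]
    for x in samples[1:]:
        sims = [cosine(x, c) for c in centroids]
        t = max(range(len(sims)), key=sims.__getitem__)
        if sims[t] > alpha:
            counts[t] += 1
            # incremental update of the running mean for cluster t
            centroids[t] = [c + (a - c) / counts[t]
                            for c, a in zip(centroids[t], x)]
            labels.append(t)
        else:
            centroids.append(list(x))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids, counts
```

Because the number of clusters grows only when no existing centroid is similar enough, the threshold alpha, rather than a preset k, controls how many senses are induced.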
Training on the labeled text corpus to obtain the multi-prototype vector representation of the polysemous words in step S1 includes the following steps:
step S201, labeling all samples of the target polysemous word in the text corpus according to the cluster each sample belongs to, wherein different clusters represent different senses of the target word;
step S202, executing a word vector training process based on a neural network language model on the corpus labeled with the clusters, and obtaining multi-prototype vector representations of words expressing specific senses in different contexts.
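The labeling in step S201 can be sketched as rewriting each occurrence of the target word with its cluster index appended, so that a standard word vector trainer treats each sense as a separate token. The suffix convention (1-based index appended directly to the word) and the function name `label_corpus` are assumptions for illustration; the training of step S202 is left to an external toolkit and is not shown.

```python
def label_corpus(sentences, target, labels):
    """Rewrite each occurrence of `target` as target + cluster number.
    `labels` holds 0-based cluster indices in occurrence order."""
    it = iter(labels)
    labeled = []
    for sent in sentences:
        labeled.append(
            [f"{tok}{next(it) + 1}" if tok == target else tok for tok in sent]
        )
    return labeled
```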
The word sense disambiguation of the ambiguous word in step S2 includes the following steps:
step S301, preprocessing the target short text to obtain a word sequence of the short text, and identifying polysemous words in the word sequence according to the multi-prototype vector representation of the words;
step S302, carrying out word sense disambiguation on the polysemous words, calculating the similarity between the context window representation of the words in the short text word sequence and the centroids of all clustering clusters corresponding to the words in the text corpus, and extracting word vector representation corresponding to the clustering cluster category with the maximum similarity as word vector representation of the polysemous words expressing specific word senses in the context.
Preprocessing the target short text to obtain a word sequence of the short text in step S2 includes: removing stop words and converting traditional characters into simplified characters; replacing English abbreviations in the target short text with Chinese words using a Chinese-English abbreviation dictionary; performing word segmentation on the short text; and replacing characters other than Chinese characters and numerals with special symbols.
The beneficial effects of the invention are as follows. The embodiment of the invention provides a word multi-prototype vector representation and word sense disambiguation method based on CRP clustering. A CRP-based clustering algorithm clusters the context window representations of target words in the text corpus, and training on the corpus labeled with the clusters yields word vector representations of polysemous words, which improves the accuracy of polysemous word vector representation and solves the problem of representing polysemy in word representation. For a polysemous word in a sentence, the multi-prototype vector representation of the word is used: the similarity between the context window representation of the polysemous word and the centroid of each of the word's clusters in the training samples is calculated, and the word vector corresponding to the cluster with the maximum similarity is taken as the word vector of the specific sense of the polysemous word in the context, which eliminates the ambiguity of the polysemous word.
The invention provides a CRP clustering-based word multi-prototype vector representation method, which uses a CRP-based clustering algorithm to cluster the context representations of all samples of a target polysemous word. Each cluster represents one sense of the target word, and the word multi-prototype vector representation is obtained by training on the labeled corpus. The multi-prototype vector representation can distinguish the different senses of a polysemous word, which solves the problem of representing polysemy.
The invention clusters the context window representations of all samples of the target polysemous word with a CRP-based clustering algorithm. The CRP algorithm does not require the number of clusters to be specified in advance, and the resulting number of clusters can effectively express the number of different senses of the polysemous word, which addresses the practical problem that different polysemous words have different numbers of senses. The word context window representation serves as the criterion for deciding whether occurrences belong to the same cluster, and the calculation process is simple.
The word sense disambiguation method based on the word multi-prototype vector representation can identify the polysemous words in a sentence, obtain the word vector representation of the specific sense of each word in its context, and eliminate the ambiguity of polysemous words across contexts. The similarity between the context window representation of the target polysemous word and the centroid of each cluster corresponding to the word in the text corpus is calculated, and the word vector corresponding to the cluster category with the maximum similarity is taken as the word vector of the specific sense of the polysemous word in the context, thereby disambiguating the polysemous word.
Drawings
FIG. 1 is an overall flow diagram of word multi-prototype vector representation and word sense disambiguation based on CRP clustering;
FIG. 2 is a flow diagram of CRP-based clustering of polysemous word context window representations;
FIG. 3 is the training flow of the word multi-prototype vector representation based on CRP clustering;
FIG. 4 is a flow diagram of word sense disambiguation based on the word multi-prototype vector representation;
FIG. 5 shows the noun word sense disambiguation results based on the word multi-prototype vector representation;
FIG. 6 shows the verb word sense disambiguation results based on the word multi-prototype vector representation.
Detailed Description
An embodiment of the present invention will be described in detail below with reference to the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the embodiment.
The invention discloses a CRP clustering-based word multi-prototype vector representation and word sense disambiguation method, as shown in FIG. 1. The basic idea of the invention is to construct multi-prototype vector representations of words on the basis of CRP clustering of word context representations, identify the polysemous words in sentences or short texts, eliminate their ambiguity, and obtain the word vector representation of the specific sense of each word in its context. The multi-prototype word vector representation expresses the different senses of a word in context more accurately. The specific steps are as follows:
In step S1, the texts in the massive text corpus are cleaned and preprocessed to obtain plain text: texts whose word count is below a preset threshold are deleted from the public or collected text corpus; traditional characters in the text corpus are converted into simplified characters; English abbreviations in the text corpus are replaced with Chinese words using a custom dictionary; a word segmentation system is then used to segment the texts; characters other than Chinese characters and numerals are removed by regular expression matching; stop words are removed and word frequencies are counted; the word frequency of high-frequency words is capped at a preset upper threshold; and finally, a vocabulary is built from the words whose number of occurrences in the text corpus exceeds a preset lower threshold.
In step S1, the context window representations of all samples of the polysemous word in the text corpus are obtained; the window size is set to a positive integer, and each context window representation is computed as a weighted average of the word vectors of the words in the window.
In step S1, as shown in FIG. 2, the context window representations of all samples of the target polysemous word are clustered based on the CRP algorithm, specifically:
1. An initial clustering of the context window representations of the polysemous word is performed with the k-means algorithm to obtain each cluster and its centroid.
2. The centroid of the cluster containing the most samples is taken as the initial cluster centroid of the CRP clustering algorithm.
3. For the context window representation of each sample of the polysemous word, the similarity between the sample and the centroid of every cluster is calculated, and the maximum similarity Smax, attained between the i-th sample and the centroid of the t-th cluster, is obtained.
4. If Smax is larger than a preset threshold alpha, the i-th sample is assigned to the t-th cluster, the number of samples in the t-th cluster is increased by 1, and the centroid of the t-th cluster is recalculated. Otherwise, a new cluster is generated, the total number of clusters K is increased by 1, the number of samples in the new cluster is 1, and the centroid of the new cluster is sample i.
5. The samples in each cluster, the centroid of each cluster, and the total number of clusters are obtained.
Steps 1 and 2 can be simplified by taking the first sample or a random sample as the initial cluster centroid of the CRP clustering.
In step S1, as shown in FIG. 3, training yields the multi-prototype vector representation of the words, specifically:
1. All context window representations of the target polysemous word in the text corpus are acquired.
2. The context window representations of the polysemous word are clustered based on the CRP algorithm to obtain the clusters of word context representations.
3. For the target polysemous word, the corresponding positions in the original text corpus are located from the target word and its context, and the target text corpus is labeled with the category of the cluster each sample belongs to, wherein different clusters represent different senses of the target word.
4. Steps 1, 2, and 3 are executed for each polysemous word, and the cluster categories are marked in the target text corpus.
5. Training is performed on the labeled text corpus based on a CBOW model to obtain the multi-prototype vector representation of the words.
In step S2, as shown in FIG. 4, polysemous word recognition and word sense disambiguation based on the word multi-prototype vector representation specifically include:
1. The target short text is preprocessed: stop words are removed and traditional characters are converted into simplified characters; English abbreviations in the target sentence are replaced with Chinese words using a Chinese-English abbreviation dictionary; the short text is segmented into words; and characters other than Chinese characters and numerals are replaced with special symbols, yielding the word sequence of the short text.
2. The polysemous words in the sentence are identified. Polysemous words in the word sequence are identified based on the word multi-prototype vector representation: a polysemous word has two or more word vector representations.
3. The context window representation of each polysemous word is computed as the weighted average of the word vectors of its context words. For a polysemous word that appears among the context words, the word vector corresponding to the cluster in which the word occurs most often in the text corpus is used in the calculation, and an unrecognized word is represented by the average of the word vectors in the context window.
4. The polysemous words are disambiguated one at a time in order of their number of senses, from fewest senses to most.
5. The similarity between the context window representation of the polysemous word in the short text sequence and the centroid of each training sample cluster is calculated, and the word vector corresponding to the cluster with the maximum similarity is used as the word vector representation of the polysemous word.
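Steps 2 and 4 of this procedure, identifying polysemous words and ordering them for disambiguation, can be sketched as follows. The sketch assumes a hypothetical mapping `sense_vectors` from each word to its list of sense vectors (one per cluster); ties between words with the same number of senses are broken by position, which is an assumption not stated in the text.

```python
def polysemous_in_order(tokens, sense_vectors):
    """Return (position, word) pairs for words that have two or more
    sense vectors, ordered from fewest senses to most."""
    hits = [
        (i, w) for i, w in enumerate(tokens)
        if len(sense_vectors.get(w, [])) >= 2
    ]
    return sorted(hits, key=lambda h: (len(sense_vectors[h[1]]), h[0]))
```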
Following the idea that the meaning of a word is determined by its context, the specific sense of a polysemous word in a context is obtained by calculating the similarity between the context window representation of the polysemous word and the text corpus cluster centroid corresponding to each of its word vectors; the word vector corresponding to the maximum similarity is taken as the word vector of the specific sense of the polysemous word in the context. The specific calculation is as follows:
$$\mathrm{vec}(w) = \{\, \mathrm{vec}_k(w) \mid k,\ \mathrm{Sim}(\mathrm{veC}, \mathrm{vec}_k(w)) = \max_j \mathrm{Sim}(\mathrm{veC}, \mathrm{vec}_j(w)) \,\} \tag{2}$$
where $\mathrm{vec}(w)$ is the word vector representation of the specific sense of the polysemous word w in the context window, $\mathrm{vec}_j(w)$ is the word vector representation of the centroid of the text corpus cluster corresponding to the j-th sense of the polysemous word w, and $\max_j \mathrm{Sim}(\mathrm{veC}, \mathrm{vec}_j(w))$ is the maximum of the similarities between the context window representation veC of the polysemous word w and each $\mathrm{vec}_j(w)$. The k-th word vector, which attains this maximum, is taken as the word vector of the specific sense of the word w.
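The selection in equation (2) amounts to an argmax over the senses of the word. The Python sketch below assumes cosine similarity for Sim and keeps the cluster centroids and the trained sense vectors in two parallel lists indexed by sense; both choices, and the function names, are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def disambiguate(context_vec, centroids, sense_vectors):
    """Pick the sense k whose cluster centroid is most similar to the
    context window representation; return (k, sense_vectors[k])."""
    k = max(range(len(centroids)),
            key=lambda j: cosine(context_vec, centroids[j]))
    return k, sense_vectors[k]
```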
Interpretation of terms: CRP, the Chinese Restaurant Process, is a typical Dirichlet process mixture model. Its advantage is that the number of mixture components does not need to be specified manually, which makes it suitable for clustering problems in natural language processing.
An example of the multi-prototype word vector representation of a polysemous word in different contexts is given in Table 1. Unlabeled words, such as "apple", correspond to ordinary word vector representations that do not distinguish between senses. Sense-specific words correspond to the word multi-prototype vector representation; for example, "apple 2" denotes the 2nd sense of the word "apple", the agricultural product. The word vector of "apple 1" corresponds to the sense of the IT company, and that of "apple 2" to the fruit. The word multi-prototype vector representation captures the semantic information that distinguishes the senses of a word.
TABLE 1 Closest words to each word or word sense, based on the CRP method
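For reference, the standard Chinese Restaurant Process seats the (n+1)-th customer at an existing table k with probability proportional to the number of customers $n_k$ already there, and at a new table with probability proportional to a concentration parameter. The symbol $\gamma$ is an assumed notation for that parameter, chosen to avoid confusion with the similarity threshold alpha of step S103. As described in step S103, the present method compares the maximum similarity with the threshold alpha instead of drawing from these probabilities.

```latex
P(z_{n+1} = k \mid z_{1:n}) = \frac{n_k}{n + \gamma}, \qquad
P(z_{n+1} = K + 1 \mid z_{1:n}) = \frac{\gamma}{n + \gamma}
```

Here $z_i$ is the table (cluster) of customer i, K is the current number of tables, and $\sum_{k=1}^{K} n_k = n$.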
An embodiment of a word sense disambiguation method based on word polytypic vector representations.
The word sense disambiguation test dataset comes from the Chinese corpus of SemEval-2007 Task 5. The test dataset contains 40 polysemous words, each with at least two senses. The number of senses differs from word to word: most have 2 to 4 senses, and the word with the most senses is "out", with 9 senses. For example, the word "TCM" has two senses, "practitioner of Chinese medicine" and "traditional Chinese medical science", that is, "TCM doctor" and "TCM medicine", and each sense has a different number of text instances.
In the word sense disambiguation test, for each given polysemous word in the test set, the method extracts the polysemous word and its context representation from the text instance, calculates the similarity between that context representation and the text corpus cluster centroid corresponding to each word vector in the word's multi-prototype vector representation, and obtains the multi-prototype word vector for the polysemous word together with its cluster category. The sense category expressed by that word vector is then compared with the reference standard of the test set to judge whether the disambiguation result is correct.
The noun word sense disambiguation results based on the word multi-prototype vector representation are shown in FIG. 5. The verb word sense disambiguation results based on the word multi-prototype vector representation are shown in FIG. 6.
In information retrieval, the word vector multi-prototype representation and word sense disambiguation method can identify the specific semantics of the polysemous words in the context of the retrieval object, improve the accuracy of word representation, make calculation more reasonable and make the retrieval result more accurate.
In an information retrieval application, in order to recall more similar results to a retrieval word sequence or keyword, similarity (sentence similarity, word similarity) is used to identify similar word sequences or keywords. The similarity of words can be measured by the cosine value of the included angle between two word vectors.
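The cosine measure mentioned above can be illustrated with a short Python sketch. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions made for this illustration and are not taken from the patent text.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption of this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

Two word vectors pointing in the same direction give a value of 1, and orthogonal vectors give 0, so a larger value indicates more similar words.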
For example, the multi-prototype word vector representation of the polysemous word "accounting" consists of "accounting 1" and "accounting 2". The sense of "accounting 1" is "accounting" or "accounting for profit and loss", and the sense of "accounting 2" is "autumn accounting", that is, getting even with someone after a loss or failure. The similarity between "accounting 1" and the words "settlement" and "reply" is 0.66 and 0.11, respectively, and the similarity between "accounting 2" and the words "settlement" and "reply" is 0.14 and 0.72, respectively. The similarity between "accounting 1" and "accounting 2" is only 0.25, so the different senses of the word "accounting" are clearly separated.
In information retrieval, when the retrieval object is a sentence, the similarity between the retrieval object and the retrieval target may be measured using sentence similarity. The retrieval target is preprocessed to obtain its word sequence, whose number of words is recorded as m; the polysemous words in the word sequence are identified, and the word vector representation of each word in the sequence is obtained and recorded as the set D. The sentence to be retrieved is preprocessed in the same way to obtain its word sequence, whose number of words is recorded as n; the polysemous words in the word sequence are identified, and the word vector representation of each word in the sequence is obtained and recorded as the set S.
The similarity $sim(D_i, S_j)$ between each word $D_i$ in the set D and each word $S_j$ in the set S is calculated, and the m most similar word pairs are extracted. The similarity $Sim(D, S)$ between the retrieval object S and the retrieval target D is then obtained from the sentence similarity calculation formula (Equation 3):
where the summation term represents the sum of the similarities of the m most similar word pairs, m is the number of words in the set D, and n is the number of words in the set S.
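The summation term can be sketched in Python as follows. This is only an illustration under assumptions: the names `cosine` and `best_pair_similarity_sum` are hypothetical, cosine similarity is assumed as the word similarity, and "the m most similar word pairs" is read as pairing each of the m words in D with its closest word in S. How the sum is combined with m and n in the sentence similarity formula is not specified here, so the sketch stops at the sum.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_pair_similarity_sum(target_vecs, object_vecs):
    """Sum, over the words of D, of the similarity to the closest word of S.

    target_vecs: word vectors of the retrieval target (set D, m words).
    object_vecs: word vectors of the retrieval object (set S, n words).
    Assumption: one best pair is kept per word of D, giving m pairs.
    """
    if not target_vecs or not object_vecs:
        raise ValueError("both word sets must be non-empty")
    return sum(max(cosine(d, s) for s in object_vecs) for d in target_vecs)
```

With sense-specific vectors for polysemous words, the best-matching pair for a word such as "accounting 1" is found among words used in the same sense, which is what lets the sentence score separate the two retrieval objects in the example that follows.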
For example, suppose the retrieval target contains the polysemous word "account". The retrieval target is the sentence {account for others}, and the retrieval objects are sentence 1 {account for you by a talker's house} and sentence 2 {cause them to recognize harm in the account}. After preprocessing, the word sequence sets D = {account for others}, S1 = {account for you by a talker's house} and S2 = {cause them to recognize harm in the account} are obtained. The polysemous words in the word sequence sets D, S1 and S2 are identified, and a word vector representation of each word is obtained. The similarity of each word between the retrieval target D and the retrieval objects S1 and S2 is shown in Table 2.
TABLE 2. Word similarities between the retrieval target and the retrieval objects
From Equation 3, Sim(D, S1) = 0.62 and Sim(D, S2) = 0.39. The retrieval target D therefore matches sentence S1 better, which is closer to the real context and makes the retrieval result more accurate.
The above disclosure is only for a few specific embodiments of the present invention, however, the present invention is not limited to the above embodiments, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present invention.
Claims (5)
1. The CRP clustering-based word multi-prototype vector representation and word sense disambiguation method is characterized by comprising the following steps of:
step S1, performing purification preprocessing on the texts in a massive text corpus to obtain plain texts, clustering the context window representations of the target polysemous words in the text corpus based on a CRP algorithm, marking the target polysemous words in the text corpus according to cluster category, and training on the marked text corpus to obtain the multi-prototype vector representation of the polysemous words;
the clustering of the context window representations of the target polysemous words in the text corpus based on the CRP algorithm in step S1 comprises the following steps:
step S101, obtaining the context window representation of all samples of the polysemous words in the text corpus;
step S102, obtaining an initial clustering centroid of a CRP clustering algorithm, taking a random sample as the initial clustering centroid of the CRP clustering, or performing initial clustering on the context window representation of the ambiguous word based on a k-means algorithm, and taking the clustering centroid containing the most samples as the initial clustering centroid;
step S103, for the context window representations of all samples of the polysemous word, calculating the similarity between each sample and the centroid of every existing cluster, and obtaining the maximum similarity Smax, attained between the i-th sample and the centroid of the t-th cluster; if Smax is larger than a preset threshold alpha, assigning the i-th sample to the t-th cluster, increasing the number of samples in the t-th cluster by 1, and recalculating the centroid of the t-th cluster; otherwise, generating a new cluster, whereby the total number of clusters K increases by 1, the number of samples in the new cluster is 1, and the centroid of the new cluster is sample i;
step S104, obtaining samples in each cluster, the centroid of each cluster and the total number of clusters;
step S2, preprocessing the target short text to obtain a word sequence of the short text, identifying the target polysemous words in the word sequence, calculating the similarity between the context window representation of each target polysemous word and the centroid of each cluster corresponding to that word in the text corpus, taking the word vector corresponding to the cluster category with the maximum similarity as the word vector representation of the polysemous word with its specific word sense in the context, and thereby disambiguating the word sense of the polysemous word;
the word sense disambiguation of the ambiguous word in step S2 includes the following steps:
step S301, preprocessing the target short text to obtain a word sequence of the short text, and identifying polysemous words in the word sequence according to the multi-prototype vector representation of the words;
step S302, carrying out word sense disambiguation on the polysemous words, calculating the similarity between the context window representation of the words in the short text word sequence and the centroids of all clustering clusters corresponding to the words in the text corpus, and extracting word vector representation corresponding to the clustering cluster category with the maximum similarity as word vector representation of the polysemous words expressing specific word senses in the context.
2. The CRP clustering-based word multi-prototype vector representation and word sense disambiguation method as claimed in claim 1, wherein the purification preprocessing of the texts in the massive text corpus in step S1 to obtain plain texts comprises: deleting texts whose number of words is less than a preset threshold; uniformly converting traditional Chinese characters into simplified characters; replacing the Chinese and English abbreviations in the text corpus with Chinese words by using a Chinese and English abbreviation dictionary; segmenting the texts in the text corpus into words; removing stop words; deleting characters other than Chinese characters and numbers; counting word frequencies; presetting the word frequency of high-frequency words as an upper limit threshold; selecting words whose frequency of occurrence in the text corpus is larger than a preset lower limit threshold to establish a word list; and establishing a polysemous word list based on a polysemous word dictionary.
3. The CRP clustering-based word multi-prototype vector representation and word sense disambiguation method as claimed in claim 1, wherein the context window representation of the target polysemous word in step S1 is obtained by averaging the word vectors of the words in the word's context, and the specific calculation formula is:
$$vec_c = \frac{1}{|Context|} \sum_{w_i \in Context} vec(w_i)$$
where $vec_c$ is the context window representation of the word, $w_i$ is the i-th word in the word's context window word set Context, and $vec(w_i)$ is the initial word vector of the word $w_i$.
4. The CRP clustering-based word multi-prototype vector representation and word sense disambiguation method as claimed in claim 1, wherein the training on the marked text corpus in step S1 to obtain the multi-prototype vector representation of the polysemous words comprises the following steps:
step S201, marking all samples of the target polysemous words in the text corpus according to the belonged clustering clusters, wherein different clustering clusters represent different word senses of the target words;
step S202, executing a word vector representation training process based on a neural network language model on the marked cluster, and obtaining multi-prototype vector representation of words expressing specific word senses in different contexts.
5. The CRP clustering-based word multi-prototype vector representation and word sense disambiguation method as claimed in claim 1, wherein the preprocessing of the target short text in step S2 to obtain the word sequence of the short text comprises: removing stop words; converting traditional Chinese characters into simplified characters; replacing English abbreviations in the target short text with Chinese words by using a Chinese and English abbreviation dictionary; performing word segmentation on the short text; and replacing characters other than Chinese characters and numerals with special symbols.
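The clustering steps S101 to S104, the context averaging of claim 3 and the sense selection of step S302 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`cosine`, `context_vector`, `crp_style_cluster`, `disambiguate`) are hypothetical, cosine similarity is assumed as the similarity measure, the first sample stands in for the random initial centroid of step S102, and the centroid is recomputed as a running mean of the cluster members.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_vector(context_word_vectors):
    """Average the word vectors of the words in a context window (claim 3)."""
    n = len(context_word_vectors)
    dim = len(context_word_vectors[0])
    return [sum(vec[k] for vec in context_word_vectors) / n for k in range(dim)]


def crp_style_cluster(samples, alpha):
    """Threshold rule of step S103 over context window representations.

    Returns (labels, centroids); labels[i] is the cluster index of sample i.
    Assumption: the first sample opens the first cluster (step S102).
    """
    centroids, counts, labels = [], [], []
    for x in samples:
        best_t, s_max = -1, float("-inf")
        for t, c in enumerate(centroids):
            s = cosine(x, c)
            if s > s_max:
                best_t, s_max = t, s
        if best_t >= 0 and s_max > alpha:
            # Join the most similar cluster and update its centroid
            # as the running mean of its members.
            n = counts[best_t]
            centroids[best_t] = [
                (c_k * n + x_k) / (n + 1)
                for c_k, x_k in zip(centroids[best_t], x)
            ]
            counts[best_t] = n + 1
            labels.append(best_t)
        else:
            # Open a new cluster whose centroid is the sample itself.
            centroids.append(list(x))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids


def disambiguate(context_vec, centroids):
    """Step S302: index of the cluster whose centroid is most similar."""
    return max(range(len(centroids)),
               key=lambda t: cosine(context_vec, centroids[t]))
```

In this sketch the number of clusters is not fixed in advance; it grows whenever a sample is not similar enough to any existing centroid, so the threshold alpha controls how many senses are induced for a word. The index returned by `disambiguate` would select which sense-specific word vector (for example "apple 1" or "apple 2") to use for the word in its context.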
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810783010.5A CN109033307B (en) | 2018-07-17 | 2018-07-17 | CRP clustering-based word multi-prototype vector representation and word sense disambiguation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109033307A (en) | 2018-12-18 |
CN109033307B (en) | 2021-08-31 |
Family
ID=64643470
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810783010.5A Expired - Fee Related CN109033307B (en) | 2018-07-17 | 2018-07-17 | CRP clustering-based word multi-prototype vector representation and word sense disambiguation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033307B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970729A (en) * | 2014-04-29 | 2014-08-06 | 河海大学 | Multi-subject extracting method based on semantic categories |
CN104008090A (en) * | 2014-04-29 | 2014-08-27 | 河海大学 | Multi-subject extraction method based on concept vector model |
CN104778186A (en) * | 2014-01-15 | 2015-07-15 | 阿里巴巴集团控股有限公司 | Method and system for hanging commodity object to standard product unit (SPU) |
CN106598947A (en) * | 2016-12-15 | 2017-04-26 | 山西大学 | Bayesian word sense disambiguation method based on synonym expansion |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7685201B2 (en) * | 2006-09-08 | 2010-03-23 | Microsoft Corporation | Person disambiguation using name entity extraction-based clustering |
US9830379B2 (en) * | 2010-11-29 | 2017-11-28 | Google Inc. | Name disambiguation using context terms |
US20160292149A1 (en) * | 2014-08-02 | 2016-10-06 | Google Inc. | Word sense disambiguation using hypernyms |
CN104778158B (en) * | 2015-03-04 | 2018-07-17 | 新浪网技术(中国)有限公司 | A kind of document representation method and device |
CN104731771A (en) * | 2015-03-27 | 2015-06-24 | 大连理工大学 | Term vector-based abbreviation ambiguity elimination system and method |
CN107861939B (en) * | 2017-09-30 | 2021-05-14 | 昆明理工大学 | Domain entity disambiguation method fusing word vector and topic model |
Non-Patent Citations (2)
Title |
---|
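Contextual word sense tuning and disambiguation; Basili, R. et al.; APPLIED ARTIFICIAL INTELLIGENCE; 19970531; pp. 235-262 * |
Research on Person Name Disambiguation and Person Relation Extraction Techniques Incorporating Sentence-Meaning Features; Zhang Han; China Master's Theses Full-text Database (Electronic Journal); 20150715; pp. I138-1525 * |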
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210831 |