CN110717015B - Neural network-based polysemous word recognition method - Google Patents
- Publication number
- CN110717015B (application CN201910956103.8A)
- Authority
- CN
- China
- Prior art keywords
- word
- context
- polysemous
- words
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
Abstract
The invention provides a neural network-based polysemous word recognition method, belonging to the fields of data mining and natural language processing. The method uses the contextual semantics of words in text to identify polysemous words and to generate polysemous word representations, and comprises five steps: 1) corpus preprocessing; 2) pre-training word representations; 3) context extraction; 4) polysemous word identification; 5) selection of a polysemous word representation. The invention makes full use of the properties of word vectors and automatically identifies polysemous words through differences in their contextual semantics. For a specific task, the invention further provides a method for selecting a polysemous word representation according to the word's context, which improves both the quality of the text representation and the accuracy of the task. In addition, the method is simple to implement and widely applicable.
Description
Technical Field
The invention belongs to the fields of data mining and natural language processing, and in particular relates to a neural network-based polysemous word recognition method that can be applied to natural language processing tasks such as text classification and sentiment analysis.
Background
Word representation is a fundamental and important task in the fields of data mining and natural language processing. In recent years, neural-network-based approaches to learning distributed word representations have attracted attention. Among them, the well-known word2vec model stands out for its efficiency and ease of use. Word2vec trains a target word using its context and maps words with similar meanings to nearby points in vector space. Built on the high-quality word representations it generates, the model has been successful in many tasks, such as language modeling, text understanding, and machine translation.
Polysemous word recognition is a popular research problem in natural language processing. Polysemous words are words with two or more meanings; most are common words closely tied to everyday life, and most are verbs and adjectives. Owing to their multiple senses, polysemous words are expressive in rhetorical devices such as pun, metaphor, and metonymy. The task of polysemous word recognition is to let a computer automatically recognize the polysemous words present in a given paragraph or sentence and to give each such word a more accurate word representation. Identifying polysemous words is significant: it can improve the quality of word and passage representations, capture the sentiment expressed by a sentence more accurately, and raise the accuracy of natural language processing tasks.
At present, research on polysemous word recognition is limited. Existing methods simply train multiple representations for every word in the text, without achieving automatic recognition. Moreover, this approach not only consumes considerable training time but also occupies substantial storage resources.
Disclosure of Invention
The purpose of the invention is to provide a neural network-based polysemous word recognition method that automatically identifies polysemous words in text according to contextual semantics and generates, for each polysemous word, a word representation closer to the contextual semantics, thereby obtaining high-quality text representations and improving the accuracy of natural language processing tasks.
The technical scheme of the invention is as follows: a neural network-based polysemous word recognition method comprises the following steps:
first, pre-processing corpus
1.1) selecting a corpus in a natural language processing task, and deleting special characters and non-recognizable characters in the text.
Second, pre-training word representation
2.1) Pre-train word vectors on the preprocessed corpus with a word vector training tool. Various models can be selected, such as the CBOW model in word2vec, doc2vecC, and improved models based on them. FIG. 1 shows the architectures of the CBOW and doc2vecC models.
2.2) after the pre-training is finished, storing a word-word vector mapping table.
Third, extracting context
3.1) define a new context window and rescan the whole corpus to extract the context of each word in different sentences.
3.2) counting words in the contexts, deleting repeated words, and generating a context dictionary corresponding to each word. Each line of the dictionary records a set of words that appear in the context of a word.
3.3) mapping each context dictionary in the step 3.2) with a corresponding word to construct a word-context dictionary mapping table.
Fourthly, recognizing the ambiguous word
4.1) Load the word-context dictionary mapping table obtained in step 3.3) and perform k-means clustering (k ≥ 2) separately on the contexts of each word in the table. Before clustering, the words in a context must be converted to their word vector form according to the word-word vector mapping table obtained in step 2.2). After clustering, the cluster to which each word of the context dictionary belongs and the center vector of each cluster are obtained.
4.2) Evaluate the clustering result of the contexts of each word in the mapping table with a cluster evaluation algorithm, such as the silhouette coefficient or the CH index. The cluster evaluation algorithm takes the word representations that participate in the clustering and the cluster to which each belongs as input, and outputs an evaluation value. If the evaluation result for a word's contexts is greater than a predefined threshold, the word is judged to be a polysemous word.
4.3) Output the polysemous words, and use the center vectors of the clusters obtained in step 4.1) as the word representations of the different senses of each polysemous word.
Fifth, selection of ambiguous word representation
The above steps complete the recognition of polysemous words and yield a word representation for each sense of every polysemous word. In a specific task, using the polysemous word representation that matches the current semantics improves the quality of the text representation and the accuracy of the task. The operation steps for selecting a polysemous word representation are:
5.1) Rescan the words in the corpus; once a target word appears in the polysemous word table, a word representation matching the current contextual semantics must be selected for it.
5.2) obtaining the context of the ambiguous word by using the context window.
5.3) Obtain the word vectors of the words in the context from the word-word vector mapping table of step 2.2), and compute their arithmetic mean as the context vector.
5.4) Compute, respectively, the distances between the context vector of the word and its different sense representations. The distance can be measured in several ways, such as Euclidean distance or cosine distance.
5.5) Finally, select the sense vector closest to the context vector as the word representation of the polysemous word in the current context. FIG. 2 shows the technical scheme of the invention.
The beneficial effects of the invention are as follows: it makes full use of the properties of word vectors, identifies polysemous words through differences in their contextual semantics, and truly achieves automatic recognition. For a specific task, the invention further provides a method for selecting a polysemous word representation according to the word's context, improving both the quality of the text representation and the accuracy of the task. In addition, the method is simple to implement and widely applicable.
Drawings
FIG. 1 is a schematic representation of the CBOW model and the doc2vecC model. Wherein, (a) represents a model architecture of CBOW; (b) a model architecture of doc2vecC is represented.
Fig. 2 is a technical scheme diagram of ambiguous word recognition.
Detailed Description
The specific embodiments described here merely illustrate implementations of the invention and do not limit its scope. Embodiments of the invention are described in detail below with reference to the accompanying drawings. As shown in FIG. 2, the overall implementation flow comprises five steps, each of which is described in detail below:
first, pre-processing corpus
1.1) Select a corpus for the natural language processing task, denoted D = {D_1, ..., D_n}, where the corpus contains n paragraphs and D_i denotes the i-th paragraph of corpus D. After the special characters and unrecognizable characters in each paragraph D_i are deleted, word sequences of varying lengths are obtained.
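For illustration, step 1.1) can be sketched in Python as follows. The exact character classes to delete are an assumption of this sketch (here everything except letters, digits, and whitespace is removed), not a limitation of the invention:

```python
import re

def preprocess_corpus(paragraphs):
    # Drop special and unrecognizable characters, keeping letters,
    # digits, and whitespace, then split each paragraph into words.
    cleaned = []
    for p in paragraphs:
        p = re.sub(r"[^0-9A-Za-z\s]+", " ", p)
        cleaned.append(p.split())
    return cleaned
```

For a Chinese corpus, the kept character classes and the tokenizer would of course differ; the pass-over-paragraphs structure is the point here.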
Second, pre-training word representation
2.1) Pre-train word vectors on the preprocessed corpus with a word vector training tool. Various models can be selected, such as the CBOW model in word2vec, doc2vecC, and improved models based on them. Here, the CBOW model in word2vec is taken as an example for a detailed explanation.
The CBOW model contains three network layers: an input layer, a hidden layer, and an output layer. At the input layer, the model defines a local context window, denoted c: the window takes the c words before and the c words after the target word, so the context window contains 2c words in total. The model is trained with paragraph D_i as the unit, and the word representations in the context window are taken as input. At the hidden layer, the model sums the word vectors of the input context window, denoted h. Finally, the output layer predicts, for each word of the vocabulary, the probability that it is the target word. The training process repeatedly predicts the target word from the words in its context window within the current paragraph, maximizing the objective function L of the model:

L = Σ_{D_i ∈ D} Σ_{w_t ∈ D_i} log p(w_t | C(w_t)),    (1)

where w_t denotes the t-th word of paragraph D_i in the corpus, C(w_t) denotes the words in the context window of w_t, and the probability p(w_t | C(w_t)) is computed by a softmax over V, the dictionary of the current corpus.
Before training begins, the model parameters need to be set. The dimension of the word representation can be set between 100 and 1000, and the context window size c between 2 and 10. Because the proposed method identifies polysemous words from contextual semantics, precise context information must be preserved in the text; therefore all low-frequency words are retained. Other parameters use default values.
2.2) After the pre-training is finished, a word-word vector mapping table is obtained, which records each word of the dictionary together with its vector representation.
Third, extracting context
3.1) Define a new context window whose size cannot be larger than the one set in step 2.1). For example, if the context window of the CBOW model was defined with size 5, the new context window set in this step should have a size between 1 and 5. Rescan the entire corpus with this window and extract the contexts, in the different sentences, of each word of the word-word vector mapping table, generating a set T = {T(w_1), ..., T(w_n)}, where T_m(w_1) and T_k(w_n) denote the m-th context of word w_1 and the k-th context of word w_n, respectively. Furthermore, the number of contexts of different words in the set need not be the same.
3.2) Count the words in these contexts, delete the duplicates, and generate a context dictionary for each target word, denoted U(w_t), i.e. the set of all words appearing in the contexts of w_t.
3.3) Construct a mapping table of the words and their context dictionaries, each row of which records a word of the dictionary and the set of all words that have appeared in its contexts.
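Steps 3.1)-3.3) amount to one pass over the corpus; a minimal sketch (function and variable names are illustrative, not from the patent):

```python
from collections import defaultdict

def build_context_dicts(sentences, window=2):
    # Map each word to the set of distinct words that appear within
    # `window` positions of it anywhere in the corpus (steps 3.1-3.3).
    ctx = defaultdict(set)
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            ctx[w].update(sent[lo:i] + sent[i + 1:hi])
    return dict(ctx)
```

Using a set per word deduplicates the contexts, as required by step 3.2).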
Fourthly, recognizing the ambiguous word
4.1) Load the mapping table obtained in step 3.3) and perform k-means clustering separately on the context dictionary of each word in the table. Here k denotes the number of clusters, and different clusters represent different contextual semantics. Take the first row of the mapping table, the word w_1 with its context dictionary U(w_1), as an example. Before the clustering operation, the words in U(w_1) are converted to their corresponding word vector form according to the word-word vector mapping table obtained in step 2.2). Equation (2) is the objective function of the k-means algorithm:

J = Σ_{i=1}^{k} Σ_{u_{1,h} ∈ C_i^{w_1}} ‖u_{1,h} − μ_i^{w_1}‖²,    (2)

where C_i^{w_1} denotes the set of context words of w_1 assigned to cluster i, u_{1,h} denotes the vector representation of the h-th word in the context set of w_1, and μ_i^{w_1} denotes the center vector of cluster C_i^{w_1}. After clustering, the set of context words of each cluster, {C_1^{w_1}, ..., C_k^{w_1}}, and the corresponding set of center vectors, {μ_1^{w_1}, ..., μ_k^{w_1}}, are obtained. Performing the same operation on all words in the mapping table yields the set C of context clustering results and the corresponding center vectors for the whole dictionary.
In a specific implementation, the number of clusters is set in the range 2-5. Since most polysemous words in a corpus have essentially 2 word senses, and words with more senses are rare, the number of context clusters is usually set to 2 for all words.
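The clustering of step 4.1) can be sketched with a plain-Python k-means (illustrative and unoptimized; an off-the-shelf implementation such as scikit-learn's KMeans would serve equally):

```python
import random

def kmeans(vectors, k=2, iters=50, seed=0):
    # Squared Euclidean distance between two vectors.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Initialize centers by sampling k input vectors, then alternate
    # assignment and center-update steps.
    centers = random.Random(seed).sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(v, centers[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:
                centers[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels, centers
```

The returned labels give the cluster of each context word, and the returned centers are the per-cluster center vectors used later as sense representations.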
4.2) Evaluate the clustering result of the contexts of each word in the mapping table with a cluster evaluation algorithm, such as the silhouette coefficient or the CH index. Taking the silhouette coefficient as an example, the clustering result of the contexts of a word w_t is evaluated with the following specific steps:
4.2.1) Load the set C obtained in step 4.1) and take out the context clustering result of word w_t, i.e. {C_1^{w_t}, ..., C_k^{w_t}}.
4.2.2) Compute the silhouette coefficient of each word in the set. First, take a single word u from the set and compute the average distance from u to the other words of the same cluster, denoted a(u) and called the intra-cluster dissimilarity of u; the smaller a(u) is, the more u should be clustered into this class. Then compute the average distance from u to all words of the other clusters, denoted b(u) and called the dissimilarity of u to the other clusters; the larger b(u) is, the less u belongs to the other clusters. The silhouette coefficient of word u is then defined as

s(u) = (b(u) − a(u)) / max{a(u), b(u)}.    (3)

4.2.3) Average the silhouette coefficients over the context set of w_t to obtain the evaluation value

S(w_t) = (1/d) Σ_u s(u),    (4)

where d denotes the total number of words in the context set of w_t. The silhouette coefficient ranges over [−1, 1], and higher values indicate better clustering. Evaluating the context clustering results of all words in the set C yields the result set S = {S(w_1), ..., S(w_n)}.
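The silhouette evaluation of step 4.2) can be written directly; a sketch that, following the description above, uses for b(u) the average distance to all words of the other clusters (the standard silhouette takes the minimum over other clusters — with k = 2 the two coincide):

```python
import math

def avg_silhouette(vectors, labels):
    # Euclidean distance between two vectors.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    scores = []
    for i, (v, l) in enumerate(zip(vectors, labels)):
        same = [u for j, (u, m) in enumerate(zip(vectors, labels))
                if m == l and j != i]
        other = [u for u, m in zip(vectors, labels) if m != l]
        if not same or not other:
            scores.append(0.0)
            continue
        a = sum(dist(v, u) for u in same) / len(same)    # intra-cluster dissimilarity
        b = sum(dist(v, u) for u in other) / len(other)  # dissimilarity to other clusters
        scores.append((b - a) / max(a, b))
    # Mean silhouette coefficient over all context words.
    return sum(scores) / len(scores)
```

A word whose context clusters score above the threshold α is then flagged as polysemous.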
4.2.4) Define a threshold α for polysemous word discrimination and compare each value of the set S with α; if S(w_t) > α, the word w_t is judged to be a polysemous word.
4.3) Output the polysemous words, and for each polysemous word w_t use the center vectors {μ_1^{w_t}, ..., μ_k^{w_t}} of the clusters obtained in step 4.1) as the word representations of its different word senses.
Fifth, selection of ambiguous word representation
The above steps complete the recognition of polysemous words and yield a word representation for each word sense of every polysemous word. In a specific task, using the polysemous word representation that matches the current semantics improves the quality of the text representation and the accuracy of the task. The operation steps for selecting a polysemous word representation are described in detail below:
5.1) Rescan the words in the corpus. Once a target word appears in the polysemous word table, a word representation matching the current contextual semantics must be selected for it. Let the currently processed sentence be D_t, and let w_t ∈ D_t denote the polysemous word in this sentence.
5.2) Obtain the context of the polysemous word w_t using the context window defined in step 3.1), denoted N(w_t).
5.3) Obtain the word vectors of the words in this context from the word-word vector mapping table of step 2.2), and compute their arithmetic mean as the context vector:

v(w_t) = (1/|N(w_t)|) Σ_{u ∈ N(w_t)} v_u.    (5)
5.4) Compute, respectively, the distances between the context vector v(w_t) and the word representations μ_1^{w_t}, ..., μ_k^{w_t} of the different senses of w_t. The distance can be measured in several ways, such as the Euclidean distance of equation (6) or the cosine distance of equation (7):

d_euc(x, y) = ‖x − y‖,    (6)

d_cos(x, y) = 1 − (x · y) / (‖x‖ ‖y‖).    (7)
5.5) Finally, from the set {μ_1^{w_t}, ..., μ_k^{w_t}}, select the vector closest to the context vector v(w_t) as the word representation of the polysemous word in the current context.
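Steps 5.2)-5.5) reduce to averaging the context word vectors and choosing the nearest sense center; a minimal sketch using squared Euclidean distance (function names are illustrative):

```python
def select_sense(context_vectors, sense_centers):
    # Context vector: arithmetic mean of the context word vectors.
    n = len(context_vectors)
    ctx = [sum(xs) / n for xs in zip(*context_vectors)]

    # Squared Euclidean distance (monotone in the Euclidean distance,
    # so it selects the same nearest center).
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Return the index of the closest sense center.
    return min(range(len(sense_centers)),
               key=lambda j: dist2(ctx, sense_centers[j]))
```

Cosine distance could be substituted for dist2 without changing the structure.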
The foregoing describes the invention in further detail with reference to specific preferred embodiments, but the practice of the invention is not limited to these embodiments. Those skilled in the art may make simple deductions or substitutions without departing from the spirit of the invention, and such variations shall be regarded as falling within the protection scope of the invention.
Claims (7)
1. A polysemous word recognition method based on a neural network is characterized by comprising the following steps:
first, pre-processing corpus
1.1) selecting a corpus in a natural language processing task, and deleting special characters and non-recognizable characters in a text;
second, pre-training word representation
2.1) pre-training word vectors by using a word vector training tool for the preprocessed corpus; the word vector training tool comprises word2vec, doc2vecC and an improved model based on the word2vec and the doc2 vecC;
2.2) after the pre-training is finished, storing a word-word vector mapping table;
third, extracting context
3.1) defining a new context window, rescanning the whole corpus and extracting the context of each word in different sentences;
3.2) counting words in the context corresponding to each word, deleting repeated words, and generating a context dictionary corresponding to each word; each line of the dictionary records a set of words that appear in the context of a word;
3.3) mapping each context dictionary in the step 3.2) with corresponding words to construct a word-context dictionary mapping table;
fourthly, recognizing the ambiguous word
4.1) loading the word-context dictionary mapping table obtained in the step 3.3), and respectively carrying out k-means clustering on the context corresponding to each word in the mapping table, wherein k is more than or equal to 2; before clustering operation, words in the context need to be converted into corresponding word vector forms according to the word-word vector mapping table obtained in the step 2.2); after clustering operation, obtaining the category to which each word in the context dictionary belongs and the central vector of each category;
4.2) evaluating the clustering result of the context of each word in the mapping table by using a clustering evaluation algorithm; the clustering evaluation algorithm needs to take word representation participating in clustering and the category of the word as input and output an evaluation value; when the evaluation result of the context of a word is larger than a predefined threshold value, judging the word as a polysemous word;
4.3) outputting a polysemous word, and using the central vector of each category obtained in the step 4.1) of the polysemous word as word representation of different word senses;
fifth, selection of ambiguous word representation
5.1) rescanning words in the corpus, and once a target word appears in the polysemous word list, selecting a word expression which accords with the current context semantics for the polysemous word;
5.2) obtaining the context of the ambiguous word by using a context window;
5.3) obtaining word vectors of the words in the context from the word-word vector mapping table in the step 2.2), and calculating the arithmetic mean of the word vectors as the context vector;
5.4) calculating the distance between the context vector of the word and the word representation of different word senses of the word respectively;
5.5) finally selecting the polysemous word vector closest to the context vector as the word representation of the polysemous word in the current context.
2. The method according to claim 1, wherein the corpus of step 1.1) is an arbitrary corpus related to text representation.
3. The method for identifying ambiguous word based on neural network as claimed in claim 1 or 2, wherein said new context window of step 3.1) is the same as the context window in word2vec for defining the range of extracting context; the new contextual window size defined in step 3.1) cannot be larger than the window size defined when the pre-training words are represented in step 2.1).
4. The neural network-based ambiguous word recognition method of claim 1 or 2, wherein said cluster evaluation algorithm of step 4.2) comprises silhouette coefficients and CH indices.
5. The neural network-based ambiguous word recognition method of claim 3, wherein said cluster evaluation algorithm of step 4.2) comprises silhouette coefficients and CH indices.
6. The neural network-based ambiguous word recognition method of claim 1, 2 or 5, wherein said context window of step 5.2) is consistent with the context window defined in step 3.1); and 5.4) adopting an Euclidean distance or cosine distance as the distance measurement mode.
7. The neural network-based ambiguous word recognition method of claim 3, wherein said context window of step 5.2) is consistent with the context window defined in step 3.1); and 5.4) adopting an Euclidean distance or cosine distance as the distance measurement mode.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910956103.8A CN110717015B (en) | 2019-10-10 | 2019-10-10 | Neural network-based polysemous word recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110717015A CN110717015A (en) | 2020-01-21 |
CN110717015B true CN110717015B (en) | 2021-03-26 |
Family
ID=69212371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910956103.8A Active CN110717015B (en) | 2019-10-10 | 2019-10-10 | Neural network-based polysemous word recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110717015B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255345B (en) * | 2021-06-10 | 2021-10-15 | 腾讯科技(深圳)有限公司 | Semantic recognition method, related device and equipment |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9213687B2 (en) * | 2009-03-23 | 2015-12-15 | Lawrence Au | Compassion, variety and cohesion for methods of text analytics, writing, search, user interfaces |
CN106909537B (en) * | 2017-02-07 | 2020-04-07 | 中山大学 | One-word polysemous analysis method based on topic model and vector space |
CN107861939B (en) * | 2017-09-30 | 2021-05-14 | 昆明理工大学 | Domain entity disambiguation method fusing word vector and topic model |
CN109033307B (en) * | 2018-07-17 | 2021-08-31 | 华北水利水电大学 | CRP clustering-based word multi-prototype vector representation and word sense disambiguation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||