CN110717015B - Neural network-based polysemous word recognition method - Google Patents

Info

Publication number: CN110717015B (granted publication of application CN110717015A)
Application number: CN201910956103.8A
Authority: CN (China)
Inventors: 姚念民, 郭顺
Assignee (original and current): Dalian University of Technology
Legal status: Active (granted)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374Thesaurus

Abstract

The invention provides a neural network-based polysemous word recognition method, belonging to the fields of data mining and natural language processing. The method uses the contextual semantics of words in a text to identify polysemous words and to generate polysemous word representations, and comprises five steps: 1) preprocessing the corpus; 2) pre-training word representations; 3) extracting contexts; 4) identifying polysemous words; 5) selecting a polysemous word representation. The invention makes full use of the properties of word vectors and automatically identifies polysemous words through differences in their contextual semantics. For a specific task, the invention further provides a method for selecting a polysemous word representation according to the context of the polysemous word, which improves both the quality of the text representation and the accuracy of the task. In addition, the method is simple to implement and widely applicable.

Description

Neural network-based polysemous word recognition method
Technical Field
The invention belongs to the fields of data mining and natural language processing, and particularly relates to a neural network-based polysemous word recognition method that can be applied to many natural language processing tasks such as text classification and sentiment analysis.
Background
Word representation is a fundamental and important task in data mining and natural language processing. In recent years, neural network-based approaches to learning distributed representations of words have attracted much attention. Among them, the well-known word2vec model stands out for its efficiency and ease of use. Word2vec trains each target word from its context and maps words with similar meanings to nearby points in vector space. By generating high-quality word representations, this model has been successful in many tasks, such as language modeling, text understanding, and machine translation.
Polysemous word recognition is a popular research problem in natural language processing. Polysemous words are words with two or more meanings; most are common words closely tied to everyday life, and most are verbs and adjectives. Because of their multiple meanings, polysemous words are expressive in puns, metaphors, metonymy, and other rhetorical uses. The task of polysemous word recognition is to let a computer automatically recognize the polysemous words present in a given paragraph or sentence and give each of them a more accurate word representation. Identifying polysemous words is significant: it improves the quality of word and segment representations, allows the emotion expressed by a sentence to be mined more accurately, and raises the accuracy of natural language processing tasks.
At present, research on polysemous word recognition is limited. Existing methods blindly train several word representations for every word in the text and do not achieve automatic recognition. Moreover, this approach not only consumes much training time but also occupies substantial storage resources.
Disclosure of Invention
The invention aims to provide a neural network-based polysemous word recognition method, which can automatically identify polysemous words in a text according to context semantics and generate word representations closer to the context semantics for each polysemous word, thereby obtaining high-quality text representations and improving the accuracy of natural language processing tasks.
The technical scheme of the invention is as follows: a neural network-based polysemous word recognition method comprises the following steps:
first, pre-processing corpus
1.1) selecting a corpus in a natural language processing task, and deleting special characters and non-recognizable characters in the text.
Second, pre-training word representation
2.1) Pre-train word vectors on the preprocessed corpus with a word vector training tool. Various models can be selected, such as the CBOW model in word2vec, doc2vecC, and improved models based on them. FIG. 1 shows schematics of the CBOW and doc2vecC models.
2.2) after the pre-training is finished, storing a word-word vector mapping table.
Third, extracting context
3.1) define a new context window and rescan the whole corpus to extract the context of each word in different sentences.
3.2) counting words in the contexts, deleting repeated words, and generating a context dictionary corresponding to each word. Each line of the dictionary records a set of words that appear in the context of a word.
3.3) mapping each context dictionary in the step 3.2) with a corresponding word to construct a word-context dictionary mapping table.
Fourthly, recognizing the ambiguous word
4.1) Load the word-context dictionary mapping table obtained in step 3.3) and perform k-means clustering (k ≥ 2) separately on the context corresponding to each word in the table. Before the clustering operation, the words in the context must be converted into their word-vector forms according to the word-word vector mapping table obtained in step 2.2). After clustering, we obtain the category each word of the context dictionary belongs to, and the center vector of each category.
4.2) Evaluate the clustering result of each word's context in the mapping table with a cluster evaluation algorithm, such as the silhouette coefficient or the Calinski-Harabasz (CH) index. The evaluation algorithm takes the word representations participating in the clustering and the category each word belongs to as input, and outputs an evaluation value. If the evaluation result for a word's context is greater than a predefined threshold, the word is judged to be polysemous.
4.3) Output the polysemous words and use the center vectors of the categories obtained in step 4.1) as word representations of their different word senses.
Fifth, selection of ambiguous word representation
The above steps complete the recognition of polysemous words and yield a word representation for each word sense of every polysemous word. In a specific task, using the polysemous word representation that matches the current semantics improves the quality of the text representation and the accuracy of the task. The operation steps for selecting a polysemous word representation are as follows:
5.1) Rescan the words in the corpus; once a target word appears in the polysemous word list, a word representation matching the current contextual semantics must be selected for it.
5.2) Obtain the context of the polysemous word with the context window.
5.3) Obtain the word vectors of the context words from the word-word vector mapping table of step 2.2) and compute their arithmetic mean as the context vector.
5.4) Compute the distance between the context vector and each of the word's sense representations. The distance can be measured in various ways, such as Euclidean distance or cosine distance.
5.5) Finally, select the sense vector closest to the context vector as the word representation of the polysemous word in the current context. FIG. 2 shows the technical scheme of the invention.
The invention has the beneficial effects that: the excellent characteristics of the word vector are fully utilized, the polysemous words are identified through the context semantic difference of the words, and the automatic identification is really realized. Meanwhile, in a specific task, the invention also provides a method for selecting the polysemous word representation through the context of the polysemous word, so that the text representation quality is improved, and the task accuracy is also improved. In addition, the implementation process of the invention is simple and convenient, and has good applicability.
Drawings
FIG. 1 is a schematic representation of the CBOW model and the doc2vecC model. Wherein, (a) represents a model architecture of CBOW; (b) a model architecture of doc2vecC is represented.
Fig. 2 is a technical scheme diagram of ambiguous word recognition.
Detailed Description
The specific embodiments described are merely illustrative of implementations of the invention and do not limit the scope of the invention. Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. As shown in fig. 2, the overall implementation flow includes five steps, and the following is a detailed description for each step:
first, pre-processing corpus
1.1) Select a corpus for the natural language processing task, denoted D = {D_1, ..., D_n}. The corpus contains n paragraphs, and D_i represents the i-th paragraph in corpus D. After deleting the special characters and non-recognizable characters in each paragraph D_i, word sequences of different lengths are obtained, denoted D_i = {w_1, w_2, ..., w_{|D_i|}}.
Second, pre-training word representation
2.1) Pre-train word vectors on the preprocessed corpus with a word vector training tool. Various models can be selected, such as the CBOW model in word2vec, doc2vecC, and improved models based on them. Here the CBOW model in word2vec is taken as an example for detailed explanation.
The CBOW model contains three network layers: an input layer, a hidden layer, and an output layer. At the input layer, the model defines a local context window of size c: the window takes the c words before and after the target word, so the context window contains 2c words in total. The model trains on each paragraph D_i in turn and takes the word representations in the context window as input. At the hidden layer, the model sums the word vectors of the input context window:

h_t = Σ_{-c ≤ j ≤ c, j ≠ 0} v(w_{t+j})

Finally, the output layer predicts, for the target word, a probability value against each word of the vocabulary. The training process repeatedly predicts the target words from the words in their context windows within the current paragraph, maximizing the objective function L of the model:

L = Σ_{i=1}^{n} Σ_{t=1}^{|D_i|} log p(w_t^{(i)} | h_t)    (1)

where w_t^{(i)} denotes the t-th word of paragraph D_i in the corpus, p(w_t^{(i)} | h_t) is the softmax probability of w_t^{(i)} given h_t, computed over the vocabulary, and V denotes the dictionary of the current corpus.
Before training begins, the parameters of the model need to be set. The dimension of the word representations can be set between 100 and 1000, and the size of the context window c between 2 and 10. Because the proposed method must identify polysemous words from context semantics, precise context information needs to be preserved in the text; therefore, all low-frequency words are retained. Other parameters use default values.
2.2) After pre-training finishes, obtain a word-word vector mapping table, denoted M = {w_1: v(w_1), ..., w_n: v(w_n)}, where v(w) denotes the pre-trained vector of word w.
Third, extracting context
3.1) Define a new context window whose size cannot be larger than the one set in step 2.1). For example, if the context window defined for the CBOW model has size 5, then the new context window set in this step should have a size between 1 and 5. Rescan the entire corpus with the new context window and extract the context of each word of the word-word vector mapping table in the different sentences, generating a set T = {T(w_1), ..., T(w_n)}, where T(w_t) is the collection of contexts of word w_t; T_m(w_1) and T_k(w_n) denote, respectively, the m-th context of word w_1 and the k-th context of word w_n. The number of contexts of different words in the set need not be the same.
3.2) Count the words in these contexts, delete repeated words, and generate a context dictionary for each target word, denoted C(w_t), i.e. the set of all words appearing in the contexts of w_t.
3.3) Construct a mapping table from words to their context dictionaries, denoted W = {w_1: C(w_1), ..., w_n: C(w_n)}. Each row of this mapping table records a word of the dictionary together with the set of all words that have appeared in its context.
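The scan of steps 3.1)-3.3) can be sketched in plain Python (the window size and toy sentences are illustrative assumptions):

```python
def build_context_dictionaries(sentences, window=2):
    """For each word, collect the set of distinct words appearing within
    `window` positions of it in any sentence (steps 3.2 and 3.3)."""
    table = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            neighbours = sent[lo:i] + sent[i + 1:hi]
            table.setdefault(word, set()).update(neighbours)
    return table  # word -> context dictionary, one row per word

sentences = [
    ["deposit", "money", "in", "the", "bank"],
    ["sit", "on", "the", "river", "bank"],
]
table = build_context_dictionaries(sentences, window=2)
# table["bank"] now holds every distinct word seen near "bank".
```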
Fourthly, recognizing the ambiguous word
4.1) Load the mapping table obtained in step 3.3) and perform k-means clustering separately on the context dictionary C(w) of each word in the table. Here k is the number of clusters, and different clusters represent different contextual semantics. Take the first row of the mapping table, w_1: C(w_1), as an example. Before the clustering operation, the words in C(w_1) are converted into their word-vector forms according to the word-word vector mapping table obtained in step 2.2). Equation (2) is the objective function of the k-means algorithm:

E = Σ_{i=1}^{k} Σ_{x ∈ C_i(w_1)} || x - μ_i ||²    (2)

where C_i(w_1) ⊆ C(w_1) denotes the set of context words of w_1 assigned to cluster i, x denotes the vector representation of a word in that cluster, and μ_i denotes the center vector of cluster i. After clustering, we obtain for w_1 the sets of context words per cluster, C_1(w_1), ..., C_k(w_1), and the corresponding center vectors μ_1(w_1), ..., μ_k(w_1). After the same operation has been applied to every word in the mapping table, we obtain the set C of all clustered contexts and the set U of all center vectors.
In a specific implementation, the number of clusters is set in the range 2-5. Since most polysemous words in a corpus carry essentially 2 word senses and more senses are rare, the context cluster number k is usually set to 2 for all words.
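As an illustration of step 4.1), here is a minimal pure-Python k-means sketch on toy 2-D points standing in for context word vectors (the points, the deterministic initialization, and k = 2 are illustrative assumptions; a real implementation would cluster the pre-trained word vectors, typically with a library such as scikit-learn):

```python
def kmeans(points, k, iters=20):
    """Minimise equation (2): the sum of squared distances of each point
    to its cluster center. Deterministic spread initialization is used
    here for reproducibility; k-means++ is the usual choice in practice."""
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest center wins
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: move each center to the mean of its cluster
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters, centers

# Two well-separated groups standing in for two context semantics of one word.
points = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 5.0), (5.2, 5.1)]
clusters, centers = kmeans(points, k=2)
```

The returned `centers` play the role of the sense center vectors μ_1, ..., μ_k used in the later steps.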
4.2) Evaluate the context clustering result of each word in the mapping table with a cluster evaluation algorithm, such as the silhouette coefficient or the Calinski-Harabasz (CH) index. Taking the silhouette coefficient as an example, the context clustering result of a word w_t is evaluated by the following steps:
4.2.1) Load the set C obtained in step 4.1) and take out the context clustering result of word w_t, i.e. C_1(w_t), ..., C_k(w_t).
4.2.2) Compute the silhouette coefficient of each word in the set. First, take a single word x in the set and compute the average distance a(x) from x to the other words of the same cluster, called the intra-cluster dissimilarity of x; the smaller a(x) is, the more x should be clustered into this class. Then compute the average distance b(x) from x to the words of the other clusters, called the dissimilarity of x to the other clusters; the larger b(x) is, the less x belongs to the other clusters. From a(x) and b(x), the silhouette coefficient of x is defined as:

s(x) = (b(x) - a(x)) / max(a(x), b(x))    (3)

4.2.3) Compute the overall silhouette coefficient of the context set of w_t:

S(w_t) = (1/d) Σ_{x} s(x)    (4)

where d denotes the total number of words in the context set of w_t. The silhouette coefficient ranges over [-1, 1], and higher values indicate better clustering. Evaluating the context clustering results of all words in the set C yields a result set S = {S(w_1), ..., S(w_n)}.
4.2.4) Define a threshold α for polysemous-word discrimination and compare each value in the set S with α; if S(w_t) > α, the word w_t is judged to be a polysemous word.
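The silhouette evaluation of step 4.2) and the threshold test of step 4.2.4) can be sketched as follows (pure Python; the toy clusters and the threshold value are illustrative assumptions):

```python
def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def silhouette(clusters):
    """Mean of s(x) = (b(x) - a(x)) / max(a(x), b(x)) over all points,
    where a(x) is the average distance to the same cluster and b(x)
    the average distance to the other clusters."""
    scores = []
    for i, cl in enumerate(clusters):
        others = [p for j, c in enumerate(clusters) if j != i for p in c]
        for x in cl:
            same = [p for p in cl if p is not x]
            a = sum(euclid(x, p) for p in same) / len(same)
            b = sum(euclid(x, p) for p in others) / len(others)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated context clusters -> high silhouette -> polysemous.
clusters = [[(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
            [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]]
alpha = 0.5  # hypothetical discrimination threshold
is_polysemous = silhouette(clusters) > alpha
```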
4.3) Output the polysemous words, and use the center vectors μ_1, ..., μ_k obtained for each polysemous word in step 4.1) as the word representations of its different word senses.
Fifth, selection of ambiguous word representation
The above steps complete the recognition of polysemous words and yield a word representation for each word sense of every polysemous word. In a specific task, using the polysemous word representation that matches the current semantics improves the quality of the text representation and the accuracy of the task. The operation steps for selecting a polysemous word representation are presented in detail below:
5.1) Rescan the words in the corpus; once a target word appears in the polysemous word list, a word representation matching the current contextual semantics must be selected for it. Let the currently processed sentence be D_t, and let w_t ∈ D_t denote a polysemous word in it.
5.2) Use the context window defined in step 3.1) to obtain the context of the polysemous word w_t, denoted context(w_t).
5.3) Obtain the word vectors of the context words from the word-word vector mapping table of step 2.2) and compute their arithmetic mean as the context vector:

v_ctx(w_t) = (1/|context(w_t)|) Σ_{w ∈ context(w_t)} v(w)    (5)
5.4) Compute the distances between v_ctx(w_t) and the word representations μ_1, ..., μ_k of the different senses of w_t. The distance measurement can take various forms, such as the Euclidean distance of equation (6) or the cosine distance of equation (7):

d_E(v_ctx, μ_i) = || v_ctx - μ_i ||    (6)

d_cos(v_ctx, μ_i) = 1 - (v_ctx · μ_i) / (|| v_ctx || · || μ_i ||)    (7)

5.5) Finally, select from μ_1, ..., μ_k the sense vector closest to the context vector v_ctx(w_t) as the word representation of the polysemous word in the current context.
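Steps 5.3)-5.5), averaging the context word vectors and choosing the nearest sense center under the Euclidean distance of equation (6), can be sketched as follows (all vectors are hypothetical toy values):

```python
def mean_vector(vectors):
    """Context vector: arithmetic mean of the context word vectors (step 5.3)."""
    n = len(vectors)
    return tuple(sum(c) / n for c in zip(*vectors))

def nearest_sense(context_vectors, sense_centers):
    """Return the index of the sense center closest to the context vector,
    using squared Euclidean distance (monotone in equation (6))."""
    ctx = mean_vector(context_vectors)
    return min(range(len(sense_centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(ctx, sense_centers[i])))

# Two hypothetical sense centers for one polysemous word, e.g. "bank".
senses = [(0.0, 0.0), (5.0, 5.0)]
# Word vectors of the words in the current context window.
context = [(4.0, 4.5), (5.5, 5.0), (4.8, 5.2)]
chosen = nearest_sense(context, senses)  # index of the matching sense
```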
The foregoing is a detailed description of the invention in connection with specific preferred embodiments, and the practice of the invention is not limited to these embodiments. Those skilled in the art may make several simple deductions or substitutions without departing from the spirit of the invention, and these should be construed as falling within the protection scope of the present invention.

Claims (7)

1. A polysemous word recognition method based on a neural network is characterized by comprising the following steps:
first, pre-processing corpus
1.1) selecting a corpus in a natural language processing task, and deleting special characters and non-recognizable characters in a text;
second, pre-training word representation
2.1) pre-training word vectors for the preprocessed corpus by using a word vector training tool; the word vector training tool comprises word2vec, doc2vecC, and improved models based on word2vec and doc2vecC;
2.2) after the pre-training is finished, storing a word-word vector mapping table;
third, extracting context
3.1) defining a new context window, rescanning the whole corpus and extracting the context of each word in different sentences;
3.2) counting words in the context corresponding to each word, deleting repeated words, and generating a context dictionary corresponding to each word; each line of the dictionary records a set of words that appear in the context of a word;
3.3) mapping each context dictionary in the step 3.2) with corresponding words to construct a word-context dictionary mapping table;
fourthly, recognizing the ambiguous word
4.1) loading the word-context dictionary mapping table obtained in the step 3.3), and respectively carrying out k-means clustering on the context corresponding to each word in the mapping table, wherein k is more than or equal to 2; before clustering operation, words in the context need to be converted into corresponding word vector forms according to the word-word vector mapping table obtained in the step 2.2); after clustering operation, obtaining the category to which each word in the context dictionary belongs and the central vector of each category;
4.2) evaluating the clustering result of the context of each word in the mapping table by using a clustering evaluation algorithm; the clustering evaluation algorithm needs to take word representation participating in clustering and the category of the word as input and output an evaluation value; when the evaluation result of the context of a word is larger than a predefined threshold value, judging the word as a polysemous word;
4.3) outputting a polysemous word, and using the central vector of each category obtained in the step 4.1) of the polysemous word as word representation of different word senses;
fifth, selection of ambiguous word representation
5.1) rescanning the words in the corpus, and once a target word appears in the polysemous word list, selecting for the polysemous word a word representation that conforms to the current contextual semantics;
5.2) obtaining the context of the ambiguous word by using a context window;
5.3) obtaining word vectors of the words in the context from the word-word vector mapping table in the step 2.2), and calculating the arithmetic mean of the word vectors as the context vector;
5.4) calculating the distance between the context vector of the word and the word representation of different word senses of the word respectively;
5.5) finally selecting the polysemous word vector closest to the context vector as the word representation of the polysemous word in the current context.
2. The method according to claim 1, wherein the corpus of step 1.1) is an arbitrary corpus related to text representation.
3. The neural network-based polysemous word recognition method according to claim 1 or 2, wherein the new context window of step 3.1) serves the same purpose as the context window in word2vec, namely defining the range from which context is extracted; and the size of the new context window defined in step 3.1) cannot be larger than the window size defined when pre-training the word representations in step 2.1).
4. The neural network-based polysemous word recognition method according to claim 1 or 2, wherein the cluster evaluation algorithm of step 4.2) comprises the silhouette coefficient and the Calinski-Harabasz (CH) index.
5. The neural network-based polysemous word recognition method according to claim 3, wherein the cluster evaluation algorithm of step 4.2) comprises the silhouette coefficient and the Calinski-Harabasz (CH) index.
6. The neural network-based polysemous word recognition method according to claim 1, 2 or 5, wherein the context window of step 5.2) is consistent with the context window defined in step 3.1); and step 5.4) adopts the Euclidean distance or the cosine distance as the distance measurement.
7. The neural network-based polysemous word recognition method according to claim 3, wherein the context window of step 5.2) is consistent with the context window defined in step 3.1); and step 5.4) adopts the Euclidean distance or the cosine distance as the distance measurement.
CN201910956103.8A 2019-10-10 2019-10-10 Neural network-based polysemous word recognition method Active CN110717015B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910956103.8A CN110717015B (en) 2019-10-10 2019-10-10 Neural network-based polysemous word recognition method


Publications (2)

Publication Number Publication Date
CN110717015A CN110717015A (en) 2020-01-21
CN110717015B true CN110717015B (en) 2021-03-26

Family

ID=69212371


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255345B (en) * 2021-06-10 2021-10-15 腾讯科技(深圳)有限公司 Semantic recognition method, related device and equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9213687B2 (en) * 2009-03-23 2015-12-15 Lawrence Au Compassion, variety and cohesion for methods of text analytics, writing, search, user interfaces
CN106909537B (en) * 2017-02-07 2020-04-07 中山大学 One-word polysemous analysis method based on topic model and vector space
CN107861939B (en) * 2017-09-30 2021-05-14 昆明理工大学 Domain entity disambiguation method fusing word vector and topic model
CN109033307B (en) * 2018-07-17 2021-08-31 华北水利水电大学 CRP clustering-based word multi-prototype vector representation and word sense disambiguation method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant