CN112949319B - Method, device, processor and storage medium for marking ambiguous words in text - Google Patents


Info

Publication number
CN112949319B
CN112949319B (application CN202110270079.XA)
Authority
CN
China
Prior art keywords
context
text
word
processor
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110270079.XA
Other languages
Chinese (zh)
Other versions
CN112949319A (en)
Inventor
陆恒杨
黄渊卓
方伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN202110270079.XA
Publication of CN112949319A
Application granted
Publication of CN112949319B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The invention relates to a method, a device, a processor and a storage medium for labeling ambiguous words in text. The method comprises: inputting an original corpus to be processed; training a context-related word embedding model to obtain context-related vectors; building a semantic vector generation algorithm according to the context-related vectors, distinguishing the multiple meanings of each word, and labeling the original corpus; and outputting a pseudo document in which the ambiguity is labeled. The context-related word embedding model is used to label the different senses of polysemous words and eliminate the ambiguity among senses; compared with text processed without considering polysemy, this greatly improves the accuracy of subsequent tasks such as text processing, text classification and topic modeling.

Description

Method, device, processor and storage medium for marking ambiguous words in text
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, a device, a processor, and a storage medium for annotating ambiguous words in a text.
Background
Word embedding models are widely used in natural language processing tasks such as text mining, sentiment analysis, and text classification. Common word embedding models, such as word2vec and GloVe, learn only one vector per word, ignoring the ambiguity a word can have in different contexts. For example, the word "apple" has multiple senses: in the sentence "I like eating apples" it refers to a fruit, while in "We went to the Apple store yesterday" it refers to the name of a technology company. This phenomenon is the word ambiguity (polysemy) problem. Recent studies have shown that accounting for word ambiguity can effectively improve model performance on natural language processing tasks.
Generally, different word vectors are learned for the same word in different contexts, mainly under three learning paradigms: two-stage models, joint models, and contextualized word models. Two-stage models discover word senses by clustering the contexts of a given word; their disadvantage is a large computational cost. Joint models cluster the context vectors of a given word to jointly form its senses, overcoming the limitation of using only local context, and extend further to ambiguous-word embeddings and the like; their disadvantage is that most such methods must fix the number of senses per word in advance, which is unrealistic. Contextualized word models learn context-related vectors of words by training a bidirectional LSTM language model: every word occurrence in the corpus receives a semantic vector, the representations combine information from all layers, and the context-related vectors can easily be added to various existing NLP tasks, so word ambiguity can be discovered more flexibly. The present application therefore adopts contextualized word models for word ambiguity labeling.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to provide an ambiguous-word labeling method combining a contextual word embedding model, so as to solve the word ambiguity problem and label each word with the sense best suited to the textual context.
To solve the above technical problem, the invention provides a method for labeling ambiguous words in text, which comprises: inputting an original corpus to be processed; training a context-related word embedding model to obtain context-related vectors; building a semantic vector generation algorithm according to the context-related vectors, distinguishing the multiple meanings of each word, and labeling the original corpus; and outputting a pseudo document in which the ambiguity is labeled.
In an embodiment of the present invention, the context-related word embedding model is ELMo, and the context-related vector is computed with the ELMo algorithm as:

ELMo_k = γ · Σ_{j=0}^{L} s_j · h_{k,j}

where γ is a parameter that rescales the vector; s_j is the softmax-normalized weight of the j-th layer; and h_{k,j} is the concatenation of the forward and backward hidden states of the j-th layer.
In an embodiment of the present invention, the process of constructing the semantic vector generation algorithm is as follows:

Input the original corpus D, a dictionary dic holding each word and its corresponding semantic context-related vectors, and a cosine distance threshold ε
Initialize dic to empty
foreach document d in D do
    for i ← 0 to len(d)-1 do
        if the current sense of the word w_i is not in the dictionary dic then
            label the word as w_i#s with a new sense index s;
            assign the context-related vector v_i of w_i to the sense entry dic[w_i][s];
            add the new sense entry to the dictionary dic;
            add w_i#s to the pseudo document pd_i;
        else
            initialize minDist = 1 and minIndex = 0;
            initialize found = False;
            for k ← 0 to len(dic[w_i])-1 do
In one embodiment of the invention, the cosine distance of two semantic vectors u and v, denoted dist(u, v), is calculated as:

dist(u, v) = 1 - (u · v) / (‖u‖ ‖v‖)

In one embodiment of the invention, the calculated cosine distance dist(u, v) is compared with a preset cosine distance threshold ε: if the cosine distance is greater than the threshold ε, the two semantic vectors have different meanings in the two contexts; if the cosine distance is less than the threshold ε, the two semantic vectors have the same meaning in the two contexts.
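A minimal sketch of this comparison follows; the standard cosine distance (one minus the cosine similarity) is assumed here, and the example vectors, threshold value and function names are illustrative, not taken from the patent:

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def same_sense(u, v, epsilon):
    """Two context vectors denote the same sense iff their distance is below epsilon."""
    return cosine_distance(u, v) < epsilon

# Nearly parallel vectors -> small distance -> same sense.
print(same_sense([1.0, 0.0], [0.9, 0.1], epsilon=0.2))   # True
# Orthogonal vectors -> distance 1.0 -> different senses.
print(same_sense([1.0, 0.0], [0.0, 1.0], epsilon=0.2))   # False
```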
In one embodiment of the present invention, when two semantic vectors u and v have the same meaning in the two contexts, the context-related vector needs to be recalculated, and a new semantic vector bisector is formed from the angle bisector of the two semantic vectors, calculated as:

bisector = u / ‖u‖ + v / ‖v‖
in an embodiment of the present invention, before inputting an original corpus to be processed, the original corpus needs to be preprocessed, which includes: unifying case letters, deleting all stop words, deleting documents containing less than three words
The present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method when executing the program.
The present invention provides a processor for running a program, wherein the program executes the method.
To solve the above technical problem, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the invention provides an ambiguous-word labeling method combining a contextual word embedding model, which solves the word ambiguity problem and labels each word with the sense best suited to the textual context;
each word can have different senses in different contexts, which fall roughly into two cases: in one, the word keeps the same sense across contexts, but because the local contexts differ, its word vectors are similar yet not identical; in the other, the word carries different senses in different contexts, so its word vectors differ greatly. Both cases leave the word vectors of some ambiguous words in a text misaligned, producing ambiguity. The invention uses the context-related word embedding model to label the different senses of polysemous words and eliminate the ambiguity among senses, and compared with text processed without considering polysemy, it greatly improves the accuracy of subsequent tasks such as text processing, text classification and topic modeling.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference is now made to the following detailed description of the embodiments of the present disclosure taken in conjunction with the accompanying drawings, in which
FIG. 1 is a flow chart of the steps of the method for labeling ambiguous words in text according to the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
Referring to fig. 1, a method for labeling ambiguous words in a text according to the present invention includes inputting an original corpus to be processed; training a context-related word embedding model to obtain a context-related vector; establishing a semantic vector generation algorithm according to the context correlation vector, distinguishing a plurality of meanings of each word, and labeling an original corpus; and outputting the pseudo document labeled with the ambiguity.
Each word can have different senses in different contexts, which fall roughly into two cases: in one, the word keeps the same sense across contexts, but because the local contexts differ, its word vectors are similar yet not identical; in the other, the word carries different senses in different contexts, so its word vectors differ greatly. Both cases leave the word vectors of some ambiguous words in a text misaligned, producing ambiguity. The context-related word embedding model is used to label the different senses of polysemous words and eliminate the ambiguity among senses; compared with text processed without considering polysemy, this greatly improves the accuracy of subsequent tasks such as text processing, text classification and topic modeling.
Specifically, in this embodiment, the original corpus is preprocessed before it is input: letter case is unified and all stop words are deleted, which reduces the vocabulary size on the one hand and the computational load of the algorithm on the other, improving computational efficiency; documents containing fewer than three words have no usable context, so a context-related word embedding model cannot be trained on them, and such documents are removed before input.
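As an illustration of this preprocessing step, the following Python sketch lower-cases the text, removes stop words, and drops documents with fewer than three remaining words; the tokenizer and the tiny stop-word list are simplifying assumptions, not part of the patent:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # illustrative stop list

def preprocess(documents, min_words=3):
    """Lower-case, strip stop words, and drop documents shorter than min_words."""
    cleaned = []
    for doc in documents:
        tokens = [t for t in re.findall(r"[a-z']+", doc.lower()) if t not in STOP_WORDS]
        if len(tokens) >= min_words:   # too-short documents carry no usable context
            cleaned.append(tokens)
    return cleaned

docs = ["The Apple store opened in town", "An apple", "I like eating apples every day"]
print(preprocess(docs))
```

In practice a full stop-word list (for example, one shipped with an NLP toolkit) would replace the illustrative set.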
In this embodiment, the context-related word embedding model is the ELMo algorithm. ELMo (Embeddings from Language Models) is an existing language-model pre-training method in natural language processing (NLP). It uses a bidirectional LSTM language model composed of a forward language model and a backward language model, whose objective function is the joint maximum likelihood of the two directions. Its characteristic is that the representation of each word is a function of the entire input sentence: a bidirectional LSTM model is trained on a large corpus with a language-modeling objective, and the LSTM is then used to generate the word representations. ELMo representations are "deep" in the sense that they are a function of the internal representations of all layers of the bidirectional LSTM, which allows them to produce rich word representations: the states of the higher-level LSTM layers capture context-dependent aspects of word meaning (and can, for example, be used to disambiguate senses), while the lower-level layers capture syntactic features (and can, for example, be used for part-of-speech tagging). Combining them is advantageous for downstream NLP tasks.
In this embodiment, the context-related vector is computed with the ELMo algorithm as:

ELMo_k = γ · Σ_{j=0}^{L} s_j · h_{k,j}

where γ is a parameter that rescales the vector; s_j is the softmax-normalized weight of the j-th layer; and h_{k,j} is the concatenation of the forward and backward hidden states of the j-th layer. In this way the context-related vector is obtained.
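The weighted combination above can be sketched in plain Python; the layer values, raw weights, and γ below are made-up numbers chosen only to show the mechanics of the formula, not outputs of a trained ELMo model:

```python
import math

def elmo_vector(layer_states, raw_weights, gamma):
    """ELMo_k = gamma * sum_j s_j * h_{k,j}, with s_j = softmax(raw_weights)."""
    exps = [math.exp(w) for w in raw_weights]
    total = sum(exps)
    s = [e / total for e in exps]          # softmax-normalized layer weights s_j
    dim = len(layer_states[0])
    out = [0.0] * dim
    for s_j, h_j in zip(s, layer_states):
        for i in range(dim):
            out[i] += s_j * h_j[i]        # weighted sum over layers
    return [gamma * x for x in out]        # rescale by gamma

# Three layers (token layer plus two biLSTM layers); each state stands in
# for a concatenated forward/backward hidden state. Values are illustrative.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(elmo_vector(layers, raw_weights=[0.0, 0.0, 0.0], gamma=2.0))
```

With equal raw weights the softmax gives each layer weight 1/3, so each output component is 2 · (2/3).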
Specifically, the process of constructing the semantic vector generation algorithm is as follows:

First, input the original corpus D, a dictionary dic holding each word and its corresponding semantic context-related vectors, and a cosine distance threshold ε
Initialize dic to empty
foreach document d in D do
    for i ← 0 to len(d)-1 do
        if the current sense of the word w_i is not in the dictionary dic then
            label the word as w_i#s with a new sense index s;
            assign the context-related vector v_i of w_i to the sense entry dic[w_i][s];
            add the new sense entry to the dictionary dic;
            add w_i#s to the pseudo document pd_i;
        else
            initialize minDist = 1 and minIndex = 0;
            initialize found = False;
            for k ← 0 to len(dic[w_i])-1 do
The cosine distance of two semantic vectors u and v, denoted dist(u, v), is calculated as:

dist(u, v) = 1 - (u · v) / (‖u‖ ‖v‖)

The calculated cosine distance is compared with the preset cosine distance threshold ε: if it is greater than ε, the two semantic vectors have different meanings in the two contexts; if it is less than ε, the two semantic vectors have the same meaning in the two contexts. The specific algorithm process continues as follows:
                if dist(v_i, dic[w_i][k]) < ε and dist(v_i, dic[w_i][k]) < minDist then
                    change the value of minDist to dist(v_i, dic[w_i][k]);
                    change the value of minIndex to k;
                    change the value of found to True;
            if found is False then
                change the value of s to len(dic[w_i]);
                label the word as w_i#s;
                assign the context-related vector v_i to dic[w_i][s];
                add the new sense entry to dic[w_i][s];
                add w_i#s to pd_i;
            else
                change the value of s to minIndex;
                label the word as w_i#s;
When the two semantic vectors have the same meaning in the two contexts, the context-related vector needs to be recalculated: a new semantic vector bisector is formed from the angle bisector of the two semantic vectors, calculated as:

bisector = v_i / ‖v_i‖ + dic[w_i][s] / ‖dic[w_i][s]‖
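A small sketch of forming the angle bisector of two sense vectors; summing the unit vectors of the two inputs is one standard way to obtain the bisector direction and is the form assumed here, since the exact normalization in the patent's formula image is not legible in this text:

```python
import math

def bisector(u, v):
    """Angle bisector direction of two vectors: the sum of their unit vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return [a / nu + b / nv for a, b in zip(u, v)]

# The bisector of the two coordinate axes points along the diagonal,
# regardless of the input magnitudes.
print(bisector([2.0, 0.0], [0.0, 3.0]))  # [1.0, 1.0]
```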
The semantic vector bisector above is used to update the value of dic[w_i][s], and w_i#s is added to the pseudo document pd_i.
Each pd_i is then added to the pseudo-document collection PD, and finally the pseudo documents with the ambiguity labeled are output.
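Putting the steps above together, the following Python sketch runs the whole labeling loop over pre-computed context vectors. The data layout (parallel lists of tokens and vectors) and all function and variable names are illustrative assumptions; the word#s label format, the minDist/found bookkeeping, and the bisector merge follow the description:

```python
import math

def cos_dist(u, v):
    """Cosine distance: 1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def label_senses(docs, context_vectors, epsilon):
    """Label every token as word#s by matching its context vector against the
    sense vectors stored in dic; matching senses are merged via the angle
    bisector, non-matching occurrences open a new sense index."""
    dic = {}            # word -> list of sense vectors
    pseudo_docs = []    # the output collection PD
    for tokens, vecs in zip(docs, context_vectors):
        pd = []
        for w, v in zip(tokens, vecs):
            senses = dic.setdefault(w, [])
            if not senses:                       # word seen for the first time
                senses.append(list(v))
                s = 0
            else:
                min_dist, min_index, found = 1.0, 0, False
                for k, sv in enumerate(senses):
                    d = cos_dist(v, sv)
                    if d < epsilon and d < min_dist:
                        min_dist, min_index, found = d, k, True
                if not found:                    # a genuinely new sense
                    senses.append(list(v))
                    s = len(senses) - 1
                else:                            # same sense: merge by angle bisector
                    s = min_index
                    sv = senses[s]
                    n_v = math.sqrt(sum(a * a for a in v))
                    n_sv = math.sqrt(sum(a * a for a in sv))
                    senses[s] = [a / n_v + b / n_sv for a, b in zip(v, sv)]
            pd.append(f"{w}#{s}")
        pseudo_docs.append(pd)
    return pseudo_docs

# Two tiny documents with hand-made 2-d "context vectors".
docs = [["apple", "store"], ["apple", "pie"]]
vecs = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]
print(label_senses(docs, vecs, epsilon=0.3))   # [['apple#0', 'store#0'], ['apple#1', 'pie#0']]
```

Here the second occurrence of "apple" has a context vector orthogonal to the first, so its cosine distance (1.0) exceeds ε and it is assigned the new sense index 1.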
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to those skilled in the art in light of the above description; it is neither necessary nor possible to list all embodiments exhaustively. Obvious variations or modifications derived therefrom are intended to fall within the scope of the invention.

Claims (7)

1. A method for labeling ambiguous words in text, characterized in that the method comprises the following steps:
inputting an original corpus to be processed;
training a context-related word embedding model to obtain a context-related vector;
building a semantic vector generation algorithm according to the context-related vectors, distinguishing the multiple meanings of each word, and labeling the original corpus, specifically comprising the following steps: calculating the cosine distance of two semantic vectors and comparing it with a preset cosine distance threshold ε; if the cosine distance is greater than the threshold ε, the two semantic vectors have different meanings in the two contexts; if the cosine distance is less than the threshold ε, the two semantic vectors have the same meaning in the two contexts, in which case the context-related vector needs to be recalculated and a new semantic vector bisector is formed from the angle bisector of the two semantic vectors;
and outputting the pseudo document labeled with the ambiguity.
2. The method for labeling ambiguous words in text as recited in claim 1, characterized in that the context-related word embedding model is ELMo, and the context-related vector is computed with the ELMo algorithm as:

ELMo_k = γ · Σ_{j=0}^{L} s_j · h_{k,j}

where γ is a parameter that rescales the vector; s_j is the softmax-normalized weight of the j-th layer; and h_{k,j} is the concatenation of the forward and backward hidden states of the j-th layer.
3. The method for labeling ambiguous words in text as recited in claim 1, characterized in that the process of constructing the semantic vector generation algorithm is as follows:

Input the original corpus D, a dictionary dic holding each word and its corresponding semantic context-related vectors, and a cosine distance threshold ε
Initialize dic to empty
foreach document d in D do
    for i ← 0 to len(d)-1 do
        if the current sense of the word w_i is not in the dictionary dic then
            label the word as w_i#s with a new sense index s;
            assign the context-related vector v_i of w_i to the sense entry dic[w_i][s];
            add the new sense entry to the dictionary dic;
            add w_i#s to the pseudo document pd_i;
        else
            initialize minDist = 1 and minIndex = 0;
            initialize found = False;
            for k ← 0 to len(dic[w_i])-1 do
4. The method for labeling ambiguous words in text as recited in claim 1, characterized in that before the original corpus to be processed is input, it needs to be preprocessed, including: unifying letter case, deleting all stop words, and deleting documents containing fewer than three words.
5. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 4 when executing the program.
6. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 4.
7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
CN202110270079.XA 2021-03-12 2021-03-12 Method, device, processor and storage medium for marking ambiguous words in text Active CN112949319B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110270079.XA CN112949319B (en) 2021-03-12 2021-03-12 Method, device, processor and storage medium for marking ambiguous words in text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110270079.XA CN112949319B (en) 2021-03-12 2021-03-12 Method, device, processor and storage medium for marking ambiguous words in text

Publications (2)

Publication Number Publication Date
CN112949319A (en) 2021-06-11
CN112949319B (en) 2023-01-06

Family

ID=76229613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110270079.XA Active CN112949319B (en) 2021-03-12 2021-03-12 Method, device, processor and storage medium for marking ambiguous words in text

Country Status (1)

Country Link
CN (1) CN112949319B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101334768A (en) * 2008-08-05 2008-12-31 北京学之途网络科技有限公司 Method and system for eliminating ambiguity for word meaning by computer, and search method
CN103229137A (en) * 2010-09-29 2013-07-31 国际商业机器公司 Context-based disambiguation of acronyms and abbreviations
CN105808530A (en) * 2016-03-23 2016-07-27 苏州大学 Translation method and device in statistical machine translation
CN105912523A (en) * 2016-04-06 2016-08-31 苏州大学 Word meaning marking method and device
CN106021272A (en) * 2016-04-04 2016-10-12 上海大学 Keyword automatic extraction method based on distributed expression word vector calculation
US9760627B1 (en) * 2016-05-13 2017-09-12 International Business Machines Corporation Private-public context analysis for natural language content disambiguation
KR101799681B1 (en) * 2016-06-15 2017-11-20 울산대학교 산학협력단 Apparatus and method for disambiguating homograph word sense using lexical semantic network and word embedding
CN108153730A (en) * 2017-12-25 2018-06-12 北京奇艺世纪科技有限公司 A kind of polysemant term vector training method and device
CN109002432A (en) * 2017-06-07 2018-12-14 北京京东尚科信息技术有限公司 Method for digging and device, computer-readable medium, the electronic equipment of synonym
CN109753569A (en) * 2018-12-29 2019-05-14 上海智臻智能网络科技股份有限公司 A kind of method and device of polysemant discovery
CN110674304A (en) * 2019-10-09 2020-01-10 北京明略软件系统有限公司 Entity disambiguation method and device, readable storage medium and electronic equipment
CN111310475A (en) * 2020-02-04 2020-06-19 支付宝(杭州)信息技术有限公司 Training method and device of word sense disambiguation model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8856642B1 (en) * 2013-07-22 2014-10-07 Recommind, Inc. Information extraction and annotation systems and methods for documents
CN107844473B (en) * 2017-09-25 2020-12-18 沈阳航空航天大学 Word sense disambiguation method based on context similarity calculation
CN107967255A (en) * 2017-11-08 2018-04-27 北京广利核系统工程有限公司 A kind of method and system for judging text similarity
CN109829149A (en) * 2017-11-23 2019-05-31 中国移动通信有限公司研究院 A kind of generation method and device, equipment, storage medium of term vector model
CN110162766B (en) * 2018-02-12 2023-03-24 深圳市腾讯计算机系统有限公司 Word vector updating method and device
CN110376896A (en) * 2019-07-30 2019-10-25 浙江大学 It is a kind of that refrigerating method is optimized based on deep learning and the single heat source air-conditioning of fuzzy control


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of an Automatic Chinese Word Sense Annotation System; Ge Ruifang et al.; Computer Engineering and Applications; 2001-09-01 (No. 17); full text *
Research on Latent Semantic Indexing Optimization Based on Patent Information; Bi Chen et al.; Journal of Shanxi University (Natural Science Edition); 2014-02-15 (No. 01); full text *
An Unsupervised Word Sense Disambiguation Method Based on HowNet Sememe Word Vector Representation; Tang Gongbo et al.; Journal of Chinese Information Processing; 2015-11-15 (No. 06); full text *
Learning to Select Pseudo Labels: A Semi-supervised Method for Named Entity Recognition (in English); Li Zhenzhen et al.; Frontiers of Information Technology & Electronic Engineering; 2020-06-03 (No. 06); full text *
Research on Word Sense Disambiguation Combining Rules and Statistics; Miao Hai et al.; Computer Science; 2013-12-15 (No. 12); full text *
Research Progress on Semantic Representation with Word Vectors; Li Fenglin et al.; Information Science; 2019-05-01 (No. 05); full text *

Also Published As

Publication number Publication date
CN112949319A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
Yao et al. Bi-directional LSTM recurrent neural network for Chinese word segmentation
Kim et al. Two-stage multi-intent detection for spoken language understanding
US20200311207A1 (en) Automatic text segmentation based on relevant context
CN106909537B (en) One-word polysemous analysis method based on topic model and vector space
Rendel et al. Using continuous lexical embeddings to improve symbolic-prosody prediction in a text-to-speech front-end
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN114911892A (en) Interaction layer neural network for search, retrieval and ranking
WO2023134082A1 (en) Training method and apparatus for image caption statement generation module, and electronic device
CN112883199A (en) Collaborative disambiguation method based on deep semantic neighbor and multi-entity association
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN110874397A (en) Water army comment detection system and method based on attention mechanism
Dadas et al. Evaluation of sentence representations in polish
CN112949319B (en) Method, device, processor and storage medium for marking ambiguous words in text
CN114970467B (en) Method, device, equipment and medium for generating composition manuscript based on artificial intelligence
Permatasari et al. Human-robot interaction based on dialog management using sentence similarity comparison method
Shet et al. Segmenting multi-intent queries for spoken language understanding
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN112749251B (en) Text processing method, device, computer equipment and storage medium
Nazarizadeh et al. Using Group Deep Learning and Data Augmentation in Persian Sentiment Analysis
CN113190681A (en) Fine-grained text classification method based on capsule network mask memory attention
Chorowski et al. Read, tag, and parse all at once, or fully-neural dependency parsing
CN111985548A (en) Label-guided cross-modal deep hashing method
CN111199154A (en) Fault-tolerant rough set-based polysemous word expression method, system and medium
Seo et al. FAGON: fake news detection model using grammatical transformation on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant