CN115048924A - Negative sentence identification method based on negative prefix and suffix information - Google Patents

Negative sentence identification method based on negative prefix and suffix information

Info

Publication number
CN115048924A
CN115048924A
Authority
CN
China
Prior art keywords
negative
sentence
word
feature representation
layer feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210976289.5A
Other languages
Chinese (zh)
Other versions
CN115048924B (en)
Inventor
李寿山 (Li Shoushan)
李雅梦 (Li Yameng)
周国栋 (Zhou Guodong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202210976289.5A
Publication of CN115048924A
Application granted
Publication of CN115048924B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/279 Recognition of textual entities
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a negative sentence recognition method based on negative prefix and suffix (affix) information. First, a word training set is used to train an auxiliary task model that captures information about words carrying negative affixes. Then, a sentence training set is used to train a main task model that identifies negative sentences: during training of the main task model, the trained auxiliary task model produces a first hidden-layer feature representation for each word with a negative affix in the sentence, and this representation is inserted into the second hidden-layer feature representation of the sentence to update the hidden-layer representation of the whole sentence. Finally, the trained main task model and auxiliary task model together recognize target sentences. The method models negation-word recognition as a matching task; by recognizing words with negative affixes in a sentence and updating the sentence's hidden-layer feature representation, it can greatly improve the accuracy of negative sentence recognition.

Description

Negative sentence identification method based on negative prefix and suffix information
Technical Field
The invention relates to the technical field of natural language processing, in particular to a negative sentence identification method.
Background
The negative sentence recognition task aims to automatically determine whether an input sentence is a negative sentence containing negation cue words, and is a basic task in negation understanding. Negation is a common phenomenon in natural language and a core part of natural language description. Many natural language processing tasks require understanding negation in order to better capture the semantic information of text, for example sentiment analysis, question answering, knowledge graph completion, and natural language inference. In these tasks, detecting negation cue words and negation scope probes a model's ability to learn and represent high-level morphological and syntactic knowledge about the usage of a given token in a sentence.
With the development of deep learning, negative sentence recognition has shifted from sequence labeling to deep neural network models and their many derivatives; researchers study the task with complex models and large amounts of annotated data, and the emergence of pre-trained models has further advanced the field of natural language processing. The most widespread current approach uses a pre-trained model and formulates the problem as: input a piece of text, output a label.
The prior art mainly comprises the following steps: (1) experts label a large number of texts with different polarity labels, each text serving as one sample, yielding an annotated corpus of labeled samples; (2) a model acquires classification capability by training on the annotated corpus, in one of two ways: a sequence-labeling approach, which recognizes and matches negation cue words to obtain a classification model, or a deep-learning approach (typically a recurrent neural network, a pre-trained language model, or similar), which trains a classification model directly; (3) the classification model is applied to a text with an unknown label to obtain the polarity label of that text. At test time, a single text is input to the classification model at a time.
In the sequence-labeling approach of the second step, the network that identifies negative sentences by recognizing and matching negation cue words consists of an Encoder layer and an FC (fully connected) layer. The Encoder layer extracts features of the text; common choices include LSTM and BERT. The FC layer maps text features to the label categories of the text. A piece of text is input and encoded to obtain an encoding vector, i.e., the features of the text; the fully connected layer then maps these features to each word in the text, negation cue words in the text are identified, and the text is finally classified. Alternatively, negation cue words can be recognized directly by building a word list, or by using a CRF (conditional random field) combined with feature engineering, again yielding a classification of the text.
The deep-learning network of the second step likewise comprises an Encoder layer and an FC fully connected layer. The Encoder layer extracts features of the text (common choices include LSTM and BERT), and the FC layer maps text features to label categories. A piece of text is input and encoded to obtain its features, and the fully connected layer then classifies the text directly.
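For illustration, a minimal sketch of this prior-art Encoder + FC classifier (not the method of the invention) is given below, using PyTorch and the HuggingFace Transformers library. The class name, the choice of bert-base-uncased, and the use of the [CLS] vector as the text feature are assumptions for the sketch, not details from the disclosure.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EncoderFCClassifier(nn.Module):
    # Prior-art shape: an Encoder layer that extracts text features,
    # followed by an FC layer that maps the features to label categories.
    def __init__(self, encoder_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # Encoder layer (e.g., BERT)
        self.fc = nn.Linear(self.encoder.config.hidden_size, num_labels)  # FC layer

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        text_feature = out.last_hidden_state[:, 0]  # [CLS] vector as the text feature
        return self.fc(text_feature)  # logits over the label categories

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = EncoderFCClassifier()
batch = tokenizer(["I am not happy with this service."], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
prediction = logits.argmax(dim=-1)  # assumed label order: 0 = non-negative, 1 = negative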
Much current research focuses on improving the representational capacity of models or on probing the knowledge acquired through language modeling. With deep learning based on pre-trained models, what the model learns may be only statistical regularities in the data, without a true understanding of negation. Our experimental results show that, in the negative sentence recognition task, the BERT model often errs on English words with negative affixes (e.g., 'in-', 'im-', '-less'). For example, given the sentence "On my marking that is I was consistent in the word of doing the same thing that you expressed the input," BERT misclassifies it as a non-negative sentence because it makes an error on the word "input", which contains a negative affix.
In summary, because some negation information is difficult to identify, prior-art approaches such as neural networks (LSTM, pre-trained language models, etc.) and feature engineering struggle to make correct judgments, and their classification accuracy is not high enough.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a negative sentence identification method, based on negative prefix and suffix information, that has high recognition accuracy.
In order to solve the above problem, the present invention provides a negative sentence identification method based on negative prefix and suffix information, including the steps of:
s1, training an auxiliary task model by utilizing a word training set, wherein the word training set consists of well-labeled words with negative prefixes and suffixes, and the auxiliary task model comprises a first sequence encoder, a first linear layer and a first softmax activation layer; the method comprises the following steps:
s11, splicing the embedded vectors of the words with the negative suffixes in the word training set and the embedded vectors of the words without the negative suffixes, inputting the spliced embedded vectors into the first sequence encoder, and outputting hidden layer feature representation indicating whether the two words are matched or not by the first sequence encoder, wherein the hidden layer feature representation is recorded as first hidden layer feature representation;
s12, mapping the hidden layer feature representation whether the two words are matched or not into a first label set by the first linear layer, and normalizing through a first softmax activation layer to obtain a prediction label of the word with a negative prefix-suffix; the first label set comprises two prediction labels of a non-negative word and a negative word;
s2, training a main task model by utilizing a sentence training set, wherein the sentence training set is formed by marked sentences containing words with negative suffixes, and the main task model comprises a second sequence encoder, a second linear layer and a second softmax activation layer; the method comprises the following steps:
s21, splitting sentences in the sentence training set into word sequences, and inputting the word sequences into the second sequence encoder, wherein the second sequence encoder outputs hidden layer feature representation of the sentences, and the hidden layer feature representation is recorded as second hidden layer feature representation;
s22, inputting the words with negative suffixes in the sentences into the trained auxiliary task model to obtain a first hidden layer feature representation and a prediction label of the words; inputting a second hidden layer feature representation into the second linear layer if the predicted tag is a non-negative word; if the predicted tag is a negative word, inserting the first hidden layer feature representation of the predicted tag into a second hidden layer feature representation of the sentence where the predicted tag is located to update the second hidden layer feature representation, and inputting the updated second hidden layer feature representation into the second linear layer;
s23, the second linear layer maps the input second hidden layer representation into a second label set, and normalization is carried out through a second softmax activation layer, so that a prediction label of a sentence comprising a word with a negative suffix is obtained; the second label set comprises two prediction labels of a non-negative sentence and a negative sentence;
and S3, recognizing a target sentence using the trained main task model and the trained auxiliary task model, and predicting whether the target sentence is a negative sentence or a non-negative sentence.
As a further improvement of the present invention, in step S11, the concatenation of the embedding vector of a word with a negative affix in the word training set and the embedding vector of the word with the affix removed is expressed as:

$x_{Aux} = [e(w); e(\tilde{w})]$

where $w$ denotes a word with a negative affix, $\tilde{w}$ denotes $w$ with the negative affix removed, and $e(w)$ and $e(\tilde{w})$ denote their embedding vector representations.
As a further improvement of the present invention, in step S11, the first sequence encoder outputs the hidden-layer feature representation of whether the two words match, as follows:

$h_{Aux} = \mathrm{Encoder1}(x_{Aux})$

where $h_{Aux}$ is the hidden-layer feature representation of whether the two words match, and Encoder1 is the encoding function of the first sequence encoder.
As a further improvement of the present invention, in step S12, the predicted label of a word with a negative affix is obtained as follows:

$y_{Aux} = \mathrm{softmax1}(W_1 h_{Aux})$, $\hat{y}_{Aux} = \arg\max(y_{Aux})$

where softmax1 is the softmax function of the first softmax activation layer, $W_1$ is the weight matrix to be learned in the first linear layer, $y_{Aux}$ is the prediction probability of the auxiliary task model, and $\hat{y}_{Aux}$ is the prediction result of the auxiliary task model, mapped into the first label set {non-negation word, negation word}.
As a further improvement of the present invention, in step S21, a sentence in the sentence training set is split into a word sequence as follows:

$x_{Main} = \{w_1, w_2, \ldots, w_n\}$

where $n$ is the number of words in the sentence, $w_i$ ($i = 1, 2, \ldots, n$) is the $i$-th word in the sentence, and $x_{Main}$ is the word sequence split from the sentence.
As a further refinement of the present invention, in step S21, the second sequence encoder outputs the hidden-layer feature representation of a sentence as follows:

$H_{Main} = \{h_1, h_2, \ldots, h_n\} = \mathrm{Encoder2}(x_{Main})$

where $h_i$ is the hidden-layer feature representation of the $i$-th word in the sentence, $H_{Main}$ is the hidden-layer feature representation of the sentence, and Encoder2 is the encoding function of the second sequence encoder.
As a further improvement of the present invention, in step S22, if the predicted label is negation word, inserting the first hidden-layer feature representation into the second hidden-layer feature representation of the sentence to update it comprises: inserting the word's first hidden-layer feature representation at the position corresponding to the word in the word sequence, obtaining the updated second hidden-layer feature representation:

$H'_{Main} = \{h_1, \ldots, h_i, h_{Aux}^{w_i}, \ldots, h_n\}$

where $h_{Aux}^{w_i}$ is the first hidden-layer feature representation of the word $w_i$ predicted to be a negation word, and $H'_{Main}$ is the updated second hidden-layer feature representation.
As a further improvement of the present invention, in step S23, the predicted label of the input sentence is obtained as follows:

$y_{Main} = \mathrm{softmax2}(W_2 H'_{Main})$, $\hat{y}_{Main} = \arg\max(y_{Main})$

where softmax2 is the softmax function of the second softmax activation layer, $W_2$ is the weight matrix to be learned in the second linear layer, $y_{Main}$ is the prediction probability of the main task model, and $\hat{y}_{Main}$ is the prediction result of the main task model, mapped into the second label set {non-negative sentence, negative sentence}.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any one of the above methods when executing the program.
The invention also provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of any of the methods described above.
The invention has the beneficial effects that:
the invention discloses a negative sentence recognition method based on negative prefix and suffix information, which comprises the steps of firstly training an auxiliary task model by utilizing a word training set and obtaining information of words with negative prefixes and suffixes, then training a main task model by utilizing a sentence training set and recognizing the negative sentences, in the training process of the main task model, obtaining a first hidden layer feature representation of the words with the negative prefixes and suffixes in the sentences by utilizing the trained auxiliary task model, inserting the first hidden layer feature representation into a second hidden layer feature representation of the sentences to train the main task model, and recognizing target sentences by utilizing the trained main task model and the trained auxiliary task model. The method and the device model the recognition of the negative words into a matching model, and can greatly improve the recognition accuracy of the negative sentences by recognizing the words with the negative suffixes in the sentences and updating the hidden layer characteristic representation of the sentences.
The foregoing is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to this description, and so that the above and other objects, features and advantages of the invention can be understood more readily, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic diagram of a negative sentence identification method based on negative prefix and suffix information in the preferred embodiment of the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings and specific embodiments, so that those skilled in the art can better understand and practice it; the embodiments, however, do not limit the invention.
Example one
As shown in fig. 1, the present embodiment discloses a negative sentence identification method based on negative prefix and suffix information, which includes the following steps:
s1, training an auxiliary task model by utilizing a word training set, wherein the word training set consists of well-labeled words with negative prefixes and suffixes, and the auxiliary task model comprises a first sequence encoder, a first linear layer and a first softmax activation layer; the method comprises the following steps:
s11, splicing the embedded vectors of the words with the negative suffixes in the word training set and the embedded vectors of the words without the negative suffixes, inputting the spliced embedded vectors into the first sequence encoder, and outputting hidden layer feature representation indicating whether the two words are matched or not by the first sequence encoder, wherein the hidden layer feature representation is recorded as first hidden layer feature representation;
Optionally, the concatenation of the embedding vector of a word with a negative affix in the word training set and the embedding vector of the word with the affix removed is expressed as:

$x_{Aux} = [e(w); e(\tilde{w})]$   (1)

where $w$ denotes a word with a negative affix, $\tilde{w}$ denotes $w$ with the negative affix removed, and $e(w)$ and $e(\tilde{w})$ denote their embedding vector representations. For example, for the word $w$ = "impossibly", the word after removal of the negative affix is $\tilde{w}$ = "possibly".
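For building such word pairs in bulk, a small helper along the following lines could generate $\tilde{w}$ candidates. It uses the 6 negative prefixes and 2 negative suffixes selected in the experiments below; the function name and length thresholds are illustrative assumptions. Note that it over-generates candidates (e.g., "inch" matches the prefix "in-"); deciding whether a pair really forms a negation word is exactly the job of the auxiliary task model.

from typing import Optional

NEG_PREFIXES = ("un", "im", "in", "il", "ir", "dis")  # 6 prefixes from the experiments below
NEG_SUFFIXES = ("less", "free")                       # 2 suffixes from the experiments below

def strip_negative_affix(word: str) -> Optional[str]:
    # Return the word with its candidate negative affix removed, or None
    # if the word carries none of the affixes considered in this embodiment.
    w = word.lower()
    for p in NEG_PREFIXES:
        if w.startswith(p) and len(w) > len(p) + 1:
            return w[len(p):]
    for s in NEG_SUFFIXES:
        if w.endswith(s) and len(w) > len(s) + 1:
            return w[:-len(s)]
    return None

assert strip_negative_affix("impossibly") == "possibly"
assert strip_negative_affix("careless") == "care"
assert strip_negative_affix("table") is None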
Optionally, the first sequence encoder outputs the hidden-layer feature representation of whether the two words match, as follows:

$h_{Aux} = \mathrm{Encoder1}(x_{Aux})$   (2)

where $h_{Aux}$ is the hidden-layer feature representation of whether the two words match, and Encoder1 is the encoding function of the first sequence encoder.
The first sequence encoder is the sequence-learning encoder of a pre-trained language model; the pre-trained language model may be BERT, XLNet, or one of their variants.
S12, the first linear layer maps the hidden-layer feature representation of whether the two words match into a first label set, and normalization through the first softmax activation layer yields the predicted label of the word with the negative affix; the first label set comprises two predicted labels: non-negation word and negation word;
Optionally, the predicted label of a word with a negative affix is obtained as follows:

$y_{Aux} = \mathrm{softmax1}(W_1 h_{Aux})$, $\hat{y}_{Aux} = \arg\max(y_{Aux})$   (3)

where softmax1 is the softmax function of the first softmax activation layer, $W_1$ is the weight matrix to be learned in the first linear layer, $y_{Aux}$ is the prediction probability of the auxiliary task model, and $\hat{y}_{Aux}$ is the prediction result of the auxiliary task model, mapped into the first label set {non-negation word, negation word}.
The auxiliary task model is trained on all words with negative affixes in the word training set, yielding the trained auxiliary task model.
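To make the auxiliary task concrete, the following is a minimal PyTorch sketch of a model with this shape (first sequence encoder, first linear layer, first softmax activation layer). The class name, the choice of bert-base-uncased, and packing the word pair ($w$, $\tilde{w}$) into a single two-segment input are assumptions for the sketch, not the exact implementation of the disclosure.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class AuxiliaryTaskModel(nn.Module):
    # Predicts whether a (word, word-with-affix-removed) pair forms a negation word.
    def __init__(self, encoder_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # first sequence encoder
        self.linear = nn.Linear(self.encoder.config.hidden_size, num_labels)  # first linear layer

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h_aux = out.last_hidden_state[:, 0]  # first hidden-layer feature representation, formula (2)
        y_aux = torch.softmax(self.linear(h_aux), dim=-1)  # first softmax activation layer, formula (3)
        return h_aux, y_aux

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
aux_model = AuxiliaryTaskModel()
pair = tokenizer("impossibly", "possibly", return_tensors="pt")  # w and w-tilde as two segments
h_aux, y_aux = aux_model(pair["input_ids"], pair["attention_mask"])
label = y_aux.argmax(dim=-1)  # assumed label order: 0 = non-negation word, 1 = negation word

Training would then minimize the cross-entropy between $y_{Aux}$ and the annotated labels by backpropagation.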
S2, training a main task model using a sentence training set, wherein the sentence training set consists of labeled sentences containing words with negative affixes, and the main task model comprises a second sequence encoder, a second linear layer and a second softmax activation layer; this step comprises:
s21, splitting sentences in the sentence training set into word sequences, and inputting the word sequences into the second sequence encoder, wherein the second sequence encoder outputs hidden layer feature representation of the sentences, and the hidden layer feature representation is recorded as second hidden layer feature representation;
Optionally, in step S21, a sentence in the sentence training set is split into a word sequence as follows:

$x_{Main} = \{w_1, w_2, \ldots, w_n\}$   (4)

where $n$ is the number of words in the sentence, $w_i$ ($i = 1, 2, \ldots, n$) is the $i$-th word in the sentence, and $x_{Main}$ is the word sequence split from the sentence.
In step S21, the second sequence encoder outputs the hidden-layer feature representation of the sentence as follows:

$H_{Main} = \{h_1, h_2, \ldots, h_n\} = \mathrm{Encoder2}(x_{Main})$   (5)

where $h_i$ is the hidden-layer feature representation of the $i$-th word in the sentence, $H_{Main}$ is the hidden-layer feature representation of the sentence, and Encoder2 is the encoding function of the second sequence encoder.
S22, inputting each word with a negative affix in the sentence into the trained auxiliary task model to obtain the word's first hidden-layer feature representation and predicted label; if the predicted label is non-negation word, inputting the second hidden-layer feature representation into the second linear layer; if the predicted label is negation word, inserting the word's first hidden-layer feature representation into the second hidden-layer feature representation of the sentence containing it to update the second hidden-layer feature representation, and inputting the updated second hidden-layer feature representation into the second linear layer;
suppose w i For a word with a negative suffix, in step S22, if the predicted tag is a negative word, inserting the first hidden-layer feature representation into the second hidden-layer feature representation of the sentence to update the second hidden-layer feature representation, including: if the predicted label is a negative word, inserting the first hidden layer feature representation of the predicted label into the position corresponding to the word in the word sequence to obtain an updated second hidden layer feature representation, wherein the updated second hidden layer feature representation comprises the following steps:
Figure 208456DEST_PATH_IMAGE014
(6)
wherein the content of the first and second substances,
Figure 398128DEST_PATH_IMAGE015
for words w tagged with a predictive negative word i Is represented by a first hidden layer of features of (1),
Figure 766793DEST_PATH_IMAGE016
is the updated second hidden layer feature representation.
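The update of formula (6) amounts to a single tensor insertion. A minimal sketch follows, assuming the auxiliary representation is inserted immediately after position i and that the shapes are (n, d) for the sentence representation and (d,) for the word representation:

import torch

def update_sentence_representation(h_main: torch.Tensor, h_aux: torch.Tensor, i: int) -> torch.Tensor:
    # Insert h_aux after position i of h_main, as in formula (6).
    # h_main: (n, d) second hidden-layer representation of the sentence;
    # h_aux: (d,) first hidden-layer representation of the negation word.
    return torch.cat([h_main[: i + 1], h_aux.unsqueeze(0), h_main[i + 1:]], dim=0)

# Example: a 5-word sentence with hidden size 768, negation word at index 2.
h_main = torch.randn(5, 768)
h_aux = torch.randn(768)
h_updated = update_sentence_representation(h_main, h_aux, i=2)  # shape (6, 768)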
S23, the second linear layer maps the input second hidden-layer representation into a second label set, and normalization through the second softmax activation layer yields the predicted label of the sentence containing a word with a negative affix; the second label set comprises two predicted labels: non-negative sentence and negative sentence;
Optionally, the predicted label of a sentence containing a word with a negative affix is obtained as follows:

$y_{Main} = \mathrm{softmax2}(W_2 H'_{Main})$, $\hat{y}_{Main} = \arg\max(y_{Main})$   (7)

where softmax2 is the softmax function of the second softmax activation layer, $W_2$ is the weight matrix to be learned in the second linear layer, $y_{Main}$ is the prediction probability of the main task model, and $\hat{y}_{Main}$ is the prediction result of the main task model, mapped into the second label set {non-negative sentence, negative sentence}.
The main task model is trained on all sentences in the sentence training set that contain words with negative affixes, yielding the trained main task model.
And S3, recognizing the target sentence by using the trained main task model and the trained auxiliary task model, and predicting whether the target sentence is a negative sentence or a non-negative sentence.
Table 1 shows the training results of the auxiliary task model in one embodiment. Macro-F1 denotes the average F1 score over all categories, 1-F1 the F1 value on positive samples, 0-F1 the F1 value on negative samples, and accuracy the classification accuracy. The experimental results show that negative-affix recognition based on BERT-base performs well: both the Macro-F1 value and the accuracy reach 0.903, and the performance on class 0 (non-negation word) and class 1 (negation word) also exceeds 0.90. This recognition capability provides a good basis for assisting the main task model.
TABLE 1
(The table is rendered only as an image in the source; per the text above, Macro-F1 and accuracy both reach 0.903, and the class-0 and class-1 F1 values exceed 0.90.)
Table 2 shows the experimental results of the negative sentence recognition task under different recognition methods. The first four methods first identify negation cue words in the sentence and then identify negative sentences; they are, in order, a negation word-list construction method, two sequence-labeling methods combining CRF with feature engineering, and a BiLSTM method. In feature engineering 1, negation cue words are divided into 4 classes, and a CRF sequence-labeling model is trained mainly on lexical features to label the cue words. In feature engineering 2, after excluding the negative-affix cases, a non-cue-word list and a high-frequency cue-word list are automatically constructed from the training corpus, and a CRF sequence-labeling model is trained by combining the negative-affix features of the cue words with lexical features. The remaining four methods identify negative sentences directly, using BiLSTM, SVM, BERT-base, and the method of the invention (also based on BERT-base), in that order. The reported results are the average over 5 different random seeds.
TABLE 2
(The table is rendered only as an image in the source; the comparative results are summarized in the analysis below.)
From the results in Table 2 it can be seen that: (1) the proposed method improves the F1 value of class-1 samples (negative sentences) most markedly, by 1.8%-14.5% over the several baseline methods, confirming that the method chiefly improves performance on class 1; (2) the improvement on class-0 samples is comparatively small, 0.6%-8.5% over the baselines; (3) the method effectively improves the overall F1 value, with Macro-F1 gains of 1.2%-14.5% over the baselines. Meanwhile, because the feature engineering of the CRF + feature engineering 2 method treats negative affixes specially, it achieves the best result apart from the proposed method, further confirming the importance of negative affixes in negative sentence recognition. These results fully verify that the negative sentence identification method based on negative affix information effectively improves negative sentence recognition performance.
For the auxiliary task model, this embodiment selects 6 prefixes that are common in English and can carry negative meaning: "un-", "im-", "in-", "il-", "ir-" and "dis-", and 2 common suffixes that can carry negative meaning: "-less" and "-free". The words in the experiments are collected from the ninth edition of the Oxford English-Chinese Dictionary and from 1.6 million English tweets collected by Go et al. Specifically, 2717 words containing negative affixes were extracted from the dictionary, and 6671 words containing negative affixes were extracted from the English tweet corpus. For each word containing a negative affix, the label has two possibilities, "negation word" and "non-negation word": a negation word has a negative meaning due to its affix, while a non-negation word's affix does not affect the positive/negative meaning of the word itself. 3000 words were randomly selected from the collected words for manual annotation by two annotators, with a third annotator adjudicating uncertain words. The Kappa value of the consistency test was 0.87. Statistics of the annotated data are shown in Table 3.
TABLE 3
(The table is rendered only as an image in the source; it reports the label distribution of the annotated word samples.)
To better validate the auxiliary task model, the data set used for the auxiliary task is guaranteed not to include any word with a negative affix that appears in the main task data set. Finally, 2000 samples balanced between positive and negative are selected from the annotated corpus for the auxiliary task experiment; their distribution is shown in Table 3. The 2000 samples are randomly partitioned into training, validation, and test sets at a 7:1:2 ratio.
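As a concrete illustration, the 7:1:2 split could be reproduced as follows; the seed and the plain-Python implementation are assumptions for the sketch.

import random

def split_7_1_2(samples, seed=42):
    # Randomly partition samples into train/validation/test sets at a 7:1:2 ratio.
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train, n_val = int(0.7 * len(shuffled)), int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_7_1_2(list(range(2000)))
assert (len(train), len(val), len(test)) == (1400, 200, 400)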
For the main task model, the experiments use the *SEM 2012 shared task data. The data in the *SEM 2012 shared task dataset is in CoNLL format, where each word's record mainly includes the current word, lemma, part-of-speech tag (POS), syntax tree, and negation information; the negation information records whether the current word is a negation cue word and whether it lies within a negation scope. 5519 sentences were extracted from the *SEM 2012 shared dataset and classified according to their annotated negation information: the training set contains 3643 sentences, of which 848 are negative and 2795 non-negative; the validation set contains 787 sentences, of which 144 are negative and 643 non-negative; the test set contains 1089 sentences, of which 235 are negative and 854 non-negative. The experiments keep the original data split of the *SEM 2012 shared task.
Example two
The embodiment discloses an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the negative sentence identification method based on negative prefix and suffix information in the first embodiment when executing the program.
EXAMPLE III
The present embodiment discloses a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the negative sentence identification method based on negative prefix and suffix information in the first embodiment.
The above embodiments are merely preferred embodiments used to fully illustrate the present invention; the scope of the invention is not limited to them. Equivalent substitutions or modifications made by those skilled in the art on the basis of the invention all fall within its scope of protection, which is defined by the claims.

Claims (10)

1. A negative sentence identification method based on negative prefix and suffix information, characterized by comprising the following steps:
S1, training an auxiliary task model using a word training set, wherein the word training set consists of labeled words carrying negative affixes (negative prefixes and suffixes), and the auxiliary task model comprises a first sequence encoder, a first linear layer and a first softmax activation layer; this step comprises:
S11, concatenating the embedding vector of a word with a negative affix in the word training set with the embedding vector of the same word with the affix removed, and inputting the concatenated vector into the first sequence encoder, which outputs a hidden-layer feature representation indicating whether the two words match, recorded as the first hidden-layer feature representation;
S12, the first linear layer maps the hidden-layer feature representation of whether the two words match into a first label set, and normalization through the first softmax activation layer yields the predicted label of the word with the negative affix; the first label set comprises two predicted labels: non-negation word and negation word;
S2, training a main task model using a sentence training set, wherein the sentence training set consists of labeled sentences containing words with negative affixes, and the main task model comprises a second sequence encoder, a second linear layer and a second softmax activation layer; this step comprises:
S21, splitting each sentence in the sentence training set into a word sequence and inputting the word sequence into the second sequence encoder, which outputs the hidden-layer feature representation of the sentence, recorded as the second hidden-layer feature representation;
S22, inputting each word with a negative affix in the sentence into the trained auxiliary task model to obtain the word's first hidden-layer feature representation and predicted label; if the predicted label is non-negation word, inputting the second hidden-layer feature representation into the second linear layer; if the predicted label is negation word, inserting the word's first hidden-layer feature representation into the second hidden-layer feature representation of the sentence containing it to update the second hidden-layer feature representation, and inputting the updated second hidden-layer feature representation into the second linear layer;
S23, the second linear layer maps the input second hidden-layer representation into a second label set, and normalization through the second softmax activation layer yields the predicted label of the sentence containing a word with a negative affix; the second label set comprises two predicted labels: non-negative sentence and negative sentence;
and S3, recognizing a target sentence using the trained main task model and the trained auxiliary task model, and predicting whether the target sentence is a negative sentence or a non-negative sentence.
2. The negative sentence identification method based on negative prefix and suffix information according to claim 1, characterized in that in step S11, the concatenation of the embedding vector of a word with a negative affix in the word training set and the embedding vector of the word with the affix removed is expressed as:

$x_{Aux} = [e(w); e(\tilde{w})]$

where $w$ denotes a word with a negative affix, $\tilde{w}$ denotes $w$ with the negative affix removed, and $e(w)$ and $e(\tilde{w})$ denote their embedding vector representations.
3. The negative sentence identification method based on negative prefix and suffix information according to claim 2, characterized in that in step S11, the first sequence encoder outputs the hidden-layer feature representation of whether the two words match, as follows:

$h_{Aux} = \mathrm{Encoder1}(x_{Aux})$

where $h_{Aux}$ is the hidden-layer feature representation of whether the two words match, and Encoder1 is the encoding function of the first sequence encoder.
4. The negative sentence identification method based on negative prefix and suffix information according to claim 3, characterized in that in step S12, the predicted label of a word with a negative affix is obtained as follows:

$y_{Aux} = \mathrm{softmax1}(W_1 h_{Aux})$, $\hat{y}_{Aux} = \arg\max(y_{Aux})$

where softmax1 is the softmax function of the first softmax activation layer, $W_1$ is the weight matrix to be learned in the first linear layer, $y_{Aux}$ is the prediction probability of the auxiliary task model, and $\hat{y}_{Aux}$ is the prediction result of the auxiliary task model, mapped into the first label set {non-negation word, negation word}.
5. The negative sentence identification method based on negative prefix and suffix information according to claim 1, characterized in that in step S21, a sentence in the sentence training set is split into a word sequence as follows:

$x_{Main} = \{w_1, w_2, \ldots, w_n\}$

where $n$ is the number of words in the sentence, $w_i$ ($i = 1, 2, \ldots, n$) is the $i$-th word in the sentence, and $x_{Main}$ is the word sequence split from the sentence.
6. The negative sentence identification method based on negative prefix and suffix information according to claim 5, characterized in that in step S21, the second sequence encoder outputs the hidden-layer feature representation of the sentence as follows:

$H_{Main} = \{h_1, h_2, \ldots, h_n\} = \mathrm{Encoder2}(x_{Main})$

where $h_i$ is the hidden-layer feature representation of the $i$-th word in the sentence, $H_{Main}$ is the hidden-layer feature representation of the sentence, and Encoder2 is the encoding function of the second sequence encoder.
7. The negative sentence identification method based on negative prefix and suffix information according to claim 6, characterized in that in step S22, if the predicted label is negation word, inserting the first hidden-layer feature representation into the second hidden-layer feature representation of the sentence to update it comprises: inserting the word's first hidden-layer feature representation at the position corresponding to the word in the word sequence, obtaining the updated second hidden-layer feature representation:

$H'_{Main} = \{h_1, \ldots, h_i, h_{Aux}^{w_i}, \ldots, h_n\}$

where $h_{Aux}^{w_i}$ is the first hidden-layer feature representation of the word $w_i$ predicted to be a negation word, and $H'_{Main}$ is the updated second hidden-layer feature representation.
8. The negative sentence identification method based on negative prefix and suffix information according to claim 7, characterized in that in step S23, the predicted label of the input sentence is obtained as follows:

$y_{Main} = \mathrm{softmax2}(W_2 H'_{Main})$, $\hat{y}_{Main} = \arg\max(y_{Main})$

where softmax2 is the softmax function of the second softmax activation layer, $W_2$ is the weight matrix to be learned in the second linear layer, $y_{Main}$ is the prediction probability of the main task model, and $\hat{y}_{Main}$ is the prediction result of the main task model, mapped into the second label set {non-negative sentence, negative sentence}.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-8 are implemented when the processor executes the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202210976289.5A 2022-08-15 2022-08-15 Negative sentence identification method based on negative prefix and suffix information Active CN115048924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210976289.5A CN115048924B (en) 2022-08-15 2022-08-15 Negative sentence identification method based on negative prefix and suffix information


Publications (2)

Publication Number Publication Date
CN115048924A (en) 2022-09-13
CN115048924B (en) 2022-12-23

Family

ID=83166782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210976289.5A Active CN115048924B (en) 2022-08-15 2022-08-15 Negative sentence identification method based on negative prefix and suffix information

Country Status (1)

Country Link
CN (1) CN115048924B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555205A (en) * 2018-05-31 2019-12-10 北京京东尚科信息技术有限公司 negative semantic recognition method and device, electronic equipment and storage medium
CN114818729A (en) * 2022-04-28 2022-07-29 阳光保险集团股份有限公司 Method, device and medium for training semantic recognition model and searching sentence
CN114896971A (en) * 2022-07-15 2022-08-12 苏州大学 Method, device and storage medium for recognizing specific prefix and suffix negative words

Also Published As

Publication number Publication date
CN115048924B (en) 2022-12-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant