CN114443846A - Classification method and device based on multi-level text heterogeneous graph, and electronic device - Google Patents

Classification method and device based on multi-level text heterogeneous graph, and electronic device

Info

Publication number
CN114443846A
Authority
CN
China
Prior art keywords
text
whole sentence
representation
attention
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210079337.0A
Other languages
Chinese (zh)
Inventor
李校林
赵路伟
伍晓思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210079337.0A priority Critical patent/CN114443846A/en
Publication of CN114443846A publication Critical patent/CN114443846A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention belongs to the field of artificial-intelligence natural language processing and relates to a classification method, a classification device and an electronic device based on a multi-level text heterogeneous graph. The method comprises: inputting a whole-sentence text into a BERT model to construct a multi-level text heterogeneous graph over the whole sentence, its clauses and its words, and outputting a whole-sentence feature representation; training the multi-level text heterogeneous graph with a GAT network and extracting the overall feature representation of whole sentences and clauses in the graph; fusing the whole-sentence feature representation with the overall feature representation and inputting the fusion result into a self-attention module to compute a self-attention parameter; and multiplying the self-attention parameter by the fusion result to obtain a fused vector representation, which is input into a classification module to obtain the classification result. The method splits the whole sentence to construct the multi-level heterogeneous graph, learns the structural information of the graph with a graph attention network, and fuses that information with the whole-sentence semantic representation produced by BERT, yielding an enhanced text semantic representation for final classification and improving text-classification performance.

Description

Classification method and device based on multi-level text heterogeneous graph, and electronic device
Technical Field
The invention belongs to the field of artificial-intelligence natural language processing, and in particular relates to a classification method and device based on a multi-level text heterogeneous graph, and to an electronic device.
Background
The development of Internet technology and its related industries has brought people many services, among them self-media social platforms and e-commerce. People post opinions, views and feedback on products and services on network platforms, and these comment texts form a huge body of data with high research value. Text sentiment analysis is one of the popular research directions in natural language processing: computing techniques are used to judge the sentiment polarity of an input text and thereby obtain the reviewer's emotional feedback.
With the continuous development of artificial-intelligence technology, deep learning has become the mainstream research approach in text sentiment analysis. A CNN can exploit its position invariance to extract local feature information from input text, while an RNN processes the previous time step's output together with the current input at each time step, so its output vector fully combines contextual information, which greatly helps sentiment analysis. However, a CNN's convolution and pooling operations lose information outside the sliding window, an RNN suffers from vanishing and exploding gradients, and neither network fully considers the structural information of the input text, which makes text data with complex sentiment polarity hard to judge.
In text preprocessing, although the traditional word2vec model has long been widely used, the word vectors it produces after pre-training are fixed: a given word can only correspond to a single dense vector. Word representations therefore cannot adapt to the context of the current input data set, and the polysemy that frequently occurs in Chinese text cannot be resolved.
Disclosure of Invention
To solve these problems, the invention provides a classification method, a classification device and an electronic device based on a multi-level text heterogeneous graph. A BERT-GAT model is constructed, comprising a BERT model, a GAT layer, a self-attention module and a classification module. The structural information of the multi-level heterogeneous graph is learned automatically by a graph attention network and finally fused with the whole-sentence semantic representation obtained from BERT, so that the final output vector fully combines the text's internal structural information with its contextual semantic information and is used for final classification, improving classification accuracy.
In a first aspect, the invention provides a classification method based on a multi-level text heterogeneous graph, comprising the following steps:
S1, acquiring a whole-sentence text, inputting it into a BERT model to construct a multi-level text heterogeneous graph over the whole sentence, its clauses and its words, and outputting a whole-sentence feature representation carrying whole-sentence semantic information;
S2, training the multi-level text heterogeneous graph with the GAT layer, and extracting an overall feature representation of the structural relations of the whole-sentence text in the graph;
S3, fusing the whole-sentence feature representation with the overall feature representation, and inputting the fusion result into the self-attention module to compute a self-attention parameter;
S4, multiplying the self-attention parameter by the fusion result to obtain a fused vector, and inputting the fused vector into the classification module to obtain the classification result.
Further, constructing an adjacency matrix according to the structure of the whole-sentence/clause/word multi-level text heterogeneous graph comprises:
dividing the whole-sentence text into several clauses and several words, where the whole-sentence text serves as the whole-sentence node, the whole-sentence node contains N clause nodes, and each clause node contains M word nodes;
and determining the edges between nodes according to the dependency relations between words and clauses and between clauses and the whole sentence, and constructing the adjacency matrix once the edges are determined.
Further, defining the initial feature representations of the different node types in the multi-level text heterogeneous graph with the BERT vocabulary comprises:
for a word node, obtaining the initial word representation by indexing the BERT vocabulary;
for a clause node, mapping it to an unused token in the BERT vocabulary and indexing that token to obtain the initial clause representation;
and for the whole-sentence node, taking the hidden-state representation at the [CLS] position of BERT's last layer as the initial whole-sentence representation.
Further, the main module of the GAT network is a multi-head attention mechanism, which is used to train the multi-level text heterogeneous graph and is expressed as:
$$\vec{h}_i^{(t)} = \mathop{\Vert}_{k=1}^{K} \sigma\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} \vec{h}_j \right)$$

where $\vec{h}_i^{(t)}$ denotes the state of node $i$ at the current time $t$, $\alpha_{ij}^{k}$ denotes the normalized weight parameter of the $k$-th attention mechanism, $W^{k}$ is a parameter matrix, $\Vert$ denotes the concatenation (splicing) operation, $k$ indexes the $k$-th attention head, $K$ is the total number of attention heads, $\mathcal{N}_i$ is the neighbourhood of node $i$ in the graph, and $\sigma$ is a nonlinear activation.
In a second aspect, the invention provides a classification apparatus based on a multi-level text heterogeneous graph, comprising:
a text processing module for dividing an acquired Chinese short text into a whole sentence, several clauses and several words;
a BERT model for taking the data of the text processing module, constructing a multi-level text heterogeneous graph over the whole sentence, its clauses and its words, and outputting the whole-sentence feature representation;
a GAT network for extracting the text's internal structural information from the multi-level text heterogeneous graph;
a fusion module for fusing the whole-sentence feature representation with the internal structural information, inputting the fusion result into an attention mechanism to compute an attention parameter, and multiplying the attention parameter by the fusion result to obtain a fused vector representation;
and a classification module for obtaining the classification result of the Chinese short text from the fused vector representation of the fusion module.
In a third aspect, the invention provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the classification method based on a multi-level text heterogeneous graph provided in the first aspect of the invention.
The invention has the following beneficial effects:
The invention provides a sentiment classification model based on a multi-level text heterogeneous graph. The whole sentence is split to build a multi-level heterogeneous graph over the whole sentence, its clauses and its words; a graph attention network automatically learns the structural information of the graph; and that information is finally fused with the whole-sentence semantic representation obtained from BERT, so that the final output vector fully combines the text's internal structural information with its contextual semantic information, yielding an enhanced text semantic representation for final classification.
A short text expresses one overall sentiment stance but may contain several sentiment polarities across its context. Representing the text data as a multi-level text heterogeneous graph lets the data express the structural relations inside the text. The graph-form text vectors are sent to the GAT network for training, and the resulting vector sequence is spliced and fused with the whole-sentence semantic vector output by BERT, so the final output vector fully combines the text's internal structure with the contextual semantics of the text, effectively improving classification performance.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a model structure diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In a short-text sentiment classification task, the overall sentiment tendency of the input short text must be judged, and existing sentiment analysis models readily classify texts with a single sentiment tendency correctly. In practice, however, a text often contains several sentiment tendencies; in comment data in particular, a user may praise and criticize in the same review, for example "the price is a bit high, the service is average, the view is scenic, and the breakfast is good." In such complex sentiment situations, directly modeling the whole-sentence text, as existing models do, makes the sentiment harder to distinguish and degrades the analysis. The invention provides a sentiment classification method and device based on a multi-level text heterogeneous graph, and an electronic device, to classify the sentiment of short texts.
The invention provides a sentiment classification method based on a multi-level text heterogeneous graph. A BERT-GAT model is constructed, comprising a BERT model, a GAT layer, a self-attention module and a classification module. As shown in FIG. 1, the method comprises the following steps:
S1, acquiring a whole-sentence text, inputting it into the BERT model to construct a multi-level text heterogeneous graph over the whole sentence, its clauses and its words, and outputting a whole-sentence feature representation carrying whole-sentence semantic information;
S2, training the multi-level text heterogeneous graph with the GAT layer, and extracting the feature representation of the text's internal structural information from the graph;
S3, fusing the internal-structure feature representation with the whole-sentence feature representation (i.e., the representation of the whole sentence's contextual semantic information), and inputting the fusion result into the self-attention module to compute a self-attention parameter;
S4, multiplying the self-attention parameter by the fusion result to obtain a fused vector, and inputting the fused vector into the classification module to obtain the classification result.
Preferably, as shown in FIG. 2, the training process of the BERT-GAT model comprises:
S11, inputting a whole-sentence text S into the BERT model and constructing a multi-level heterogeneous graph over the whole sentence, its clauses and its words according to the structure of S, while the BERT model outputs the hidden state at the whole sentence's [CLS] position, which represents the whole sentence's contextual information;
S12, sending the multi-level text heterogeneous graph into the GAT network for training, where the GAT network extracts the overall feature representation of the text's internal structural information, i.e., a representation that fuses the structural features of the multi-level heterogeneous graph;
Specifically, the GAT network trains on text data in multi-level heterogeneous-graph form, and its output vector sequence carries the structural relations inside the text.
S13, splicing and fusing the [CLS] hidden state representing the whole sentence's contextual information with the overall feature representation of the text's internal structure, computing over the fusion result with a self-attention mechanism, and normalizing the output with a softmax layer to obtain a self-attention parameter;
S14, multiplying the self-attention parameter by the fusion result of S13 to obtain the fused vector representation, and classifying it through a linear layer with a softmax activation function to obtain the final prediction;
Specifically, multiplying the self-attention parameter by the fusion result of S13 yields an output weighted by the computed attention, so important information carries more weight and has a larger influence on the classification; for example, the word "happy" receives a larger parameter, so the probability of classifying the text as positive sentiment increases markedly.
S15, judging whether the maximum number of training iterations has been reached; if so, training ends, otherwise return to step S11.
Specifically, the model parameters (of the GAT network, the attention mechanism, the fully connected network, and so on) change continuously during training, the classification accuracy keeps improving, and the model is optimized through training and learning.
Each training run starts again from step S11 after the weight parameters are updated; each run is independent, and the training and test sets are randomly drawn from the data set, so across runs their composition partly varies and partly recurs.
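For illustration only, the following minimal Python sketch shows one way such a training loop could look. The BertGatModel class here is a hypothetical, trivial stand-in for the real BERT-GAT model of steps S11-S14 (so the sketch runs end to end); the toy data, learning rate and maximum of 20 iterations are likewise assumptions, not the patent's implementation.

    import torch
    from sklearn.model_selection import train_test_split

    class BertGatModel(torch.nn.Module):
        """Hypothetical stand-in for the full BERT-GAT model (steps S11-S14);
        a placeholder scorer so the training loop below is runnable."""
        def __init__(self, num_classes=2):
            super().__init__()
            self.fc = torch.nn.Linear(1, num_classes)
        def forward(self, text):
            # Placeholder featurization: sentence length as a single feature.
            return self.fc(torch.tensor([[float(len(text))]]))

    # Toy stand-ins for the review data set (assumed format: text, 0/1 label).
    texts = ["价格偏高，服务一般，早餐不错", "电池不耐用", "外面是风景", "手机还不错"]
    labels = [1, 0, 1, 1]

    model = BertGatModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
    loss_fn = torch.nn.CrossEntropyLoss()

    # A fresh random train/test split is drawn for each independent run,
    # as the embodiment notes.
    train_x, test_x, train_y, test_y = train_test_split(texts, labels, test_size=0.25)

    MAX_ITER = 20  # assumed maximum training count checked in step S15
    for epoch in range(MAX_ITER):
        model.train()
        for text, y in zip(train_x, train_y):
            logits = model(text)                       # steps S11-S14 for one sentence
            loss = loss_fn(logits, torch.tensor([y]))
            optimizer.zero_grad()
            loss.backward()                            # update GAT, attention, linear layers
            optimizer.step()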
In one embodiment, the data sets used are an e-commerce review data set and a hotel review data set. Compared with other short-text data sets, the sentiment polarity of review texts about products or services is relatively complex: text such as "the phone is quite good, but the battery does not last," containing different sentiment polarities, appears frequently. The texts in such data sets are, however, comparatively well formed; compared with review data from social platforms or self-media (microblogs, Twitter, etc.), they contain fewer emoticons, internet slang terms and special symbols, and the texts are generally longer, which makes them well suited to constructing and learning the multi-level text heterogeneous graph.
In particular, other social-media data sets, such as microblog data sets, are also suitable for the method of the invention.
The main ideas for constructing the input text S into a multi-level text heterogeneous graph are as follows:
The multi-level text heterogeneous graph consists of three node types: the whole-sentence node, clause nodes and word nodes. The whole-sentence node represents the whole sentence, clause nodes represent the clauses contained in the whole sentence, and word nodes represent the words in the whole sentence.
The text S is divided into several clauses and several words; S itself serves as the whole-sentence node, the whole-sentence node contains N clause nodes, and each clause node contains M word nodes.
Whether an edge exists between two nodes is determined by dependency: every clause is contained in the whole sentence, so there is an edge between each clause and the whole sentence, while an edge between a word and a clause exists only if the word belongs to that clause.
The nodes of the multi-level text heterogeneous graph are represented as follows: a word node obtains its initial representation by indexing the word vectors in BERT; a clause node is mapped to an unused token in BERT and indexed to obtain its initial representation; and the whole-sentence node takes the hidden-state representation at the [CLS] position of BERT's last layer as its initial representation.
For constructing the heterogeneous graph, the essential steps are building the adjacency matrix that encodes the relations between nodes and defining the initial feature representations of the nodes.
In one embodiment, the process of constructing the adjacency matrix includes:
First, an initial node index list is defined. The maximum sequence length of the whole sentence is set to 100, the maximum clause length to 20, and the maximum number of clauses to 10, so the maximum number of nodes, the product of the maximum clause length and the maximum clause count, is 200. The node index list is laid out as: main sentence + clauses + words. The main-sentence and clause entries can be defined up front, whereas the word entries depend on the specific text, so the index table for the main sentence and clauses is built first. The main-sentence node is fixed; clause nodes can be predefined, independent of content, directly on tokens unused by BERT (the invention only needs the structural relations between clause nodes and word nodes); word nodes are resolved case by case by indexing the BERT vocabulary according to the current text.
A node_name list is defined to store the node indices and an edge_index list to store the edges. Whenever an edge exists between two nodes (corresponding to a word-clause or clause-main-sentence dependency in the text), their positions in node_name are looked up and the edge is appended to edge_index.
After edge_index is obtained, it is converted into an adjacency matrix of size 200 by 200 whose rows and columns correspond to the nodes in node_name: when an edge exists between two nodes, the corresponding entry of the adjacency matrix is 1; when no edge exists, the entry is 0.
For example, when an edge connects node 8 and node 14, the elements at row 8, column 14 and row 14, column 8 of the adjacency matrix are both 1.
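For illustration, a minimal Python sketch of this construction follows; the function name build_graph, the clause delimiters, and the character-level word split are assumptions for illustration, not the patent's code.

    import re
    import numpy as np

    MAX_CLAUSES, MAX_CLAUSE_LEN = 10, 20
    MAX_NODES = MAX_CLAUSES * MAX_CLAUSE_LEN   # 200, as in the embodiment

    def build_graph(sentence):
        """Build node_name (main sentence + clauses + words) and the
        200-by-200 adjacency matrix described above."""
        # Split the whole sentence into clauses on Chinese punctuation (assumed).
        clauses = [c for c in re.split("[，。！？；]", sentence) if c][:MAX_CLAUSES]

        node_name = ["<S>"]                                     # node 0: main sentence
        node_name += [f"<C{i}>" for i in range(len(clauses))]   # clause placeholders
        edge_index = [(0, i + 1) for i in range(len(clauses))]  # clause-main-sentence edges

        # Word nodes: each character is treated as one word node here (assumption);
        # a word is connected only to the clause that contains it.
        for i, clause in enumerate(clauses):
            for ch in clause[:MAX_CLAUSE_LEN]:
                node_name.append(ch)
                edge_index.append((i + 1, len(node_name) - 1))

        # Symmetric 0/1 adjacency matrix over the fixed maximum node count.
        adj = np.zeros((MAX_NODES, MAX_NODES), dtype=np.int64)
        for u, v in edge_index:
            adj[u, v] = adj[v, u] = 1   # e.g. row 8, column 14 and row 14, column 8
        return node_name, edge_index, adj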
In one embodiment, the graph attention network continuously updates the node features during learning; the initial definitions are obtained from the BERT vocabulary. Defining the initial feature representations of the different node types in the multi-level text heterogeneous graph with the BERT vocabulary comprises:
For the main-sentence node, the hidden state at the [CLS] position of BERT's last layer represents the whole sentence best, so it is used as the initial main-sentence node feature; when constructing the whole-sentence node feature, a list node_fea is defined whose first element is 101, the index of [CLS] in the BERT vocabulary. Specifically, BERT automatically prepends a [CLS] symbol when pre-training for text classification tasks, and the corresponding output vector is a semantic representation of the whole text, so it serves well as the whole-sentence node.
For clause nodes, placeholders for 10 clauses are predefined in the vocabulary corresponding to BERT (see bert-base-chinese/vocab.txt), so the initial clause representations are obtained directly when the BERT vocabulary is indexed. The initial clause representations are taken from tokens unused by BERT because the text's structural information is introduced by the graph: it suffices to know which words belong to the first clause, which to the second, and so on, so one learnable vector per clause is enough.
For word nodes, the initial feature representations are obtained by directly querying the BERT vocabulary.
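For illustration, a minimal sketch of this initialization using the Hugging Face transformers library and bert-base-chinese follows; this is a plausible rendering rather than the patent's code, and in the actual model the clause placeholder vectors would remain trainable rather than being computed under no_grad.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")
    embed = bert.get_input_embeddings()    # BERT word-embedding table

    def initial_node_features(sentence, clauses):
        with torch.no_grad():
            # Main-sentence node: last-layer hidden state at the [CLS] position
            # ([CLS] has index 101 in the bert-base-chinese vocabulary).
            enc = tokenizer(sentence, return_tensors="pt")
            cls_state = bert(**enc).last_hidden_state[:, 0]          # (1, 768)

            # Clause nodes: reserved vocabulary entries [unused1] .. [unused10]
            # serve as one placeholder vector per clause slot.
            clause_ids = tokenizer.convert_tokens_to_ids(
                [f"[unused{i + 1}]" for i in range(len(clauses))])
            clause_feats = embed(torch.tensor(clause_ids))           # (N, 768)

            # Word nodes: plain vocabulary lookups for the characters of each clause.
            word_ids = tokenizer.convert_tokens_to_ids(
                [ch for clause in clauses for ch in clause])
            word_feats = embed(torch.tensor(word_ids))               # (M_total, 768)

        return torch.cat([cls_state, clause_feats, word_feats], dim=0)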
With the multi-level text heterogeneous graph and the initial node representations prepared, they are sent into the GAT network for feature extraction. The main module of GAT is a multi-head attention mechanism, computed as:
$$\vec{h}_i^{(t)} = \mathop{\Vert}_{k=1}^{K} \sigma\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} \vec{h}_j \right)$$

where $\vec{h}_i^{(t)}$ denotes the state of node $i$ at the current time, $\alpha_{ij}^{k}$ denotes the normalized weight parameter of the $k$-th attention mechanism, $W^{k}$ is a parameter matrix, $\Vert$ denotes the concatenation (splicing) operation, $k$ indexes the $k$-th attention head, $K$ is the total number of attention heads, $\mathcal{N}_i$ is the neighbourhood of node $i$ in the graph, and $\sigma$ is a nonlinear activation.
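For illustration, a compact PyTorch sketch of one such multi-head GAT layer follows. It is a standard GAT formulation matching the formula above, not the patent's exact code; the head count, the LeakyReLU slope of 0.2, the ELU activation, and the assumption that the adjacency matrix contains self-loops (so no node has an empty neighbourhood) are conventional choices.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHeadGATLayer(nn.Module):
        """One GAT layer: each head aggregates neighbours with normalized
        attention weights, then the K head outputs are concatenated."""
        def __init__(self, in_dim, out_dim, num_heads=8):
            super().__init__()
            self.K, self.out = num_heads, out_dim
            self.W = nn.Parameter(torch.empty(num_heads, in_dim, out_dim))  # W^k
            self.a = nn.Parameter(torch.empty(num_heads, 2 * out_dim))      # attention vector a^k
            nn.init.xavier_uniform_(self.W)
            nn.init.xavier_uniform_(self.a)

        def forward(self, h, adj):
            # h: (N, in_dim) node states; adj: (N, N) 0/1 adjacency matrix
            # (assumed to include self-loops so every softmax row is defined).
            N = h.size(0)
            Wh = torch.einsum("nd,kdo->kno", h, self.W)                 # (K, N, out)
            # e_ij^k = LeakyReLU(a^k . [W^k h_i || W^k h_j])
            src = torch.einsum("kno,ko->kn", Wh, self.a[:, :self.out])
            dst = torch.einsum("kno,ko->kn", Wh, self.a[:, self.out:])
            e = F.leaky_relu(src.unsqueeze(2) + dst.unsqueeze(1), 0.2)  # (K, N, N)
            e = e.masked_fill(adj.unsqueeze(0) == 0, float("-inf"))     # keep real edges only
            alpha = torch.softmax(e, dim=-1)                            # normalized weights
            out = torch.einsum("kij,kjo->kio", alpha, Wh)               # neighbour aggregation
            # h_i' = ||_{k=1..K} sigma( sum_j alpha_ij^k W^k h_j )
            return F.elu(out).permute(1, 0, 2).reshape(N, self.K * self.out)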
A traditional neural network, when processing graph-structured data, must order the nodes and edges in some sequence before further processing, which destroys the original graph structure. A graph neural network, by contrast, captures the dependency relations among the elements of the graph through the information passed between nodes, and training does not damage the graph structure. The multi-level text heterogeneous graph is therefore sent to the GAT network for learning once its construction is complete, an effect that CNNs and RNNs cannot achieve.
Preferably, the feature representation of the text's internal structure output by the GAT (which at this point fuses the structural features of the multi-level heterogeneous graph) is directly spliced with the [CLS] hidden state from BERT that carries the whole sentence's contextual semantic information. The fused data are processed with a self-attention mechanism, and the output is normalized with a softmax layer to obtain the self-attention parameter. Multiplying this parameter by the fused GAT-BERT vector yields an output vector that fully fuses BERT's whole-sentence semantic vector with the GAT network's structure-aware semantic vector, which greatly improves the model's classification accuracy. A dropout operation then discards some neurons to prevent overfitting, and the final output is classified through a linear layer to obtain the final prediction.
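For illustration, a minimal PyTorch sketch of this fusion and classification head follows; the module name, the dimensions and the dropout rate are assumptions, not the patent's code.

    import torch
    import torch.nn as nn

    class FusionClassifier(nn.Module):
        """Splice BERT's [CLS] representation with the GAT structural
        representation, re-weight the splice with a softmax-normalized
        self-attention parameter, then classify via dropout and a linear layer."""
        def __init__(self, bert_dim=768, gat_dim=768, num_classes=2, p_drop=0.5):
            super().__init__()
            fused_dim = bert_dim + gat_dim
            self.attn = nn.Linear(fused_dim, fused_dim)  # scores for the self-attention parameter
            self.drop = nn.Dropout(p_drop)               # discard some neurons against overfitting
            self.fc = nn.Linear(fused_dim, num_classes)

        def forward(self, cls_state, gat_repr):
            fused = torch.cat([cls_state, gat_repr], dim=-1)   # splice the two views
            alpha = torch.softmax(self.attn(fused), dim=-1)    # self-attention parameter
            weighted = alpha * fused                           # multiply parameter by fusion result
            return self.fc(self.drop(weighted))                # logits

A softmax over these logits (or a cross-entropy loss during training) then yields the class probabilities, matching the linear-layer-plus-softmax classification described above.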
A classification apparatus based on a multi-level text heterogeneous graph comprises:
a text processing module for dividing an acquired Chinese short text into a whole sentence, several clauses and several words;
a BERT model for taking the data of the text processing module, constructing a multi-level text heterogeneous graph over the whole sentence, its clauses and its words, and outputting a whole-sentence feature representation carrying whole-sentence semantic information;
a GAT network for extracting the text's internal structural information from the multi-level text heterogeneous graph;
a fusion module for fusing the whole-sentence feature representation with the internal structural information, inputting the fusion result into an attention mechanism to compute an attention parameter, and multiplying the attention parameter by the fusion result to obtain a fused vector representation;
and a classification module for obtaining the classification result of the Chinese short text from the fused vector representation of the fusion module.
An electronic device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the steps of the classification method based on a multi-level text heterogeneous graph provided by the invention.
The feasibility of the proposed model is verified through comparative experiments. The following six comparative setups were used:
BERT: process the whole-sentence text with the BERT model only, and use the training result directly for classification;
TextGAT: construct the multi-level text heterogeneous graph from the BERT vocabulary only, train it with the GAT network, and use the result directly for classification;
TextCNN: first train with the BERT model, feed the output into a TextCNN network for learning, and classify the result through a linear layer;
BiGRU: first train with the BERT model, feed the output into a BiGRU network for learning, and classify the result through a linear layer;
BiLSTM: first train with the BERT model, feed the output into a BiLSTM network for learning, and classify the result through a linear layer;
BERT-GAT: the model of the invention; construct the multi-level text heterogeneous graph from the BERT vocabulary, send it into the GAT network for training, train on the current text data set with the BERT model at the same time, fuse the two outputs with a self-attention mechanism, and send the fused result into a linear layer for classification.
TABLE 1 comparative results
(Table 1 appears as an image in the original publication; the per-model accuracy, precision, recall and F1 values are not recoverable here.)
As the comparison in Table 1 shows, training with the BERT or GAT network alone cannot fully combine the whole sentence's contextual information with the internal sentence-structure information, while traditional neural networks model the whole-sentence text directly and therefore face greater difficulty when processing text data with more complex sentiment polarity.
The evaluation metrics commonly used in NLP tasks are accuracy, precision, recall and the F1 value.
Accuracy is the proportion of correct judgments among all judgments, i.e., the proportion of correctly predicted samples in the total sample set:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision is the proportion of correctly predicted samples among all samples predicted as positive:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall is the proportion of correctly predicted positive samples among all actually positive samples:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

The F1 value is a combined evaluation metric of precision and recall; the larger, the better:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

In the four formulas, TP (true positives) denotes positive samples correctly predicted as positive, FP (false positives) denotes negative samples incorrectly predicted as positive, FN (false negatives) denotes positive samples incorrectly predicted as negative, and TN (true negatives) denotes negative samples correctly predicted as negative.
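For illustration, these four metrics can be computed directly from the confusion-matrix counts; the example counts below are made up.

    def classification_metrics(tp, fp, fn, tn):
        """Accuracy, precision, recall and F1 from the four counts above."""
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return accuracy, precision, recall, f1

    # Example with assumed counts: TP=80, FP=10, FN=20, TN=90
    print(classification_metrics(80, 10, 20, 90))   # (0.85, 0.888..., 0.8, 0.842...)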
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A classification method based on a multi-level text heterogeneous graph, characterized in that a BERT-GAT model is constructed, the model comprising a BERT model, a GAT layer, a self-attention module and a classification module, and the classification method comprises the following steps:
S1, acquiring a whole-sentence text, inputting it into the BERT model to construct a multi-level text heterogeneous graph over the whole sentence, its clauses and its words, and outputting a whole-sentence feature representation carrying whole-sentence semantic information;
S2, training the multi-level text heterogeneous graph with the GAT layer, and extracting an overall feature representation of the structural relations of the whole-sentence text in the graph;
S3, fusing the whole-sentence feature representation with the overall feature representation, and inputting the fusion result into the self-attention module to compute a self-attention parameter;
S4, multiplying the self-attention parameter by the fusion result to obtain a fused vector, and inputting the fused vector into the classification module to obtain the classification result.
2. The classification method based on a multi-level text heterogeneous graph according to claim 1, wherein constructing an adjacency matrix according to the structure of the whole-sentence/clause/word multi-level text heterogeneous graph comprises:
dividing the whole-sentence text into several clauses and several words, where the whole-sentence text serves as the whole-sentence node, the whole-sentence node contains N clause nodes, and each clause node contains M word nodes;
and determining the edges between nodes according to the dependency relations between words and clauses and between clauses and the whole sentence, and constructing the adjacency matrix once the edges are determined.
3. The classification method based on a multi-level text heterogeneous graph according to claim 1, wherein defining the initial feature representations of the different node types in the multi-level text heterogeneous graph with the BERT vocabulary comprises:
for a word node, obtaining the initial word representation by indexing the BERT vocabulary;
for a clause node, mapping it to an unused token in the BERT vocabulary and indexing that token to obtain the initial clause representation;
and for the whole-sentence node, taking the hidden-state representation at the [CLS] position of BERT's last layer as the initial whole-sentence representation.
4. The classification method based on a multi-level text heterogeneous graph according to claim 1, wherein the main module of the GAT network is a multi-head attention mechanism, the multi-level text heterogeneous graph is trained by the multi-head attention mechanism, and the mechanism is expressed as:
$$\vec{h}_i^{(t)} = \mathop{\Vert}_{k=1}^{K} \sigma\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} \vec{h}_j \right)$$

where $\vec{h}_i^{(t)}$ denotes the state of node $i$ at the current time, $\alpha_{ij}^{k}$ denotes the normalized weight parameter of the $k$-th attention mechanism, $W^{k}$ is a parameter matrix, $\Vert$ denotes the concatenation (splicing) operation, $k$ indexes the $k$-th attention head, $K$ is the total number of attention heads, $\mathcal{N}_i$ is the neighbourhood of node $i$ in the graph, and $\sigma$ is a nonlinear activation.
5. A classification apparatus based on a multi-level text heterogeneous graph, characterized by comprising:
a text processing module for dividing an acquired Chinese short text into a whole sentence, several clauses and several words;
a BERT model for taking the data of the text processing module, constructing a multi-level text heterogeneous graph over the whole sentence, its clauses and its words, and outputting the whole-sentence feature representation;
a GAT network for extracting the text's internal structural information from the multi-level text heterogeneous graph;
a fusion module for fusing the internal structural information with the whole-sentence feature representation, inputting the fusion result into an attention mechanism to compute an attention parameter, and multiplying the attention parameter by the fusion result to obtain a fused vector representation;
and a classification module for obtaining the classification result of the Chinese short text from the fused vector representation of the fusion module.
6. The classification apparatus based on a multi-level text heterogeneous graph according to claim 5, wherein the main module of the GAT network is a multi-head attention mechanism, the multi-level text heterogeneous graph is trained by the multi-head attention mechanism, and the mechanism is expressed as:
$$\vec{h}_i^{(t)} = \mathop{\Vert}_{k=1}^{K} \sigma\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} \vec{h}_j \right)$$

where $\vec{h}_i^{(t)}$ denotes the state of node $i$ at the current time, $\alpha_{ij}^{k}$ denotes the normalized weight parameter of the $k$-th attention mechanism, $W^{k}$ is a parameter matrix, $\Vert$ denotes the concatenation (splicing) operation, $k$ indexes the $k$-th attention head, $K$ is the total number of attention heads, $\mathcal{N}_i$ is the neighbourhood of node $i$ in the graph, and $\sigma$ is a nonlinear activation.
7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the classification method based on a multi-level text heterogeneous graph according to any one of claims 1 to 4.
CN202210079337.0A 2022-01-24 2022-01-24 Classification method and device based on multi-level text heterogeneous graph and electronic device Pending CN114443846A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210079337.0A CN114443846A (en) 2022-01-24 2022-01-24 Classification method and device based on multi-level text heterogeneous graph and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210079337.0A CN114443846A (en) 2022-01-24 2022-01-24 Classification method and device based on multi-level text heterogeneous graph and electronic device

Publications (1)

Publication Number Publication Date
CN114443846A true CN114443846A (en) 2022-05-06

Family

ID=81370021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210079337.0A Pending CN114443846A (en) 2022-01-24 2022-01-24 Classification method and device based on multi-level text heterogeneous graph and electronic device

Country Status (1)

Country Link
CN (1) CN114443846A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725961A (en) * 2024-02-18 2024-03-19 智慧眼科技股份有限公司 Medical intention recognition model training method, medical intention recognition method and equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination