CN115080689B - Hidden space data enhanced multi-label text classification method based on fusion label association

Hidden space data enhanced multi-label text classification method based on fusion label association

Info

Publication number
CN115080689B
CN115080689B (application CN202210679320.9A)
Authority
CN
China
Prior art keywords
data
label
text
tag
batch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210679320.9A
Other languages
Chinese (zh)
Other versions
CN115080689A (en)
Inventor
线岩团
苗育华
王红斌
文永华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210679320.9A
Publication of CN115080689A
Application granted
Publication of CN115080689B
Legal status: Active

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/3332 - Query translation
    • G06F16/3335 - Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hidden-space data-enhanced multi-label text classification method that fuses label associations. The method encodes data in batches, trains the encoder with a bidirectional LSTM and attention, mines prior knowledge from the label list, and finally matches the encoded data with the mined label prior knowledge in the hidden space to construct a batch of virtual data, on which the multi-label text model is further trained to complete multi-label text classification. Compared with other deep learning models, the method performs better on the main evaluation metric micro_F1: it reaches 72.08%, an improvement of 5.18, 3.28, and 2.38 percentage points over the traditional machine learning methods BR, CC, and LP respectively, and of 3.78, 2.38, and 1.08 percentage points over the neural network models LSTM, CNN-RNN, and SGM.

Description

Hidden space data enhanced multi-label text classification method based on fusion label association
Technical Field
The invention relates to a hidden-space data-enhanced multi-label text classification method that fuses label associations, and belongs to the technical field of natural language processing.
Background
Text classification is an important and classical problem in natural language processing: assigning texts to categories according to certain rules. Identifying categories in huge volumes of text with traditional manual classification or probability-statistics methods consumes enormous resources. As data volumes grow rapidly, each category requires finer-grained division, and a single sample may be related to several categories, so traditional single-label text classification no longer meets expectations. Research on multi-label text classification has therefore developed. Multi-label text classification is a subtask of text classification: it selects specific labels from a label set and assigns each instance the subset of most relevant class labels. Multi-label classification has many practical applications in real life. When facing news data carrying multiple labels, a multi-label text classification method can accurately locate the categories of public opinion by processing the topics. The task also applies to product tag classification in e-commerce, text annotation in biomedicine, category label classification in Wikipedia, and so on.
Compared with single-label classification, multi-label classification is better suited to real life and matches the characteristics and rules of objective objects. However, in real text the number of label categories is very large, and some labels have little related content, which causes severe label imbalance, while the label output space grows exponentially with the number of labels. For all multi-label text classification problems, when finer-grained label classification is required, the growing number of labels and label imbalance remain open problems. Existing methods often ignore the correlations among labels and only consider the influence of different labels on the same text, and therefore fail to mine the relations among the multiple labels of a text. A hidden-space data-enhanced multi-label text classification method that fuses label associations is therefore proposed.
Disclosure of Invention
The invention aims to provide a hidden-space data-enhanced multi-label text classification method that fuses label associations. Data are encoded in batches and trained with a bidirectional LSTM and attention; prior knowledge is mined from the label list; finally, the encoded data are matched with the mined label prior knowledge in the hidden space to construct a batch of virtual data, and the multi-label text model is further trained on them to complete multi-label text classification.
In order to achieve the technical purpose and the technical effect, the invention is realized by the following technical scheme:
The hidden-space data-enhanced multi-label text classification method fusing label associations comprises: preprocessing the dataset and the label relations to mine prior knowledge of the labels; constructing a multi-label text classification model based on an attention mechanism; matching the label prior knowledge with the existing data and transforming the associated ("contact") data into a batch of new virtual data in the hidden space; and further training the multi-label text model to complete multi-label text classification.
Further, the hidden-space data-enhanced multi-label text classification method fusing label associations comprises the following steps:
S1: preprocess the data and labels in the dataset with a Python program, handling stop words and labels in the texts so that each text and its labels are stored row by row in a csv file;
Count the labels and texts involved, calculate how many times each pair of labels co-occurs, and mine the training data to obtain prior knowledge of the various label relations;
S2: perform word embedding and encoding on the texts in turn; meanwhile, with the help of the prior knowledge, mine the contact data corresponding to each text in the original training batch to expand the data in the batch, and extract, through the attention layers, the features of the texts in the batch and the label-related text features;
S3: cross-fuse the mined label prior knowledge with the text features, so that the label features and text features of the contact data are transformed into a batch of virtual data in the hidden space;
S4: modify the original cross-entropy loss function, feed the enhanced data and the original data into the multi-label classification model for training, combine the loss on the virtual data in the hidden space with the loss on the original data at a certain ratio, and continuously refine the classification model to obtain the multi-label text classification results.
Further, S1 downloads the public AAPD original dataset from the web and preprocesses it. Given an example set of samples $\{S_1, S_2, S_3, S_4\}$ and their label representations under the label space $\{L_1, L_2, L_3, L_4\}$, the label co-occurrence matrix is obtained by counting how many times each pair of labels occurs together, with a label's influence on itself set to 0; the inter-label score matrix $L$ for the samples is then obtained by normalizing the rows of this matrix.
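For illustration, a minimal Python sketch of this co-occurrence counting and row normalization might look as follows (the helper name and the toy samples mirroring $\{S_1, \dots, S_4\}$ over $\{L_1, \dots, L_4\}$ are assumptions, not taken from the patent):

```python
import numpy as np

def label_score_matrix(label_sets, num_labels):
    """Count pairwise label co-occurrences and row-normalize them into the
    score matrix L described in S1. Illustrative helper, not from the patent."""
    C = np.zeros((num_labels, num_labels))
    for labels in label_sets:              # indices of one sample's labels
        for i in labels:
            for j in labels:
                if i != j:                 # a label's influence on itself is 0
                    C[i, j] += 1
    row_sums = C.sum(axis=1, keepdims=True)
    return np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)

# Toy example: four samples {S1..S4} labelled over the space {L1..L4}
L = label_score_matrix([[0, 1], [1, 2], [0, 2, 3], [3]], num_labels=4)
```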
Further, S2 is the strategy for constructing the attention-based multi-label text classification model. First, before data are fed into the model, the amount of data in a training batch is fixed at 128, and the contact data corresponding to the original texts in the batch are mined with the prior knowledge, expanding the amount of data in the batch to 256;
Then the input text is word-embedded by the word embedding module to obtain embedded representations of labels and text words. The 100-dimensional GloVe word vectors released by Stanford University are downloaded and used: the words $\{w_1, w_2, \dots, w_n\}$ in a text are converted, through the word embedding matrix and the label embedding matrix, into word vector representations $x = \{x_1, x_2, \dots, x_n\}$, where $x_i$ is the word vector of the $i$-th word and is obtained via the embedding matrix $V \in \mathbb{R}^{k \times |w|}$, with $|w|$ the vocabulary size and $k$ the embedding dimension;
The text sequence $x$ is then read in both directions with the bidirectional LSTM, and the hidden representation of each word is computed as

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h_{i-1}}, x_i), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h_{i+1}}, x_i)$$

Concatenating the hidden states of the two directions gives the final hidden representation of the $i$-th word, $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, which contains sequence information centered on the $i$-th word;
In the attention layers, the contextual features of each word are extracted with 4-head self-attention. Given the matrix of sequence hidden vectors $H \in \mathbb{R}^{n \times d}$, a single self-attention head projects $H$ onto three different matrices, the query matrix $Q$, the key matrix $K$, and the value matrix $V$:

$$Q, K, V = HW^Q, \; HW^K, \; HW^V$$

Scaled dot-product attention is then used to obtain the output representation

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
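To make the S2 encoder concrete, here is a minimal PyTorch sketch (PyTorch itself, the hidden size of 256, and the class name are assumptions; the patent only specifies 100-dimensional GloVe embeddings, a bidirectional LSTM, and 4 attention heads):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the S2 encoder: embeddings sized for 100d GloVe, a
    bidirectional LSTM, and 4-head self-attention over the hidden states."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=256, num_heads=4):
        super().__init__()
        # In practice the weights would be initialized from the 100d GloVe vectors
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        h, _ = self.bilstm(x)       # h_i concatenates the forward/backward states
        # Q, K, V are linear projections of H computed inside MultiheadAttention,
        # followed by scaled dot-product attention softmax(QK^T / sqrt(d_k)) V
        out, _ = self.attn(h, h, h)
        return out

encoder = TextEncoder(vocab_size=30000)
features = encoder(torch.randint(0, 30000, (256, 64)))  # expanded batch of 256
```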
Further, S3 cross-fuses the mined label prior knowledge with the text features: the original data and the contact data in the S2 batch are split apart, and data enhancement is performed on the 128 contact items fed in with each batch together with the label feature vectors;
Further, the data enhancement combines, at a certain proportion, the hidden-space representation $s_1 = \{h_1, h_2, \dots, h_n\}$ already obtained for the original data in the current batch with the hidden-space representation obtained for the corresponding contact data, yielding the text feature representation of the new virtual data. For the hidden-space label features of the virtual data, the positions where the original data and the virtual data agree on a label are kept; where they disagree, the label score matrix $L$ from S1 is queried to obtain the influence score of the current data's other labels on that label, and the label features of the virtual data are then constructed by random sampling from a Bernoulli distribution, finally producing virtual data based on the hidden space;
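A hedged sketch of this virtual-data construction, assuming multi-hot label vectors and a fixed mixing coefficient `alpha` (the function name and the exact sampling rule are illustrative, not taken from the patent):

```python
import torch

def make_virtual_batch(h_orig, h_assoc, y_orig, y_assoc, L_scores, alpha=0.5):
    """Mix original and contact hidden representations, keep the labels on
    which both agree, and Bernoulli-sample the disputed labels using the
    score matrix L from S1."""
    # Hidden-space interpolation of the text features
    h_virtual = alpha * h_orig + (1.0 - alpha) * h_assoc

    agree = (y_orig == y_assoc).float()          # 1 where the label sets agree
    # Influence scores of the current sample's active labels on each label
    probs = (y_orig @ L_scores).clamp(0.0, 1.0)
    sampled = torch.bernoulli(probs)             # random draw for disputed labels
    y_virtual = agree * y_orig + (1.0 - agree) * sampled
    return h_virtual, y_virtual
```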
Further, S4 modifies the original cross-entropy loss function; the loss function is

$$\mathcal{L} = \log\Big(1 + \sum_{i \in \Omega_{neg}} e^{s_i}\Big) + \log\Big(1 + \sum_{j \in \Omega_{pos}} e^{-s_j}\Big)$$

where $\Omega_{neg}$ and $\Omega_{pos}$ denote the negative and positive class sets of the sample, $s_i$ is the score of the $i$-th non-target class, $s_j$ is the score of the $j$-th target class, and an additional class-0 score $s_0$ fixed at 0 serves as the threshold.
Finally, the implicit representations of the original data and the virtual data are taken as the final hidden states of the model, giving the text feature representation of the text itself: $c = \{c_1, c_2, \dots, c_k\}$;
Further, in S4 the enhanced data and the original data are fed into the multi-label classification model for training, with the formula

$$\mathcal{L}_{total} = \mathcal{L}_{original} + \lambda\, \mathcal{L}_{virtual}$$

obtaining the multi-label text classification results; the loss $\mathcal{L}_{virtual}$ on the virtual data in the hidden space and the loss $\mathcal{L}_{original}$ on the original data are combined through the ratio $\lambda$.
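A PyTorch sketch of the modified loss and the λ-weighted combination (the logsumexp formulation and the tensor shapes are assumptions consistent with the formula above; the value of λ is illustrative):

```python
import torch

def multilabel_loss(scores, targets):
    """log(1 + sum_neg e^{s_i}) + log(1 + sum_pos e^{-s_j}), with the extra
    class-0 score s_0 = 0 acting as the decision threshold."""
    neg = scores.masked_fill(targets.bool(), float('-inf'))      # keep non-target s_i
    pos = (-scores).masked_fill(~targets.bool(), float('-inf'))  # keep -s_j for targets
    zero = torch.zeros_like(scores[..., :1])                     # the s_0 = 0 term
    return (torch.logsumexp(torch.cat([zero, neg], dim=-1), dim=-1)
            + torch.logsumexp(torch.cat([zero, pos], dim=-1), dim=-1)).mean()

# Combining the original-data and virtual-data losses at the ratio lambda
lam = 0.5  # illustrative; the patent does not fix the value
scores_orig, y_orig = torch.randn(4, 54), torch.randint(0, 2, (4, 54)).float()
scores_virt, y_virt = torch.randn(4, 54), torch.randint(0, 2, (4, 54)).float()
loss = multilabel_loss(scores_orig, y_orig) + lam * multilabel_loss(scores_virt, y_virt)
```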
The invention has the beneficial effects that:
The hidden-space data-enhanced multi-label text classification method fusing label associations encodes data in batches, trains with a bidirectional LSTM and attention, mines prior knowledge from the label list, and finally matches the encoded data with the mined label prior knowledge in the hidden space to construct a batch of virtual data, on which the multi-label text model is further trained to complete multi-label text classification;
The method captures the relations between the label structures of all texts, and the hidden-space data enhancement effectively alleviates the label imbalance problem in multi-label classification;
Compared with other deep learning models, the method performs better on the main evaluation metric micro_F1: it reaches 72.08%, an improvement of 5.18, 3.28, and 2.38 percentage points over the traditional machine learning methods BR, CC, and LP respectively, and of 3.78, 2.38, and 1.08 percentage points over the neural network models LSTM, CNN-RNN, and SGM.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
FIG. 1 is a model diagram of the hidden-space data-enhanced multi-label text classification method fusing label associations according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating the preprocessing of the dataset and label relations according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the cross-fusion of the mined label prior knowledge with the text features according to an embodiment of the present invention.
Detailed Description
In order to describe the technical solutions of the embodiments of the present invention more clearly, the embodiments are described in detail below with reference to the accompanying drawings.
The hidden-space data-enhanced multi-label text classification method fusing label associations comprises: preprocessing the dataset and the label relations to mine prior knowledge of the labels; constructing a multi-label text classification model based on an attention mechanism; matching the label prior knowledge with the existing data and transforming the associated ("contact") data into a batch of new virtual data in the hidden space; and further training the multi-label text model to complete multi-label text classification.
Example 1
As shown in FIG. 1, the specific implementation steps of the hidden-space data-enhanced multi-label text classification method fusing label associations are as follows:
S1: preprocess the data and labels in the dataset with a Python program, handling stop words and labels in the texts so that each text and its labels are stored row by row in a csv file;
Count the labels and texts involved, calculate how many times each pair of labels co-occurs, and mine the training data to obtain prior knowledge of the various label relations;
The public AAPD original dataset is downloaded from the web for the research; it contains the abstracts and corresponding subjects of 55,840 papers in the field of computer science, with 54 class labels in total. The specific dataset split is shown in Table 1.
TABLE 1 Experimental dataset split (unit: entries)
Data preprocessing is implemented with a Python program: stop words and labels in the texts are handled so that each text and its labels are stored row by row in a csv file;
As shown in FIG. 2, S1 downloads the public AAPD original dataset from the web and preprocesses it. Given the example samples $\{S_1, S_2, S_3, S_4\}$ and their label representations under the label space $\{L_1, L_2, L_3, L_4\}$, the label co-occurrence matrix is obtained by counting how many times each pair of labels occurs together, with a label's influence on itself set to 0; the inter-label score matrix $L$ for the samples is then obtained by normalizing the rows of this matrix.
S2: perform word embedding and encoding on the texts in turn; meanwhile, with the help of the prior knowledge, mine the contact data corresponding to each text in the original training batch to expand the data in the batch, and extract, through the attention layers, the features of the texts in the batch and the label-related text features;
The strategy for constructing the attention-based multi-label text classification model is as follows: first, before data are fed into the model, the amount of data in a training batch is fixed at 128, and the contact data corresponding to the original texts in the batch are mined with the prior knowledge, expanding the amount of data in the batch to 256;
Then the input text is word-embedded by the word embedding module to obtain embedded representations of labels and text words. The 100-dimensional GloVe word vectors released by Stanford University are downloaded and used: the words $\{w_1, w_2, \dots, w_n\}$ in a text are converted, through the word embedding matrix and the label embedding matrix, into word vector representations $x = \{x_1, x_2, \dots, x_n\}$, where $x_i$ is the word vector of the $i$-th word and is obtained via the embedding matrix $V \in \mathbb{R}^{k \times |w|}$, with $|w|$ the vocabulary size and $k$ the embedding dimension;
The text sequence $x$ is then read in both directions with the bidirectional LSTM, and the hidden representation of each word is computed as

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h_{i-1}}, x_i), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h_{i+1}}, x_i)$$

Concatenating the hidden states of the two directions gives the final hidden representation of the $i$-th word, $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, which contains sequence information centered on the $i$-th word;
In the attention layers, the contextual features of each word are extracted with 4-head self-attention. Given the matrix of sequence hidden vectors $H \in \mathbb{R}^{n \times d}$, a single self-attention head projects $H$ onto three different matrices, the query matrix $Q$, the key matrix $K$, and the value matrix $V$:

$$Q, K, V = HW^Q, \; HW^K, \; HW^V$$

Scaled dot-product attention is then used to obtain the output representation

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
S3: cross-fuse the mined label prior knowledge with the text features, so that the label features and text features of the contact data are transformed into a batch of virtual data in the hidden space;
As a preferred solution of the present invention, in Step 3 the cross-fusion of the mined label prior knowledge with the text features is performed as shown in FIG. 3:
The original data and the contact data in the S2 batch are split apart, and data enhancement is performed on the 128 contact items fed in with each batch together with the label feature vectors;
Further, the data enhancement combines, at a certain proportion, the hidden-space representation $s_1 = \{h_1, h_2, \dots, h_n\}$ already obtained for the original data in the current batch with the hidden-space representation obtained for the corresponding contact data, yielding the text feature representation of the new virtual data. For the hidden-space label features of the virtual data, the positions where the original data and the virtual data agree on a label are kept; where they disagree, the label score matrix $L$ from S1 is queried to obtain the influence score of the current data's other labels on that label, and the label features of the virtual data are then constructed by random sampling from a Bernoulli distribution, finally producing virtual data based on the hidden space;
S4: modify the original cross-entropy loss function, feed the enhanced data and the original data into the multi-label classification model for training, combine the loss on the virtual data in the hidden space with the loss on the original data at a certain ratio, and continuously refine the classification model to obtain the multi-label text classification results.
S4 modifies the original cross-entropy loss function; the loss function is

$$\mathcal{L} = \log\Big(1 + \sum_{i \in \Omega_{neg}} e^{s_i}\Big) + \log\Big(1 + \sum_{j \in \Omega_{pos}} e^{-s_j}\Big)$$

where $\Omega_{neg}$ and $\Omega_{pos}$ denote the negative and positive class sets of the sample, $s_i$ is the score of the $i$-th non-target class, $s_j$ is the score of the $j$-th target class, and an additional class-0 score $s_0$ fixed at 0 serves as the threshold.
Finally, the implicit representations of the original data and the virtual data are taken as the final hidden states of the model, giving the text feature representation of the text itself: $c = \{c_1, c_2, \dots, c_k\}$;
Further, the enhanced data and the original data are fed into the multi-label classification model for training, with the formula

$$\mathcal{L}_{total} = \mathcal{L}_{original} + \lambda\, \mathcal{L}_{virtual}$$

obtaining the multi-label text classification results; the loss $\mathcal{L}_{virtual}$ on the virtual data in the hidden space and the loss $\mathcal{L}_{original}$ on the original data are combined through the ratio $\lambda$.
Example 2
Taking data imbalance in multi-label text classification as the problem to be addressed, a hidden-space data-enhanced multi-label text classification model fusing label associations was designed on the basis of the method of Example 1 and compared with the following baseline models:
BR (Binary Relevance): converts the multi-label classification task into several independent single-label classification problems by ignoring the correlations between labels and building a separate classifier for each label (a minimal sketch is given after this list).
CC (Classifier Chains): converts the multi-label classification task into a chain of binary classification problems in which each classifier conditions on the predictions of the previous ones, thereby taking higher-order label correlations into account; an incorrect early prediction, however, propagates to subsequent labels.
LP (Label Powerset): converts the multi-label problem into a multi-class problem by treating every unique label combination as one class and training a single multi-class classifier.
CNN-RNN: fuses a CNN and an RNN; word vectors are first fed into the CNN to obtain text feature sequences, and these features are then input into the RNN to obtain the predicted labels. The CNN and RNN capture global and local text semantics and model label correlations.
LSTM: applies a long short-term memory network to account for the sequential structure of the text and to alleviate exploding and vanishing gradients.
SGM: treats the multi-label classification task as a sequence generation problem, using the idea of generation to model the correlations between labels, with seq2seq as the multi-class classifier.
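As a hedged illustration of the simplest baseline only (not the patent's method), BR can be sketched with scikit-learn's OneVsRestClassifier, which trains one binary classifier per label; the toy texts and labels are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["graph neural networks for text", "convex optimization methods"]
labels = [["cs.LG", "cs.CL"], ["math.OC"]]

Y = MultiLabelBinarizer().fit_transform(labels)            # multi-hot targets
X = TfidfVectorizer().fit_transform(texts)                 # bag-of-words features
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)  # one classifier per label
preds = clf.predict(X)                                     # independent per-label decisions
```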
The results and analysis of the multi-label text classification experiments on the AAPD dataset (abstracts and corresponding topics of 55,840 computer science papers) are as follows:
Table 2 Multi-label text classification experimental results
Table 2 shows that, compared with other deep learning models, the proposed method performs better on the main evaluation metric micro_F1: it reaches 72.08%, an improvement of 5.18, 3.28, and 2.38 percentage points over the traditional machine learning methods BR, CC, and LP respectively, and of 3.78, 2.38, and 1.08 percentage points over the neural network models LSTM, CNN-RNN, and SGM.
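For reference, micro_F1 over multi-hot predictions can be computed with scikit-learn; the arrays below are toy values, illustrative only:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
micro_f1 = f1_score(y_true, y_pred, average="micro")  # pools TP/FP/FN over all labels
```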
The hidden-space data-enhanced multi-label text classification method fusing label associations encodes data in batches, trains with a bidirectional LSTM and attention, mines prior knowledge from the label list, and finally matches the encoded data with the mined label prior knowledge in the hidden space to construct a batch of virtual data, on which the multi-label text model is further trained to complete multi-label text classification;
The method captures the relations between the label structures of all texts, and the hidden-space data enhancement effectively alleviates the label imbalance problem in multi-label classification.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to help explain the invention. They are not exhaustive and do not limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (3)

1. The hidden space data enhanced multi-label text classification method based on fusion label association is characterized by: preprocessing the dataset and the label relations to mine prior knowledge of the labels; constructing a multi-label text classification model based on an attention mechanism; matching the label prior knowledge with the existing data and transforming the associated contact data into a batch of new virtual data in the hidden space; and further training the multi-label text model to complete multi-label text classification; the method specifically comprises the following steps:
S1: preprocess the data and labels in the dataset with a Python program, handling stop words and labels in the texts so that each text and its labels are stored row by row in a csv file; count the labels and texts involved, calculate how many times each pair of labels co-occurs, and mine the training data to obtain prior knowledge of the various label relations; the public AAPD original dataset is downloaded from the web and preprocessed; given the example samples $\{S_1, S_2, S_3, S_4\}$ and their label representations under the label space $\{L_1, L_2, L_3, L_4\}$, the label co-occurrence matrix is obtained by counting how many times each pair of labels occurs together, with a label's influence on itself set to 0, and the inter-label score matrix $L$ for the samples is then obtained by normalizing the rows of this matrix;
S2: perform word embedding and encoding on the texts in turn; meanwhile, with the help of the prior knowledge, mine the contact data corresponding to each text in the original training batch to expand the data in the batch, and extract, through the attention layers, the features of the texts in the batch and the label-related text features; in the strategy for constructing the attention-based multi-label text classification model, the amount of data in a training batch is first fixed at 128 before it is fed into the model, and the contact data corresponding to the original texts in the batch are mined with the prior knowledge, expanding the amount of data in the batch to 256;
Then the input text is word-embedded by the word embedding module to obtain embedded representations of labels and text words. The 100-dimensional GloVe word vectors released by Stanford University are downloaded and used: the words $\{w_1, w_2, \dots, w_n\}$ in a text are converted, through the word embedding matrix and the label embedding matrix, into word vector representations $x = \{x_1, x_2, \dots, x_n\}$, where $x_i$ is the word vector of the $i$-th word and is obtained via the embedding matrix $V \in \mathbb{R}^{k \times |w|}$, with $|w|$ the vocabulary size and $k$ the embedding dimension;
The text sequence $x$ is then read in both directions with the bidirectional LSTM, and the hidden representation of each word is computed as

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h_{i-1}}, x_i), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h_{i+1}}, x_i)$$

Concatenating the hidden states of the two directions gives the final hidden representation of the $i$-th word, $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, which contains sequence information centered on the $i$-th word;

In the attention layers, the contextual features of each word are extracted with 4-head self-attention. Given the matrix of sequence hidden vectors $H \in \mathbb{R}^{n \times d}$, a single self-attention head projects $H$ onto three different matrices, the query matrix $Q$, the key matrix $K$, and the value matrix $V$:

$$Q, K, V = HW^Q, \; HW^K, \; HW^V$$

Scaled dot-product attention is then used to obtain the output representation

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
S3: cross-fuse the mined label prior knowledge with the text features, so that the label features and text features of the contact data are transformed into a batch of virtual data in the hidden space; specifically, the original data and the contact data in the S2 batch are split apart, and data enhancement is performed on the 128 contact items fed in with each batch together with the label feature vectors; the data enhancement combines, at a certain proportion, the hidden-space representation $s_1 = \{h_1, h_2, \dots, h_n\}$ already obtained for the original data in the current batch with the hidden-space representation obtained for the corresponding contact data, yielding the text feature representation of the new virtual data; for the hidden-space label features of the virtual data, the positions where the original data and the virtual data agree on a label are kept, and where they disagree, the label score matrix $L$ from S1 is queried to obtain the influence score of the current data's other labels on that label, after which the label features of the virtual data are constructed by random sampling from a Bernoulli distribution, finally producing virtual data based on the hidden space;
S4: modify the original cross-entropy loss function, feed the enhanced data and the original data into the multi-label classification model for training, combine the loss on the virtual data in the hidden space with the loss on the original data at a certain ratio, and continuously refine the classification model to obtain the multi-label text classification results.
2. The hidden space data enhanced multi-label text classification method based on fusion label association of claim 1, characterized in that: S4 modifies the original cross-entropy loss function, the loss function being

$$\mathcal{L} = \log\Big(1 + \sum_{i \in \Omega_{neg}} e^{s_i}\Big) + \log\Big(1 + \sum_{j \in \Omega_{pos}} e^{-s_j}\Big)$$

where $\Omega_{neg}$ and $\Omega_{pos}$ denote the negative and positive class sets of the sample, $s_i$ is the score of the $i$-th non-target class, $s_j$ is the score of the $j$-th target class, and an additional class-0 score $s_0$ fixed at 0 serves as the threshold;
Finally, the implicit representations of the original data and the virtual data are taken as the final hidden states of the model, giving the text feature representation of the text itself: $c = \{c_1, c_2, \dots, c_k\}$.
3. The hidden space data enhanced multi-label text classification method based on fusion label association of claim 2, characterized in that: in S4, the enhanced data and the original data are fed into the multi-label classification model for training, with the formula

$$\mathcal{L}_{total} = \mathcal{L}_{original} + \lambda\, \mathcal{L}_{virtual}$$

obtaining the multi-label text classification results; the loss $\mathcal{L}_{virtual}$ on the virtual data in the hidden space and the loss $\mathcal{L}_{original}$ on the original data are combined through the ratio $\lambda$.
CN202210679320.9A 2022-06-15 2022-06-15 Hidden space data enhanced multi-label text classification method based on fusion label association Active CN115080689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210679320.9A CN115080689B (en) 2022-06-15 2022-06-15 Hidden space data enhanced multi-label text classification method based on fusion label association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210679320.9A CN115080689B (en) 2022-06-15 2022-06-15 Hidden space data enhanced multi-label text classification method based on fusion label association

Publications (2)

Publication Number Publication Date
CN115080689A CN115080689A (en) 2022-09-20
CN115080689B (en) 2024-05-07

Family

ID=83253870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210679320.9A Active CN115080689B (en) 2022-06-15 2022-06-15 Hidden space data enhanced multi-label text classification method based on fusion label association

Country Status (1)

Country Link
CN (1) CN115080689B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795037B (en) * 2022-12-26 2023-10-20 淮阴工学院 Multi-label text classification method based on label perception


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899253A (en) * 2015-05-13 2015-09-09 复旦大学 Cross-modality image-label relevance learning method facing social image
CN109376239A (en) * 2018-09-29 2019-02-22 山西大学 A kind of generation method of the particular emotion dictionary for the classification of Chinese microblog emotional
CN111695052A (en) * 2020-06-12 2020-09-22 上海智臻智能网络科技股份有限公司 Label classification method, data processing device and readable storage medium
CN113806645A (en) * 2020-06-12 2021-12-17 上海智臻智能网络科技股份有限公司 Label classification system and training system of label classification model
CN113312480A (en) * 2021-05-19 2021-08-27 北京邮电大学 Scientific and technological thesis level multi-label classification method and device based on graph convolution network
CN113626589A (en) * 2021-06-18 2021-11-09 电子科技大学 Multi-label text classification method based on mixed attention mechanism
CN113806547A (en) * 2021-10-15 2021-12-17 南京大学 Deep learning multi-label text classification method based on graph model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Multi-label topic model conditioned on label embedding; Lin Tang et al.; 2019 IEEE International Conference on Computer Science and Educational Informatization (CSEI); 2019-08-19; pp. 1-10 *
Semi-supervised sentiment classification method for question-answering text based on network representation; Chen Xiao, Li Yiwei, Liu Huan, Li Shoushan; Journal of Zhengzhou University (Natural Science Edition); 2020-05-22; Vol. 52, No. 2; pp. 52-58 *
Hidden space data enhanced multi-label text classification method fusing label association; Miao Yuhua et al.; Modern Electronics Technique; 2023-12-12; Vol. 46, No. 24; pp. 159-164 *

Also Published As

Publication number Publication date
CN115080689A (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN109189925B (en) Word vector model based on point mutual information and text classification method based on CNN
CN110717431B (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
Wang et al. Weakly supervised patchnets: Describing and aggregating local patches for scene recognition
CN112069811B (en) Electronic text event extraction method with multi-task interaction enhancement
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
Mohamed et al. Content-based image retrieval using convolutional neural networks
Tang et al. Learning multi-instance deep discriminative patterns for image classification
CN111159485B (en) Tail entity linking method, device, server and storage medium
Bergamo et al. Classemes and other classifier-based features for efficient object categorization
CN111859983B (en) Natural language labeling method based on artificial intelligence and related equipment
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
Varga et al. Fast content-based image retrieval using convolutional neural network and hash function
CN113254655B (en) Text classification method, electronic device and computer storage medium
Zhang et al. Large-scale aerial image categorization using a multitask topological codebook
Chen et al. Visual-based deep learning for clothing from large database
CN115080689B (en) Hidden space data enhanced multi-label text classification method based on fusion label association
Wu et al. TDv2: a novel tree-structured decoder for offline mathematical expression recognition
CN113806493A (en) Entity relationship joint extraction method and device for Internet text data
CN115982403A (en) Multi-mode hash retrieval method and device
Ha et al. Correlation-based deep learning for multimedia semantic concept detection
CN114153942A (en) Event time sequence relation extraction method based on dynamic attention mechanism
Rani et al. Visual recognition and classification of videos using deep convolutional neural networks
Athavale et al. Predicting algorithm classes for programming word problems
CN117216617A (en) Text classification model training method, device, computer equipment and storage medium
Xia Label oriented hierarchical attention neural network for short text classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant