CN115080689B - Hidden space data enhanced multi-label text classification method based on fusion label association

Hidden space data enhanced multi-label text classification method based on fusion label association

Info

Publication number
CN115080689B
CN115080689B (application CN202210679320.9A)
Authority
CN
China
Prior art keywords
data
label
text
tag
batch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210679320.9A
Other languages
Chinese (zh)
Other versions
CN115080689A (en)
Inventor
线岩团
苗育华
王红斌
文永华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210679320.9A
Publication of CN115080689A
Application granted
Publication of CN115080689B
Legal status: Active

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/3332 - Query translation
    • G06F16/3335 - Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hidden-space data-enhanced multi-label text classification method that fuses label associations. The method encodes data in batches, trains the encoder with a bidirectional LSTM and attention, mines prior knowledge from the label list, and finally matches the encoded data with the mined label prior knowledge in the hidden space to construct a batch of virtual data, on which the multi-label text model is further trained to complete multi-label text classification. Compared with other deep learning models, the method performs better on the main evaluation metric micro_F1: it reaches 72.08%, an improvement of 5.18, 3.28, and 2.38 percentage points over the traditional machine learning methods BR, CC, and LP respectively, and of 3.78, 2.38, and 1.08 percentage points over the neural network models LSTM, CNN-RNN, and SGM.

Description

Hidden space data enhanced multi-label text classification method based on fusion label association
Technical Field
The invention relates to a hidden-space data-enhanced multi-label text classification method that fuses label associations, and belongs to the technical field of natural language processing.
Background
Text classification is an important and classical problem in natural language processing: assigning texts to categories according to certain rules. Identifying categories in huge volumes of text with traditional manual classification or probability-statistics methods consumes enormous resources. As data volumes grow rapidly, each category requires finer-grained division, and a single sample may be related to several categories, so traditional single-label text classification no longer meets expectations. Research on multi-label text classification has therefore developed. Multi-label text classification is a subtask of text classification: it selects specific labels from a label set and assigns each instance the subset of most relevant class labels. Multi-label classification has many practical applications in real life. When facing news data carrying multiple labels, a multi-label text classification method can accurately locate the categories of public opinion by processing the topics. The task also applies to product tag classification in e-commerce, text annotation in biomedicine, category label classification in Wikipedia, and so on.
Compared with single-label classification, multi-label classification is better suited to real life and matches the characteristics and rules of objective objects. However, in real text the number of label categories is very large, and some labels have little related content, which causes severe label imbalance, while the label output space grows exponentially with the number of labels. For all multi-label text classification problems, when finer-grained label classification is required, the growing number of labels and label imbalance remain open problems. Existing methods often ignore the correlations among labels and only consider the influence of different labels on the same text, and therefore fail to mine the relations among the multiple labels of a text. A hidden-space data-enhanced multi-label text classification method that fuses label associations is therefore proposed.
Disclosure of Invention
The invention aims to provide a hidden-space data-enhanced multi-label text classification method that fuses label associations. Data are encoded in batches and trained with a bidirectional LSTM and attention; prior knowledge is mined from the label list; finally, the encoded data are matched with the mined label prior knowledge in the hidden space to construct a batch of virtual data, and the multi-label text model is further trained on them to complete multi-label text classification.
In order to achieve the technical purpose and the technical effect, the invention is realized by the following technical scheme:
The hidden-space data-enhanced multi-label text classification method fusing label associations comprises: preprocessing the dataset and the label relations to mine prior knowledge of the labels; constructing a multi-label text classification model based on an attention mechanism; matching the label prior knowledge with the existing data and transforming the associated ("contact") data into a batch of new virtual data in the hidden space; and further training the multi-label text model to complete multi-label text classification.
Further, the hidden-space data-enhanced multi-label text classification method fusing label associations comprises the following steps:
S1: preprocess the data and labels in the dataset with a Python program, handling stop words and labels in the texts so that each text and its labels are stored row by row in a csv file;
Count the labels and texts involved, calculate how many times each pair of labels co-occurs, and mine the training data to obtain prior knowledge of the various label relations;
S2: perform word embedding and encoding on the texts in turn; meanwhile, with the help of the prior knowledge, mine the contact data corresponding to each text in the original training batch to expand the data in the batch, and extract, through the attention layers, the features of the texts in the batch and the label-related text features;
S3: cross-fuse the mined label prior knowledge with the text features, so that the label features and text features of the contact data are transformed into a batch of virtual data in the hidden space;
S4: modify the original cross-entropy loss function, feed the enhanced data and the original data into the multi-label classification model for training, combine the loss on the virtual data in the hidden space with the loss on the original data at a certain ratio, and continuously refine the classification model to obtain the multi-label text classification results.
Further, S1 downloads the public AAPD original dataset from the web and preprocesses it. Given an example set of samples $\{S_1, S_2, S_3, S_4\}$ and their label representations under the label space $\{L_1, L_2, L_3, L_4\}$, the label co-occurrence matrix is obtained by counting how many times each pair of labels occurs together, with a label's influence on itself set to 0; the inter-label score matrix $L$ for the samples is then obtained by normalizing the rows of this matrix.
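For illustration, a minimal Python sketch of this co-occurrence counting and row normalization might look as follows (the helper name and the toy samples mirroring $\{S_1, \dots, S_4\}$ over $\{L_1, \dots, L_4\}$ are assumptions, not taken from the patent):

```python
import numpy as np

def label_score_matrix(label_sets, num_labels):
    """Count pairwise label co-occurrences and row-normalize them into the
    score matrix L described in S1. Illustrative helper, not from the patent."""
    C = np.zeros((num_labels, num_labels))
    for labels in label_sets:              # indices of one sample's labels
        for i in labels:
            for j in labels:
                if i != j:                 # a label's influence on itself is 0
                    C[i, j] += 1
    row_sums = C.sum(axis=1, keepdims=True)
    return np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)

# Toy example: four samples {S1..S4} labelled over the space {L1..L4}
L = label_score_matrix([[0, 1], [1, 2], [0, 2, 3], [3]], num_labels=4)
```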
Further, S2 is the strategy for constructing the attention-based multi-label text classification model. First, before data are fed into the model, the amount of data in a training batch is fixed at 128, and the contact data corresponding to the original texts in the batch are mined with the prior knowledge, expanding the amount of data in the batch to 256;
Then the input text is word-embedded by the word embedding module to obtain embedded representations of labels and text words. The 100-dimensional GloVe word vectors released by Stanford University are downloaded and used: the words $\{w_1, w_2, \dots, w_n\}$ in a text are converted, through the word embedding matrix and the label embedding matrix, into word vector representations $x = \{x_1, x_2, \dots, x_n\}$, where $x_i$ is the word vector of the $i$-th word and is obtained via the embedding matrix $V \in \mathbb{R}^{k \times |w|}$, with $|w|$ the vocabulary size and $k$ the embedding dimension;
The text sequence $x$ is then read in both directions with the bidirectional LSTM, and the hidden representation of each word is computed as

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h_{i-1}}, x_i), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h_{i+1}}, x_i)$$

Concatenating the hidden states of the two directions gives the final hidden representation of the $i$-th word, $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, which contains sequence information centered on the $i$-th word;
In the attention layers, the contextual features of each word are extracted with 4-head self-attention. Given the matrix of sequence hidden vectors $H \in \mathbb{R}^{n \times d}$, a single self-attention head projects $H$ onto three different matrices, the query matrix $Q$, the key matrix $K$, and the value matrix $V$:

$$Q, K, V = HW^Q, \; HW^K, \; HW^V$$

Scaled dot-product attention is then used to obtain the output representation

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
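To make the S2 encoder concrete, here is a minimal PyTorch sketch (PyTorch itself, the hidden size of 256, and the class name are assumptions; the patent only specifies 100-dimensional GloVe embeddings, a bidirectional LSTM, and 4 attention heads):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the S2 encoder: embeddings sized for 100d GloVe, a
    bidirectional LSTM, and 4-head self-attention over the hidden states."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=256, num_heads=4):
        super().__init__()
        # In practice the weights would be initialized from the 100d GloVe vectors
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        h, _ = self.bilstm(x)       # h_i concatenates the forward/backward states
        # Q, K, V are linear projections of H computed inside MultiheadAttention,
        # followed by scaled dot-product attention softmax(QK^T / sqrt(d_k)) V
        out, _ = self.attn(h, h, h)
        return out

encoder = TextEncoder(vocab_size=30000)
features = encoder(torch.randint(0, 30000, (256, 64)))  # expanded batch of 256
```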
Further, S3 cross-fuses the mined label prior knowledge with the text features: the original data and the contact data in the S2 batch are split apart, and data enhancement is performed on the 128 contact items fed in with each batch together with the label feature vectors;
Further, the data enhancement combines, at a certain proportion, the hidden-space representation $s_1 = \{h_1, h_2, \dots, h_n\}$ already obtained for the original data in the current batch with the hidden-space representation obtained for the corresponding contact data, yielding the text feature representation of the new virtual data. For the hidden-space label features of the virtual data, the positions where the original data and the virtual data agree on a label are kept; where they disagree, the label score matrix $L$ from S1 is queried to obtain the influence score of the current data's other labels on that label, and the label features of the virtual data are then constructed by random sampling from a Bernoulli distribution, finally producing virtual data based on the hidden space;
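A hedged sketch of this virtual-data construction, assuming multi-hot label vectors and a fixed mixing coefficient `alpha` (the function name and the exact sampling rule are illustrative, not taken from the patent):

```python
import torch

def make_virtual_batch(h_orig, h_assoc, y_orig, y_assoc, L_scores, alpha=0.5):
    """Mix original and contact hidden representations, keep the labels on
    which both agree, and Bernoulli-sample the disputed labels using the
    score matrix L from S1."""
    # Hidden-space interpolation of the text features
    h_virtual = alpha * h_orig + (1.0 - alpha) * h_assoc

    agree = (y_orig == y_assoc).float()          # 1 where the label sets agree
    # Influence scores of the current sample's active labels on each label
    probs = (y_orig @ L_scores).clamp(0.0, 1.0)
    sampled = torch.bernoulli(probs)             # random draw for disputed labels
    y_virtual = agree * y_orig + (1.0 - agree) * sampled
    return h_virtual, y_virtual
```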
Further, S4 modifies the original cross-entropy loss function; the loss function is

$$\mathcal{L} = \log\Big(1 + \sum_{i \in \Omega_{neg}} e^{s_i}\Big) + \log\Big(1 + \sum_{j \in \Omega_{pos}} e^{-s_j}\Big)$$

where $\Omega_{neg}$ and $\Omega_{pos}$ denote the negative and positive class sets of the sample, $s_i$ is the score of the $i$-th non-target class, $s_j$ is the score of the $j$-th target class, and an additional class-0 score $s_0$ fixed at 0 serves as the threshold.
Finally, the implicit representations of the original data and the virtual data are taken as the final hidden states of the model, giving the text feature representation of the text itself: $c = \{c_1, c_2, \dots, c_k\}$;
Further, in S4 the enhanced data and the original data are fed into the multi-label classification model for training, with the formula

$$\mathcal{L}_{total} = \mathcal{L}_{original} + \lambda\, \mathcal{L}_{virtual}$$

obtaining the multi-label text classification results; the loss $\mathcal{L}_{virtual}$ on the virtual data in the hidden space and the loss $\mathcal{L}_{original}$ on the original data are combined through the ratio $\lambda$.
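A PyTorch sketch of the modified loss and the λ-weighted combination (the logsumexp formulation and the tensor shapes are assumptions consistent with the formula above; the value of λ is illustrative):

```python
import torch

def multilabel_loss(scores, targets):
    """log(1 + sum_neg e^{s_i}) + log(1 + sum_pos e^{-s_j}), with the extra
    class-0 score s_0 = 0 acting as the decision threshold."""
    neg = scores.masked_fill(targets.bool(), float('-inf'))      # keep non-target s_i
    pos = (-scores).masked_fill(~targets.bool(), float('-inf'))  # keep -s_j for targets
    zero = torch.zeros_like(scores[..., :1])                     # the s_0 = 0 term
    return (torch.logsumexp(torch.cat([zero, neg], dim=-1), dim=-1)
            + torch.logsumexp(torch.cat([zero, pos], dim=-1), dim=-1)).mean()

# Combining the original-data and virtual-data losses at the ratio lambda
lam = 0.5  # illustrative; the patent does not fix the value
scores_orig, y_orig = torch.randn(4, 54), torch.randint(0, 2, (4, 54)).float()
scores_virt, y_virt = torch.randn(4, 54), torch.randint(0, 2, (4, 54)).float()
loss = multilabel_loss(scores_orig, y_orig) + lam * multilabel_loss(scores_virt, y_virt)
```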
The invention has the beneficial effects that:
The hidden-space data-enhanced multi-label text classification method fusing label associations encodes data in batches, trains with a bidirectional LSTM and attention, mines prior knowledge from the label list, and finally matches the encoded data with the mined label prior knowledge in the hidden space to construct a batch of virtual data, on which the multi-label text model is further trained to complete multi-label text classification;
The method captures the relations between the label structures of all texts, and the hidden-space data enhancement effectively alleviates the label imbalance problem in multi-label classification;
Compared with other deep learning models, the method performs better on the main evaluation metric micro_F1: it reaches 72.08%, an improvement of 5.18, 3.28, and 2.38 percentage points over the traditional machine learning methods BR, CC, and LP respectively, and of 3.78, 2.38, and 1.08 percentage points over the neural network models LSTM, CNN-RNN, and SGM.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
FIG. 1 is a model diagram of the hidden-space data-enhanced multi-label text classification method fusing label associations according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating the preprocessing of the dataset and label relations according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the cross-fusion of the mined label prior knowledge with the text features according to an embodiment of the present invention.
Detailed Description
In order to describe the technical solutions of the embodiments of the present invention more clearly, the embodiments are described in detail below with reference to the accompanying drawings.
The hidden-space data-enhanced multi-label text classification method fusing label associations comprises: preprocessing the dataset and the label relations to mine prior knowledge of the labels; constructing a multi-label text classification model based on an attention mechanism; matching the label prior knowledge with the existing data and transforming the associated ("contact") data into a batch of new virtual data in the hidden space; and further training the multi-label text model to complete multi-label text classification.
Example 1
As shown in FIG. 1, the specific implementation steps of the hidden-space data-enhanced multi-label text classification method fusing label associations are as follows:
S1: preprocess the data and labels in the dataset with a Python program, handling stop words and labels in the texts so that each text and its labels are stored row by row in a csv file;
Count the labels and texts involved, calculate how many times each pair of labels co-occurs, and mine the training data to obtain prior knowledge of the various label relations;
The public AAPD original dataset is downloaded from the web for the research; it contains the abstracts and corresponding subjects of 55,840 papers in the field of computer science, with 54 class labels in total. The specific dataset split is shown in Table 1.
TABLE 1 Experimental dataset split (unit: entries)
Data preprocessing is implemented with a Python program: stop words and labels in the texts are handled so that each text and its labels are stored row by row in a csv file;
As shown in FIG. 2, S1 downloads the public AAPD original dataset from the web and preprocesses it. Given the example samples $\{S_1, S_2, S_3, S_4\}$ and their label representations under the label space $\{L_1, L_2, L_3, L_4\}$, the label co-occurrence matrix is obtained by counting how many times each pair of labels occurs together, with a label's influence on itself set to 0; the inter-label score matrix $L$ for the samples is then obtained by normalizing the rows of this matrix.
S2: perform word embedding and encoding on the texts in turn; meanwhile, with the help of the prior knowledge, mine the contact data corresponding to each text in the original training batch to expand the data in the batch, and extract, through the attention layers, the features of the texts in the batch and the label-related text features;
The strategy for constructing the attention-based multi-label text classification model is as follows: first, before data are fed into the model, the amount of data in a training batch is fixed at 128, and the contact data corresponding to the original texts in the batch are mined with the prior knowledge, expanding the amount of data in the batch to 256;
Then the input text is word-embedded by the word embedding module to obtain embedded representations of labels and text words. The 100-dimensional GloVe word vectors released by Stanford University are downloaded and used: the words $\{w_1, w_2, \dots, w_n\}$ in a text are converted, through the word embedding matrix and the label embedding matrix, into word vector representations $x = \{x_1, x_2, \dots, x_n\}$, where $x_i$ is the word vector of the $i$-th word and is obtained via the embedding matrix $V \in \mathbb{R}^{k \times |w|}$, with $|w|$ the vocabulary size and $k$ the embedding dimension;
The text sequence $x$ is then read in both directions with the bidirectional LSTM, and the hidden representation of each word is computed as

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h_{i-1}}, x_i), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h_{i+1}}, x_i)$$

Concatenating the hidden states of the two directions gives the final hidden representation of the $i$-th word, $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, which contains sequence information centered on the $i$-th word;
In the attention layers, the contextual features of each word are extracted with 4-head self-attention. Given the matrix of sequence hidden vectors $H \in \mathbb{R}^{n \times d}$, a single self-attention head projects $H$ onto three different matrices, the query matrix $Q$, the key matrix $K$, and the value matrix $V$:

$$Q, K, V = HW^Q, \; HW^K, \; HW^V$$

Scaled dot-product attention is then used to obtain the output representation

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
S3: cross-fuse the mined label prior knowledge with the text features, so that the label features and text features of the contact data are transformed into a batch of virtual data in the hidden space;
As a preferred solution of the present invention, in Step 3 the cross-fusion of the mined label prior knowledge with the text features is performed as shown in FIG. 3:
The original data and the contact data in the S2 batch are split apart, and data enhancement is performed on the 128 contact items fed in with each batch together with the label feature vectors;
Further, the data enhancement combines, at a certain proportion, the hidden-space representation $s_1 = \{h_1, h_2, \dots, h_n\}$ already obtained for the original data in the current batch with the hidden-space representation obtained for the corresponding contact data, yielding the text feature representation of the new virtual data. For the hidden-space label features of the virtual data, the positions where the original data and the virtual data agree on a label are kept; where they disagree, the label score matrix $L$ from S1 is queried to obtain the influence score of the current data's other labels on that label, and the label features of the virtual data are then constructed by random sampling from a Bernoulli distribution, finally producing virtual data based on the hidden space;
S4: modify the original cross-entropy loss function, feed the enhanced data and the original data into the multi-label classification model for training, combine the loss on the virtual data in the hidden space with the loss on the original data at a certain ratio, and continuously refine the classification model to obtain the multi-label text classification results.
S4 modifies the original cross-entropy loss function; the loss function is

$$\mathcal{L} = \log\Big(1 + \sum_{i \in \Omega_{neg}} e^{s_i}\Big) + \log\Big(1 + \sum_{j \in \Omega_{pos}} e^{-s_j}\Big)$$

where $\Omega_{neg}$ and $\Omega_{pos}$ denote the negative and positive class sets of the sample, $s_i$ is the score of the $i$-th non-target class, $s_j$ is the score of the $j$-th target class, and an additional class-0 score $s_0$ fixed at 0 serves as the threshold.
Finally, the implicit representations of the original data and the virtual data are taken as the final hidden states of the model, giving the text feature representation of the text itself: $c = \{c_1, c_2, \dots, c_k\}$;
Further, the enhanced data and the original data are fed into the multi-label classification model for training, with the formula

$$\mathcal{L}_{total} = \mathcal{L}_{original} + \lambda\, \mathcal{L}_{virtual}$$

obtaining the multi-label text classification results; the loss $\mathcal{L}_{virtual}$ on the virtual data in the hidden space and the loss $\mathcal{L}_{original}$ on the original data are combined through the ratio $\lambda$.
Example 2
Taking data imbalance in multi-label text classification as the problem to be addressed, a hidden-space data-enhanced multi-label text classification model fusing label associations was designed on the basis of the method of Example 1 and compared with the following baseline models:
BR (Binary Relevance): converts the multi-label classification task into several independent single-label classification problems by ignoring the correlations between labels and building a separate classifier for each label (a minimal sketch is given after this list).
CC (Classifier Chains): converts the multi-label classification task into a chain of binary classification problems in which each classifier conditions on the predictions of the previous ones, thereby taking higher-order label correlations into account; an incorrect early prediction, however, propagates to subsequent labels.
LP (Label Powerset): converts the multi-label problem into a multi-class problem by treating every unique label combination as one class and training a single multi-class classifier.
CNN-RNN: fuses a CNN and an RNN; word vectors are first fed into the CNN to obtain text feature sequences, and these features are then input into the RNN to obtain the predicted labels. The CNN and RNN capture global and local text semantics and model label correlations.
LSTM: applies a long short-term memory network to account for the sequential structure of the text and to alleviate exploding and vanishing gradients.
SGM: treats the multi-label classification task as a sequence generation problem, using the idea of generation to model the correlations between labels, with seq2seq as the multi-class classifier.
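As a hedged illustration of the simplest baseline only (not the patent's method), BR can be sketched with scikit-learn's OneVsRestClassifier, which trains one binary classifier per label; the toy texts and labels are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["graph neural networks for text", "convex optimization methods"]
labels = [["cs.LG", "cs.CL"], ["math.OC"]]

Y = MultiLabelBinarizer().fit_transform(labels)            # multi-hot targets
X = TfidfVectorizer().fit_transform(texts)                 # bag-of-words features
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)  # one classifier per label
preds = clf.predict(X)                                     # independent per-label decisions
```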
The results and analysis of the multi-label text classification experiments on the AAPD dataset (abstracts and corresponding topics of 55,840 computer science papers) are as follows:
Table 2 Multi-label text classification experimental results
Table 2 shows that, compared with other deep learning models, the proposed method performs better on the main evaluation metric micro_F1: it reaches 72.08%, an improvement of 5.18, 3.28, and 2.38 percentage points over the traditional machine learning methods BR, CC, and LP respectively, and of 3.78, 2.38, and 1.08 percentage points over the neural network models LSTM, CNN-RNN, and SGM.
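For reference, micro_F1 over multi-hot predictions can be computed with scikit-learn; the arrays below are toy values, illustrative only:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
micro_f1 = f1_score(y_true, y_pred, average="micro")  # pools TP/FP/FN over all labels
```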
The hidden-space data-enhanced multi-label text classification method fusing label associations encodes data in batches, trains with a bidirectional LSTM and attention, mines prior knowledge from the label list, and finally matches the encoded data with the mined label prior knowledge in the hidden space to construct a batch of virtual data, on which the multi-label text model is further trained to complete multi-label text classification;
The method captures the relations between the label structures of all texts, and the hidden-space data enhancement effectively alleviates the label imbalance problem in multi-label classification.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to help explain the invention. They are not exhaustive and do not limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (3)

1. The hidden space data enhanced multi-label text classification method based on fusion label association is characterized by: preprocessing the dataset and the label relations to mine prior knowledge of the labels; constructing a multi-label text classification model based on an attention mechanism; matching the label prior knowledge with the existing data and transforming the associated contact data into a batch of new virtual data in the hidden space; and further training the multi-label text model to complete multi-label text classification; the method specifically comprises the following steps:
S1: preprocess the data and labels in the dataset with a Python program, handling stop words and labels in the texts so that each text and its labels are stored row by row in a csv file; count the labels and texts involved, calculate how many times each pair of labels co-occurs, and mine the training data to obtain prior knowledge of the various label relations; the public AAPD original dataset is downloaded from the web and preprocessed; given the example samples $\{S_1, S_2, S_3, S_4\}$ and their label representations under the label space $\{L_1, L_2, L_3, L_4\}$, the label co-occurrence matrix is obtained by counting how many times each pair of labels occurs together, with a label's influence on itself set to 0, and the inter-label score matrix $L$ for the samples is then obtained by normalizing the rows of this matrix;
S2: perform word embedding and encoding on the texts in turn; meanwhile, with the help of the prior knowledge, mine the contact data corresponding to each text in the original training batch to expand the data in the batch, and extract, through the attention layers, the features of the texts in the batch and the label-related text features; in the strategy for constructing the attention-based multi-label text classification model, the amount of data in a training batch is first fixed at 128 before it is fed into the model, and the contact data corresponding to the original texts in the batch are mined with the prior knowledge, expanding the amount of data in the batch to 256;
Then the input text is word-embedded by the word embedding module to obtain embedded representations of labels and text words. The 100-dimensional GloVe word vectors released by Stanford University are downloaded and used: the words $\{w_1, w_2, \dots, w_n\}$ in a text are converted, through the word embedding matrix and the label embedding matrix, into word vector representations $x = \{x_1, x_2, \dots, x_n\}$, where $x_i$ is the word vector of the $i$-th word and is obtained via the embedding matrix $V \in \mathbb{R}^{k \times |w|}$, with $|w|$ the vocabulary size and $k$ the embedding dimension;
The text sequence $x$ is then read in both directions with the bidirectional LSTM, and the hidden representation of each word is computed as

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h_{i-1}}, x_i), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h_{i+1}}, x_i)$$

Concatenating the hidden states of the two directions gives the final hidden representation of the $i$-th word, $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, which contains sequence information centered on the $i$-th word;

In the attention layers, the contextual features of each word are extracted with 4-head self-attention. Given the matrix of sequence hidden vectors $H \in \mathbb{R}^{n \times d}$, a single self-attention head projects $H$ onto three different matrices, the query matrix $Q$, the key matrix $K$, and the value matrix $V$:

$$Q, K, V = HW^Q, \; HW^K, \; HW^V$$

Scaled dot-product attention is then used to obtain the output representation

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
S3: cross-fuse the mined label prior knowledge with the text features, so that the label features and text features of the contact data are transformed into a batch of virtual data in the hidden space; specifically, the original data and the contact data in the S2 batch are split apart, and data enhancement is performed on the 128 contact items fed in with each batch together with the label feature vectors; the data enhancement combines, at a certain proportion, the hidden-space representation $s_1 = \{h_1, h_2, \dots, h_n\}$ already obtained for the original data in the current batch with the hidden-space representation obtained for the corresponding contact data, yielding the text feature representation of the new virtual data; for the hidden-space label features of the virtual data, the positions where the original data and the virtual data agree on a label are kept, and where they disagree, the label score matrix $L$ from S1 is queried to obtain the influence score of the current data's other labels on that label, after which the label features of the virtual data are constructed by random sampling from a Bernoulli distribution, finally producing virtual data based on the hidden space;
S4: modify the original cross-entropy loss function, feed the enhanced data and the original data into the multi-label classification model for training, combine the loss on the virtual data in the hidden space with the loss on the original data at a certain ratio, and continuously refine the classification model to obtain the multi-label text classification results.
2. The hidden space data enhanced multi-label text classification method based on fusion label association of claim 1, characterized in that: S4 modifies the original cross-entropy loss function, the loss function being

$$\mathcal{L} = \log\Big(1 + \sum_{i \in \Omega_{neg}} e^{s_i}\Big) + \log\Big(1 + \sum_{j \in \Omega_{pos}} e^{-s_j}\Big)$$

where $\Omega_{neg}$ and $\Omega_{pos}$ denote the negative and positive class sets of the sample, $s_i$ is the score of the $i$-th non-target class, $s_j$ is the score of the $j$-th target class, and an additional class-0 score $s_0$ fixed at 0 serves as the threshold;
Finally, the implicit representations of the original data and the virtual data are taken as the final hidden states of the model, giving the text feature representation of the text itself: $c = \{c_1, c_2, \dots, c_k\}$.
3. The hidden space data enhanced multi-label text classification method based on fusion label association of claim 2, characterized in that: in S4, the enhanced data and the original data are fed into the multi-label classification model for training, with the formula

$$\mathcal{L}_{total} = \mathcal{L}_{original} + \lambda\, \mathcal{L}_{virtual}$$

obtaining the multi-label text classification results; the loss $\mathcal{L}_{virtual}$ on the virtual data in the hidden space and the loss $\mathcal{L}_{original}$ on the original data are combined through the ratio $\lambda$.
CN202210679320.9A 2022-06-15 2022-06-15 Hidden space data enhanced multi-label text classification method based on fusion label association Active CN115080689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210679320.9A CN115080689B (en) 2022-06-15 2022-06-15 Hidden space data enhanced multi-label text classification method based on fusion label association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210679320.9A CN115080689B (en) 2022-06-15 2022-06-15 Hidden space data enhanced multi-label text classification method based on fusion label association

Publications (2)

Publication Number Publication Date
CN115080689A CN115080689A (en) 2022-09-20
CN115080689B (en) 2024-05-07

Family

ID=83253870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210679320.9A Active CN115080689B (en) 2022-06-15 2022-06-15 Hidden space data enhanced multi-label text classification method based on fusion label association

Country Status (1)

Country Link
CN (1) CN115080689B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795037B (en) * 2022-12-26 2023-10-20 淮阴工学院 Multi-label text classification method based on label perception


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899253A (en) * 2015-05-13 2015-09-09 复旦大学 Cross-modality image-label relevance learning method facing social image
CN109376239A (en) * 2018-09-29 2019-02-22 山西大学 A kind of generation method of the particular emotion dictionary for the classification of Chinese microblog emotional
CN111695052A (en) * 2020-06-12 2020-09-22 上海智臻智能网络科技股份有限公司 Label classification method, data processing device and readable storage medium
CN113806645A (en) * 2020-06-12 2021-12-17 上海智臻智能网络科技股份有限公司 Label classification system and training system of label classification model
CN113312480A (en) * 2021-05-19 2021-08-27 北京邮电大学 Scientific and technological thesis level multi-label classification method and device based on graph convolution network
CN113626589A (en) * 2021-06-18 2021-11-09 电子科技大学 Multi-label text classification method based on mixed attention mechanism
CN113806547A (en) * 2021-10-15 2021-12-17 南京大学 Deep learning multi-label text classification method based on graph model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Multi-label topic model conditioned on label embedding; Lin Tang et al.; 2019 IEEE International Conference on Computer Science and Educational Informatization (CSEI); 2019-08-19; pp. 1-10 *
Semi-supervised sentiment classification method for question-answering text based on network representation; Chen Xiao, Li Yiwei, Liu Huan, Li Shoushan; Journal of Zhengzhou University (Natural Science Edition); 2020-05-22; Vol. 52, No. 2; pp. 52-58 *
Hidden space data enhanced multi-label text classification method fusing label association; Miao Yuhua et al.; Modern Electronics Technique; 2023-12-12; Vol. 46, No. 24; pp. 159-164 *

Also Published As

Publication number Publication date
CN115080689A (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN109189925B (en) Word vector model based on point mutual information and text classification method based on CNN
CN110717431B (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
Wang et al. Weakly supervised patchnets: Describing and aggregating local patches for scene recognition
CN112069811B (en) Electronic text event extraction method with multi-task interaction enhancement
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
Mohamed et al. Content-based image retrieval using convolutional neural networks
Tang et al. Learning multi-instance deep discriminative patterns for image classification
CN111159485B (en) Tail entity linking method, device, server and storage medium
Bergamo et al. Classemes and other classifier-based features for efficient object categorization
CN111859983B (en) Natural language labeling method based on artificial intelligence and related equipment
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
Varga et al. Fast content-based image retrieval using convolutional neural network and hash function
CN113254655B (en) Text classification method, electronic device and computer storage medium
Zhang et al. Large-scale aerial image categorization using a multitask topological codebook
Chen et al. Visual-based deep learning for clothing from large database
CN115080689B (en) Hidden space data enhanced multi-label text classification method based on fusion label association
Wu et al. TDv2: a novel tree-structured decoder for offline mathematical expression recognition
CN113806493A (en) Entity relationship joint extraction method and device for Internet text data
CN115982403A (en) Multi-mode hash retrieval method and device
Ha et al. Correlation-based deep learning for multimedia semantic concept detection
CN114153942A (en) Event time sequence relation extraction method based on dynamic attention mechanism
Rani et al. Visual recognition and classification of videos using deep convolutional neural networks
Athavale et al. Predicting algorithm classes for programming word problems
CN117216617A (en) Text classification model training method, device, computer equipment and storage medium
Xia Label oriented hierarchical attention neural network for short text classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant