CN114896388A - Hierarchical multi-label text classification method based on mixed attention - Google Patents

Hierarchical multi-label text classification method based on mixed attention

Info

Publication number
CN114896388A
Authority
CN
China
Prior art keywords
label
text
node
hierarchical
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210216140.7A
Other languages
Chinese (zh)
Inventor
马小林
钟港
旷海兰
刘新华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202210216140.7A
Publication of CN114896388A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/243 - Classification techniques relating to the number of classes
    • G06F 18/24323 - Tree-organised classifiers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/151 - Transformation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/166 - Editing, e.g. inserting or deleting
    • G06F 40/183 - Tabulation, i.e. one-dimensional positioning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/237 - Lexical tools
    • G06F 40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a hierarchical multi-label text classification method based on mixed attention. Pre-trained word vectors are used as word embeddings, and a Bi-GRU performs preliminary feature extraction on the input embeddings. A graph convolutional neural network models the hierarchical label structure and generates label representations that encode label relevance. Several convolutional neural networks with different kernel sizes then extract local features of different granularities from the Bi-GRU output; after max pooling, these are concatenated into a text feature, from which further features are extracted with label-based attention. In parallel, a self-attention mechanism extracts global features from the Bi-GRU output. The label-based text features and the self-attention-based text features are adaptively fused to obtain a mixed-attention text representation, inter-label information is extracted through a relation network, and the final classification result is obtained through a multilayer perceptron.

Description

Hierarchical multi-label text classification method based on mixed attention
Technical Field
The invention relates to the technical field of computer information and the field of natural language processing, in particular to a hierarchical multi-label text classification method based on mixed attention.
Background
With the advent of the Internet era, people can access all kinds of information far more easily, and media data of every kind are generated continuously. This provides the basic conditions for mining valuable data on the Internet; without efficient management and knowledge-acquisition means, such massive data would undoubtedly go to waste. Within data mining, text classification is one of the core problems.
The task of multi-label text classification is to select, from a given label set, the subset of labels most relevant to the text content. In real-world scenarios, much data is associated with several labels from the label set, and these labels concisely describe the specific content of the data, allowing people to manage and further analyze massive data more conveniently and effectively. Hierarchical multi-label text classification is a special case of multi-label text classification in which the label system has a hierarchical structure. General multi-label text classification algorithms neither consider the influence of the hierarchical label structure on classification performance nor make full use of the association information among text labels, so their label assignments are not accurate enough; in particular, the classification of data with long-tailed distributions still has considerable room for improvement. Meanwhile, most existing models focus on either the local features or the global features of the text without considering both together, so important classification-relevant features are not captured sufficiently.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a hierarchical multi-label text classification method based on mixed attention, which aims to improve the performance of hierarchical multi-label text classification by using the label hierarchy for label semantic representation and by making full use of the global and local semantic information of the text.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows: a hierarchical multi-label text classification method based on mixed attention comprises the following steps:
S1, preprocessing multi-label text data; the text data is used for training the model and consists of text content and a corresponding label set; the label categories of the whole data set are organized in a tree graph with a hierarchical relation, in which each node represents one label category, and the labels of each sample text in the data set come from nodes of this label tree;
S2, for the text labels, obtaining the prior hierarchy information of the hierarchical classification system, where the prior hierarchy information refers to the prior probability of dependence between labels and can be obtained by calculating the transition probabilities between parent labels and child labels;
S3, constructing a deep learning hierarchical multi-label text classification model;
the deep learning multi-label text classification model comprises a word embedding module, a text encoding module, a label encoding module, a text representation module based on a label attention mechanism, a text representation module based on a self-attention mechanism, a feature fusion module, a vector regression layer, a relation network module and a label probability prediction layer;
S4, inputting the preprocessed text data of the data set into the model for training; after model training is finished, classifying multi-label texts with the trained model.
In the above technical solution, step S1 comprises performing data preprocessing on the samples in data set D, specifically: step 1.1, performing word segmentation on the text, removing stop words and removing punctuation marks; step 1.2, counting the word frequency word_frequency of the text in data set D, deleting words whose occurrence frequency is less than X1, recording the remaining words, and constructing a vocabulary. After data set D is preprocessed, it is divided into a training set, a validation set and a test set according to a certain proportion.
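By way of illustration only, a minimal Python sketch of this preprocessing step could look as follows; the tokenizer, the frequency threshold X1 = 5 and the 3:1:1 split ratio of the preferred embodiment are assumptions of the sketch, not fixed by the method itself:

import re
import random
from collections import Counter

def preprocess(samples, stopwords, min_freq=5):
    """samples: list of (text, label_set). Returns tokenized samples and a vocabulary."""
    tokenized = []
    for text, labels in samples:
        # step 1.1: crude word segmentation, stop-word and punctuation removal
        tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in stopwords]
        tokenized.append((tokens, labels))

    # step 1.2: count word frequencies and drop words occurring fewer than min_freq times
    freq = Counter(t for tokens, _ in tokenized for t in tokens)
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, count in freq.items():
        if count >= min_freq:
            vocab[word] = len(vocab)
    return tokenized, vocab

def split_dataset(samples, ratios=(3, 1, 1), seed=0):
    """Divide the preprocessed data set into training, validation and test sets."""
    random.Random(seed).shuffle(samples)
    total = sum(ratios)
    n_train = len(samples) * ratios[0] // total
    n_val = len(samples) * ratios[1] // total
    return samples[:n_train], samples[n_train:n_train + n_val], samples[n_train + n_val:]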
In the above technical solution, step S2 comprises: for the data of the training set in data set D, assume that a hierarchical path e_{i,j} exists between a parent node v_i and a child node v_j; the feature f(e_{i,j}) of the edge formed by the parent-child path is then represented by the prior probabilities p(U_j|U_i) and p(U_i|U_j):

f(e_{i,j}) = p(U_j | U_i) = P(U_j ∩ U_i) / P(U_i) = N_j / N_i   (parent to child)
f(e_{j,i}) = p(U_i | U_j) = P(U_i ∩ U_j) / P(U_j) = 1           (child to parent)

f(e_{i,j}) expresses the relation between the two nodes, described by their transition probability or co-occurrence probability. The transition probabilities comprise the transition probability p(U_j|U_i) from the parent node to one of its child nodes and the transition probability p(U_i|U_j) from the child node to the parent node. If the parent label node has only one child node, p(U_j|U_i) equals 1; if there are several child labels, each value is less than 1, but their sum is 1. In the formula, U_j and U_i denote the text samples labeled with node v_j and node v_i respectively, p(U_j|U_i) is the conditional probability of carrying the v_j node label given the v_i node label, P(U_j ∩ U_i) is the probability of carrying {v_j, v_i} simultaneously, and N_j and N_i denote the numbers of v_j node labels and v_i node labels in the training set respectively.
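For illustration, these prior transition probabilities can be estimated from training-set label counts roughly as in the following sketch; representing the label tree through a parent map is an assumption of the sketch:

from collections import Counter, defaultdict

def prior_hierarchy_probabilities(train_labels, parent_of):
    """train_labels: list of label sets, one per training sample.
    parent_of: dict mapping each child label to its parent label.
    Returns f[(i, j)] = p(U_j | U_i) for every directed edge of the label tree."""
    counts = Counter(label for labels in train_labels for label in labels)

    f = defaultdict(float)
    for child, parent in parent_of.items():
        n_child, n_parent = counts[child], counts[parent]
        # top-down edge: p(child | parent) = N_j / N_i
        # (per the description, the values over the children of one parent sum to 1)
        f[(parent, child)] = n_child / n_parent if n_parent else 0.0
        # bottom-up edge: every sample carrying the child label also carries the parent label
        f[(child, parent)] = 1.0
    return f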
In the above technical solution, step S3 further comprises performing word embedding on the input text and its labels through the word embedding module, specifically:
step 2.1, obtaining the preprocessed text sequence and converting the words {x_1, x_2, ..., x_n} in the text into the word vector representation {w_1, w_2, ..., w_n} by looking them up in the word embedding dictionary, where n is the number of words in the preprocessed text;
step 2.2, obtaining the label set {l_1, l_2, ..., l_C} of the hierarchical multi-label text classification and converting it into a label embedding set {c_1, c_2, ..., c_C} of dimension d by means of Kaiming initialization, where C is the number of label categories.
In the above technical solution, step S3 further comprises encoding the word vector representation {w_1, w_2, ..., w_n} through the text encoding module, specifically:
the word vector representation {w_1, w_2, ..., w_n} of the text is encoded with a Bi-GRU network to generate an implicit representation {h_1, h_2, ..., h_n} carrying contextual semantic information; the implicit representation {h_1, h_2, ..., h_n} is then fed into three convolutions with different kernel sizes to obtain semantic vectors under three different receptive fields, and the three semantic vectors are finally concatenated into a new semantic representation vector S = {s_1, s_2, ..., s_n}.
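A PyTorch-style sketch of such a text encoder is given below purely as an illustration; the kernel sizes 2/3/4 and 100 channels are taken from the preferred embodiment further on, the remaining names and defaults are assumptions, and the sketch keeps a per-position sequence S so that the label attention described next can attend over it:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Bi-GRU followed by parallel convolutions with different kernel sizes."""
    def __init__(self, emb_dim=300, hidden=150, kernel_sizes=(2, 3, 4), channels=100):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * hidden, channels, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, w):                      # w: (batch, n, emb_dim) word vectors
        h, _ = self.gru(w)                     # h: (batch, n, 2*hidden) contextual states
        x = h.transpose(1, 2)                  # (batch, 2*hidden, n) for Conv1d
        # local features under three different receptive fields, concatenated per position
        s = torch.cat([F.relu(conv(x))[:, :, : h.size(1)] for conv in self.convs], dim=1)
        return h, s.transpose(1, 2)            # h feeds self-attention, s feeds label attention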
In the above technical solution, step S3 further comprises encoding the label vector representation {c_1, c_2, ..., c_C} through the label encoding module, specifically:
a single-layer GCN is used to encode the label vector representation {c_1, c_2, ..., c_C} and generate an implicit representation M = {m_1, m_2, ..., m_C} carrying label hierarchy association information. The implementation process is as follows:
the hierarchical GCN aggregates data flows along top-down, bottom-up and self-loop edges. In the hierarchical GCN, each directed edge represents a pair of label-related features, and these data flows are transformed per node with an edge-wise linear transformation.
To realize the node transformation, the invention represents the linear transformation with a weighted adjacency matrix whose initial values come from the prior hierarchy information of the hierarchical classification system obtained in the second step. Formally, the hierarchical GCN encodes the hidden state of a node k according to its associated neighborhood N(k) = {n_k, child(k), parent(k)}, where n_k is the k-th label node itself, child(k) its child label nodes and parent(k) its parent label node. The hidden state of node k is computed as:

u_{k,j} = a_{k,j} · (W_l^{d(j,k)} h_j) + b_{l,k}^{d(j,k)}
g_{k,j} = σ(w_g^{d(j,k)} · h_j + b_{g,k}^{d(j,k)})
h_k = ReLU( Σ_{j ∈ N(k)} g_{k,j} · u_{k,j} )

In the above formulas, W_l and w_g are trainable weight parameters, and b_l ∈ R^{C×dim} and b_g ∈ R^C are trainable bias parameters; u_{k,j} can be understood as the information passed from node j to node k, and g_{k,j} as a gate value that controls how much u_{k,j} finally influences node k; σ denotes an activation function of deep learning and can be taken as the sigmoid function; dim is the dimension of the vectors and is a predefined hyperparameter. d(j,k) denotes the hierarchical direction from node j to node k, which may be a top-down, bottom-up or self-loop edge. Here a_{k,j} ∈ R denotes the hierarchical probability f_{d(k,j)}(e_{k,j}), i.e. the transition probability from the k-th node to the j-th label node, obtained from f(e_{i,j}) above; self-loop edges take a_{k,k} = 1, top-down edges use f_c(e_{k,j}) = p(U_j|U_k), and bottom-up edges use f_p(e_{j,k}) = 1. The feature matrix of these edges, F = {a_{0,0}, a_{0,1}, ..., a_{C-1,C-1}}, is the weighted adjacency matrix of the directed hierarchical graph of text labels. Finally, the output hidden state h_k of node k is its label representation containing the hierarchy information.
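A minimal sketch of this gated, hierarchy-aware graph convolution is shown below; the parameter shapes follow the description above, while using one shared weight matrix for all edge directions (instead of separate top-down, bottom-up and self-loop weights) and treating the gate as a scalar per edge are simplifying assumptions of the sketch:

import torch
import torch.nn as nn

class HierarchyGCN(nn.Module):
    """Single-layer GCN over the label tree with top-down, bottom-up and self-loop edges."""
    def __init__(self, num_labels, dim, adjacency):
        super().__init__()
        # adjacency: (num_labels, num_labels) weighted matrix a_{k,j} built from the
        # prior transition probabilities of step S2 (self-loop entries set to 1).
        self.register_buffer("a", adjacency)
        self.w_l = nn.Linear(dim, dim, bias=False)         # W_l: edge-wise linear transform
        self.b_l = nn.Parameter(torch.zeros(num_labels, dim))
        self.w_g = nn.Linear(dim, 1, bias=False)           # w_g: produces a scalar gate
        self.b_g = nn.Parameter(torch.zeros(num_labels, 1))

    def forward(self, c):                                  # c: (num_labels, dim) label embeddings
        # messages[k, j] = a_{k,j} * W_l c_j + b_{l,k}
        messages = self.a.unsqueeze(-1) * self.w_l(c).unsqueeze(0) + self.b_l.unsqueeze(1)
        # gates[k, j] = sigma(w_g . c_j + b_{g,k})
        gates = torch.sigmoid(self.w_g(c).unsqueeze(0) + self.b_g.unsqueeze(1))
        mask = (self.a != 0).float().unsqueeze(-1)         # restrict the sum to the neighborhood N(k)
        m = torch.relu((mask * gates * messages).sum(dim=1))
        return m                                           # (num_labels, dim) label representations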
In the above technical solution, step S3 further comprises the text representation module based on the label attention mechanism: given the text representation S ∈ R^{n×d_c} from the text encoding layer and the label representation M ∈ R^{C×d_c} from the label encoding layer, where d_c denotes the dimension of the text encoding vectors and is a predetermined fixed value, the label-attention-based text representation is computed as:

α_{kj} = exp(m_k · s_j) / Σ_{j'=1}^{n} exp(m_k · s_{j'})
v_k = Σ_{j=1}^{n} α_{kj} s_j

where α_{kj} represents the amount of information the j-th text feature vector carries for the k-th label, and v_k is the text representation based on label attention.
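In code, this label-wise attention reduces to a softmax over token positions for every label; the following sketch assumes the shapes S ∈ R^{n×d_c} and M ∈ R^{C×d_c} defined above and adds a batch dimension:

import torch

def label_attention(s, m):
    """s: (batch, n, d_c) encoded text; m: (C, d_c) label representations.
    Returns v: (batch, C, d_c), one attended text vector per label."""
    scores = torch.einsum("bnd,cd->bcn", s, m)     # m_k . s_j for every label k and position j
    alpha = torch.softmax(scores, dim=-1)          # alpha_{kj}: weight of position j for label k
    v = torch.einsum("bcn,bnd->bcd", alpha, s)     # v_k = sum_j alpha_{kj} * s_j
    return v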
In the above technical solution, step S3 further comprises the text representation module based on the self-attention mechanism: given the hidden-layer text representation H output by the Bi-GRU of the text encoding layer, the self-attention-based text representation is computed as:

A = softmax(w_2 tanh(w_1 H^T)), where the softmax is taken over the text positions t and α_{kt} is the (k, t)-th entry of A
u_k = Σ_{t=1}^{n} α_{kt} h_t

where w_1 and w_2 are parameters, H is the text representation, α_{kt} is the weight of the t-th vector in the text representation, and u_k is the text representation based on the self-attention mechanism.
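A sketch of this structured self-attention is given below; producing one attention row per label (so that the result can later be fused label-wise with the label-attention output) and the inner size d_a are assumptions of the sketch:

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Self-attention over Bi-GRU states, producing one text vector per label."""
    def __init__(self, hidden_dim, num_labels, d_a=200):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, d_a, bias=False)
        self.w2 = nn.Linear(d_a, num_labels, bias=False)

    def forward(self, h):                              # h: (batch, n, hidden_dim)
        scores = self.w2(torch.tanh(self.w1(h)))       # (batch, n, C)
        alpha = torch.softmax(scores, dim=1)           # alpha_{kt}: weight over positions t
        u = torch.einsum("bnc,bnd->bcd", alpha, h)     # u_k = sum_t alpha_{kt} * h_t
        return u                                       # (batch, C, hidden_dim)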
In the above technical solution, step S3 further comprises the feature fusion module: the text features based on the label attention mechanism and the text features based on the self-attention mechanism are adaptively fused to obtain the final text feature d_{ik-fusion}, computed as:

β_k = σ(w_1 v_k + w_2 u_k)
d_{ik-fusion} = β_k v_k + (1 - β_k) u_k

where w_1 and w_2 are parameters, v_k is the text representation based on label attention, u_k is the text representation based on self-attention, and β_k is the weight given to v_k.
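A minimal sketch of this adaptive fusion, under the scalar-gate reading reconstructed above:

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Gated fusion of label-attention features v and self-attention features u."""
    def __init__(self, dim):
        super().__init__()
        self.w1 = nn.Linear(dim, 1, bias=False)
        self.w2 = nn.Linear(dim, 1, bias=False)

    def forward(self, v, u):                             # v, u: (batch, C, dim)
        beta = torch.sigmoid(self.w1(v) + self.w2(u))    # beta_k: weight given to v_k
        return beta * v + (1.0 - beta) * u               # fused features: (batch, C, dim)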
In the above technical solution, step S3 further comprises the relation network module, which further mines the association information between labels: the fused text features d_{ik-fusion} produced by the feature fusion module are fed into a fully connected layer to obtain the logits vector O = {o_1, o_2, ..., o_C} corresponding to the labels; the vector O is input into the relation network module to obtain the prediction vector y = {y_1, y_2, ..., y_C}; finally, the prediction vector y is input into a multilayer perceptron to obtain the prediction probability of each label. The relation network is in essence a residual network.
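The patent characterizes the relation network only as being, in essence, a residual network over the label logits; the following sketch is one plausible minimal reading of that description (the hidden size is an assumption):

import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    """Residual transformation over the label logits, followed by an MLP predictor."""
    def __init__(self, num_labels, hidden=512):
        super().__init__()
        self.relation = nn.Sequential(
            nn.Linear(num_labels, hidden), nn.ReLU(), nn.Linear(hidden, num_labels)
        )
        self.mlp = nn.Sequential(
            nn.Linear(num_labels, hidden), nn.ReLU(), nn.Linear(hidden, num_labels)
        )

    def forward(self, o):                    # o: (batch, C) label logits from the FC layer
        y = o + self.relation(o)             # residual connection injects label correlations
        return torch.sigmoid(self.mlp(y))    # per-label prediction probabilities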
In the above technical solution, step S4 comprises using a cross-entropy loss function in the training process and training with an Adam optimizer, where the cross-entropy loss of multi-label text classification is:

Loss = - Σ_{i=1}^{N} Σ_{j=1}^{L} [ y_{ij} log(ŷ_{ij}) + (1 - y_{ij}) log(1 - ŷ_{ij}) ]

where y_{ij} is the actual probability of the i-th sample for the j-th label, ŷ_{ij} is the predicted probability of the i-th sample for the j-th label, L is the number of label categories and N is the number of sample texts. The trained deep learning multi-label text classification model is finally obtained.
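For illustration, the multi-label cross-entropy objective and Adam optimizer described here correspond to a training loop of roughly the following form; the learning rate, epoch count and model interface are assumptions:

import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3, device="cpu"):
    """loader yields (word_ids, targets) with multi-hot targets of shape (batch, L)."""
    model.to(device)
    criterion = nn.BCELoss()                       # multi-label cross-entropy over sigmoid outputs
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        total = 0.0
        for words, targets in loader:
            words, targets = words.to(device), targets.to(device).float()
            probs = model(words)                   # (batch, L) predicted label probabilities
            loss = criterion(probs, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch}: loss {total / len(loader):.4f}")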
The invention has the following advantages and beneficial effects:
the invention uses Bi-GRU combined with CNN to extract the semantic representation of the text, and can more fully obtain the local semantic information of the text. The hierarchical information of hierarchical multi-label classification is characterized through the graph neural network, and label representation with hierarchical association information can be obtained. The method and the device use a self-attention mechanism to extract the semantic representation of the text, and can obtain the semantic representation of the global association of the text. The invention uses the self-adaptive fusion of the text features based on the label representation and the text features based on the self-attention representation, and can obtain the text representation of the global, local text and the label information. The invention uses the relation network in the last layer of the model, so that the original label prediction vector can further obtain the label relevance.
The invention comprises four aspects: firstly, extracting a label representation containing a hierarchical relation by using a graph convolutional neural network; secondly, extracting local features by using a plurality of convolutions with different granularities; thirdly, text features are further extracted and adaptively Fused (FA) by using a label-based attention mechanism and a self-attention-based mechanism. Fourthly, the relationship network is used for further extracting the tag relevance. According to the hierarchical multi-label classification method based on mixed attention, the text features of the input text to be classified are extracted, then the text is classified through the multilayer perceptron, one or more labels can be marked on the text, and the method can be widely applied to the fields of E-commerce, news, scientific and technical papers and the like.
Drawings
FIG. 1 is a flow chart of a hierarchical multi-label text classification method based on mixed attention according to the present invention;
FIG. 2 is a network structure diagram of a hierarchical multi-label text classification model based on mixed attention according to the present invention;
FIG. 3 is a schematic diagram of a hierarchical structure of hierarchical multi-label text classification labels according to the present invention;
FIG. 4 is a schematic diagram of the hierarchical multi-label text classification graph convolutional neural network calculation according to the present invention;
FIG. 5 is a diagram of a relational network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows: a hierarchical multi-label text classification method based on mixed attention comprises the following steps:
step S1, preprocessing the multi-label text data in the data set D;
step S2, for the text labels, acquiring the prior hierarchy information of the hierarchical classification system, where the prior hierarchy information refers to the prior probability of dependence between labels and can be obtained by calculating the transition probabilities between parent labels and child labels;
step S3, constructing a deep learning hierarchical multi-label text classification model;
the deep learning multi-label text classification model comprises a word embedding module, a text encoding module, a label encoding module, a text representation module based on a label attention mechanism, a text representation module based on a self-attention mechanism, a feature fusion module, a vector regression layer, a relation network module and a label probability prediction layer;
and step S4, inputting the text data after the data set preprocessing to model training, and after the model training is finished, classifying the multi-label text by using the trained model.
Preferably, step S1 includes the steps of:
the method for preprocessing the data of the samples in the data set D specifically comprises the following steps:
step 1-1, performing word segmentation, stop word removal and punctuation removal on a text in a data set D;
step 1-2, counting word frequency word _ frequency in the text in the data set D, deleting words with the occurrence frequency less than X1, recording the filtered words, and constructing a word list.
1-3, after data set D is preprocessed, dividing it into a training set, a validation set and a test set in a ratio of 3:1:1.
Preferably, step S2 includes the steps of:
for the data of the training set in data set D, assume that a hierarchical path e_{i,j} exists between a parent node v_i and a child node v_j; the feature f(e_{i,j}) of the edge formed by the parent-child path is then represented by the prior probabilities p(U_j|U_i) and p(U_i|U_j):

f(e_{i,j}) = p(U_j | U_i) = P(U_j ∩ U_i) / P(U_i) = N_j / N_i   (parent to child)
f(e_{j,i}) = p(U_i | U_j) = P(U_i ∩ U_j) / P(U_j) = 1           (child to parent)

f(e_{i,j}) expresses the relation between the two nodes, described by their transition probability or co-occurrence probability; the transition probabilities comprise the transition probability p(U_j|U_i) from the parent node to a child node and the transition probability p(U_i|U_j) from the child node to the parent node; in the formula, U_j and U_i denote the text data labeled with node v_j and node v_i respectively, p(U_j|U_i) is the conditional probability of carrying the v_j node label given the v_i node label, P(U_j ∩ U_i) is the probability of carrying {v_j, v_i} simultaneously, and N_j and N_i denote the numbers of v_j node labels and v_i node labels in the training set respectively.
Preferably, step S3 further includes the steps of:
word embedding is performed on the input text and its labels through the word embedding module, specifically:
step 2-1, obtaining the preprocessed text sequence and converting the words {x_1, x_2, ..., x_n} in the text into the word vector representation {w_1, w_2, ..., w_n} by looking them up in a word embedding table (GloVe-300d), where n is the number of words in the preprocessed text;
step 2-2, obtaining the label set {l_1, l_2, ..., l_C} of the hierarchical multi-label text classification and converting it into a 300-dimensional label embedding set {c_1, c_2, ..., c_C} by means of Kaiming initialization.
Preferably, step S3 further includes the steps of:
the word vector representation {w_1, w_2, ..., w_n} is encoded by the text encoding module, specifically:
the word vector representation {w_1, w_2, ..., w_n} of the text is encoded with a Bi-GRU network to generate an implicit representation {h_1, h_2, ..., h_n} carrying contextual semantic information; the implicit representation {h_1, h_2, ..., h_n} is then fed into convolutions with kernel sizes of 2, 3 and 4 and 100 hidden channels each to obtain semantic vectors under three different receptive fields, and after max pooling the three semantic vectors are concatenated into a new 300-dimensional semantic representation vector S = {s_1, s_2, ..., s_n}.
Preferably, step S3 further includes the steps of:
the label vector representation {c_1, c_2, ..., c_C} is encoded by the label encoding module; specifically, a single-layer GCN is used to encode the label vector representation {c_1, c_2, ..., c_C} and generate an implicit representation M = {m_1, m_2, ..., m_C} carrying label hierarchy association information. The implementation process is as follows:
the hierarchical GCN aggregates data flows along top-down, bottom-up and self-loop edges. In the hierarchical GCN, each directed edge represents a pair of label-related features, and these data flows are transformed per node with an edge-wise linear transformation.
To realize the node transformation, the invention represents the linear transformation with a weighted adjacency matrix whose initial values come from the prior hierarchy information of the hierarchical classification system in step S2. Formally, the hierarchical GCN encodes the hidden state of a node k according to its associated neighborhood N(k) = {n_k, child(k), parent(k)}, where n_k refers to the k-th label node in the hierarchical label tree, child(k) refers to the child label nodes of the k-th node and parent(k) refers to the parent label node of the k-th node. The hidden state of node k is computed as:

u_{k,j} = a_{k,j} · (W_l^{d(j,k)} h_j) + b_{l,k}^{d(j,k)}
g_{k,j} = σ(w_g^{d(j,k)} · h_j + b_{g,k}^{d(j,k)})
h_k = ReLU( Σ_{j ∈ N(k)} g_{k,j} · u_{k,j} )

In the above formulas, W_l and w_g are trainable weight parameters, and b_l ∈ R^{C×dim} and b_g ∈ R^C are trainable bias parameters; u_{k,j} can be understood as the information passed from node j to node k, and g_{k,j} as a gate value that controls how much u_{k,j} finally influences node k; σ denotes an activation function of deep learning and can be taken as the sigmoid function; dim is the dimension of the vectors and is a predefined hyperparameter; d(j,k) denotes the hierarchical direction from node j to node k, which may be a top-down, bottom-up or self-loop edge. Here a_{k,j} ∈ R denotes the hierarchical probability f_{d(k,j)}(e_{k,j}), i.e. the transition probability from the k-th node to the j-th label node, obtained from f(e_{i,j}); self-loop edges take a_{k,k} = 1, top-down edges use f_c(e_{k,j}) = p(U_j|U_k), and bottom-up edges use f_p(e_{j,k}) = 1. The feature matrix of these edges, F = {a_{0,0}, a_{0,1}, ..., a_{C-1,C-1}}, is the weighted adjacency matrix of the directed hierarchical graph of text labels. Finally, the output hidden state h_k of node k is its label representation containing the hierarchy information.
Preferably, step S3 further includes the steps of:
the extraction method of the text representation module based on the label attention mechanism is: given the text representation S ∈ R^{n×d_c} from the text encoding layer and the label representation M ∈ R^{C×d_c} from the label encoding layer, the label-attention-based text representation is computed as:

α_{kj} = exp(m_k · s_j) / Σ_{j'=1}^{n} exp(m_k · s_{j'})
v_k = Σ_{j=1}^{n} α_{kj} s_j

where α_{kj} represents the amount of information the j-th text feature vector carries for the k-th label, and v_k is the text representation based on label attention.
Preferably, step S3 further includes the steps of:
the extraction method of the text representation module based on the self-attention mechanism is: given the hidden-layer text representation H output by the Bi-GRU of the text encoding layer, the self-attention-based text representation is computed as:

A = softmax(w_2 tanh(w_1 H^T)), where the softmax is taken over the text positions t and α_{kt} is the (k, t)-th entry of A
u_k = Σ_{t=1}^{n} α_{kt} h_t

where w_1 and w_2 are parameters, H is the text representation, α_{kt} is the weight of the t-th vector in the text representation, and u_k is the text representation based on the self-attention mechanism.
Preferably, step S3 further includes the steps of:
the feature fusion module: the text features based on the label attention mechanism and the text features based on the self-attention mechanism are adaptively fused to obtain the final text feature d_{ik-fusion}, computed as:

β_k = σ(w_1 v_k + w_2 u_k)
d_{ik-fusion} = β_k v_k + (1 - β_k) u_k

where w_1 and w_2 are parameters, v_k is the text representation based on label attention, u_k is the text representation based on self-attention, and β_k is the weight given to v_k.
Preferably, step S3 further includes the steps of:
the relation network module is used to further mine the association information between labels: the fused text features d_{ik-fusion} produced by the feature fusion module are fed into a fully connected layer to obtain the logits vector O = {o_1, o_2, ..., o_C} corresponding to the labels; the vector O is input into the relation network module to obtain the prediction vector y = {y_1, y_2, ..., y_C}; finally, the prediction vector y is input into a multilayer perceptron to obtain the label prediction probabilities. The relation network is in essence a residual network.
Preferably, step S4 includes the steps of:
in the training process, a cross-entropy loss function is used and an Adam optimizer is used for training; the cross-entropy loss of multi-label text classification is:

Loss = - Σ_{i=1}^{N} Σ_{j=1}^{L} [ y_{ij} log(ŷ_{ij}) + (1 - y_{ij}) log(1 - ŷ_{ij}) ]

where y_{ij} is the actual probability of the i-th sample for the j-th label, ŷ_{ij} is the predicted probability of the i-th sample for the j-th label, L is the number of label categories and N is the number of text samples. The trained deep learning multi-label text classification model is finally obtained.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A hierarchical multi-label text classification method based on mixed attention is characterized by comprising the following steps:
step S1, preprocessing the multi-label text data in the data set D;
step S2, for the text labels, acquiring the prior hierarchy information of the hierarchical classification system, where the prior hierarchy information refers to the prior probability of dependence between labels and can be obtained by calculating the transition probabilities between parent labels and child labels;
step S3, constructing a deep learning hierarchical multi-label text classification model;
the deep learning multi-label text classification model comprises a word embedding module, a text encoding module, a label encoding module, a text representation module based on a label attention mechanism, a text representation module based on a self-attention mechanism, a feature fusion module, a vector regression layer, a relation network module and a label probability prediction layer;
and step S4, inputting the text data after the data set preprocessing to model training, and after the model training is finished, classifying the multi-label text by using the trained model.
2. The hierarchical multi-label text classification method based on mixed attention according to claim 1, characterized in that: in step S1, preprocessing the text data in the data set D comprises the following steps:
step 1.1, performing word segmentation, removing stop words and removing punctuation marks;
step 1.2, counting word frequency word _ frequency in the text in the data set D, deleting words with the occurrence frequency less than X1, recording the filtered words, and constructing a word list.
After the data set D is preprocessed, the data set D is divided into a training set, a verification set and a test set according to a certain proportion.
3. The hierarchical multi-label text classification method based on mixed attention according to claim 1, characterized in that: the specific implementation of step S2 comprises:
for the data in data set D, assume that a hierarchical path e_{i,j} exists between a parent node v_i and a child node v_j; the feature f(e_{i,j}) of the edge formed by the parent-child path is then represented by the prior probabilities p(U_j|U_i) and p(U_i|U_j):

f(e_{i,j}) = p(U_j | U_i) = P(U_j ∩ U_i) / P(U_i) = N_j / N_i   (parent to child)
f(e_{j,i}) = p(U_i | U_j) = P(U_i ∩ U_j) / P(U_j) = 1           (child to parent)

f(e_{i,j}) expresses the relation between the two nodes, described by their transition probability or co-occurrence probability; the transition probabilities comprise the transition probability p(U_j|U_i) from the parent node to a child node and the transition probability p(U_i|U_j) from the child node to the parent node; in the formula, U_j and U_i denote the text data labeled with node v_j and node v_i respectively, p(U_j|U_i) is the conditional probability of carrying the v_j node label given the v_i node label, P(U_j ∩ U_i) is the probability of carrying {v_j, v_i} simultaneously, and N_j and N_i denote the numbers of v_j node labels and v_i node labels in the training set respectively.
4. The hierarchical multi-label text classification method based on mixed attention according to claim 3, characterized in that: in step S3, word embedding is performed on the input text and its labels through the word embedding module, specifically:
step 2.1, obtaining the preprocessed text sequence and converting the words {x_1, x_2, ..., x_n} in the text into the word vector representation {w_1, w_2, ..., w_n} by looking them up in the word embedding dictionary, where n is the number of words in the preprocessed text;
step 2.2, obtaining the label set {l_1, l_2, ..., l_C} of the hierarchical multi-label text classification and converting it into a label embedding set {c_1, c_2, ..., c_C} of dimension d by means of Kaiming initialization.
5. The hierarchical multi-label text classification method based on mixed attention according to claim 4, characterized in that: in step S3, the word vector representation {w_1, w_2, ..., w_n} is encoded by the text encoding module, specifically:
the word vector representation {w_1, w_2, ..., w_n} of the text is encoded with a Bi-GRU network to generate an implicit representation {h_1, h_2, ..., h_n} carrying contextual semantic information; the implicit representation {h_1, h_2, ..., h_n} is then fed into three convolutions with different kernel sizes to obtain semantic vectors under three different receptive fields, and the three semantic vectors are finally concatenated into a new semantic representation vector S = {s_1, s_2, ..., s_n};
in step S3, the label vector representation {c_1, c_2, ..., c_C} is encoded by the label encoding module, specifically:
a single-layer GCN is used to encode the label vector representation {c_1, c_2, ..., c_C} and generate an implicit representation M = {m_1, m_2, ..., m_C} carrying label hierarchy association information, implemented as follows:
the hierarchical GCN aggregates data flows along top-down, bottom-up and self-loop edges; in the hierarchical GCN, each directed edge represents a pair of label-related features, and these data flows are transformed per node with an edge-wise linear transformation;
to realize the node transformation, the linear transformation is represented with a weighted adjacency matrix whose initial values come from the prior hierarchy information of the hierarchical classification system in step S2; formally, the hierarchical GCN encodes the hidden state of a node k according to its associated neighborhood N(k) = {n_k, child(k), parent(k)}, where n_k refers to the k-th label node in the hierarchical label tree, child(k) refers to the child label nodes of the k-th node and parent(k) refers to the parent label node of the k-th node; the hidden state of node k is computed as:

u_{k,j} = a_{k,j} · (W_l^{d(j,k)} h_j) + b_{l,k}^{d(j,k)}
g_{k,j} = σ(w_g^{d(j,k)} · h_j + b_{g,k}^{d(j,k)})
h_k = ReLU( Σ_{j ∈ N(k)} g_{k,j} · u_{k,j} )

in the above formulas, W_l and w_g are trainable weight parameters, and b_l ∈ R^{C×dim} and b_g ∈ R^C are trainable bias parameters; u_{k,j} can be understood as the information passed from node j to node k, and g_{k,j} as a gate value that controls how much u_{k,j} finally influences node k; σ denotes an activation function of deep learning and can be taken as the sigmoid function; dim is the dimension of the vectors and is a predefined hyperparameter; d(j,k) denotes the hierarchical direction from node j to node k, which may be a top-down, bottom-up or self-loop edge; a_{k,j} ∈ R denotes the hierarchical probability f_{d(k,j)}(e_{k,j}), i.e. the transition probability from the k-th node to the j-th label node, obtained from f(e_{i,j}); self-loop edges take a_{k,k} = 1, top-down edges use f_c(e_{k,j}) = p(U_j|U_k), and bottom-up edges use f_p(e_{j,k}) = 1; the feature matrix of these edges, F = {a_{0,0}, a_{0,1}, ..., a_{C-1,C-1}}, is the weighted adjacency matrix of the directed hierarchical graph of text labels; finally, the output hidden state h_k of node k is its label representation containing the hierarchy information.
6. The hierarchical multi-label text classification method based on mixed attention according to claim 5, characterized in that: the extraction method of the text representation module based on the label attention mechanism in step S3 is: given the text representation S ∈ R^{n×d_c} from the text encoding layer and the label representation M ∈ R^{C×d_c} from the label encoding layer, where d_c denotes the dimension of the text encoding vectors and is a predetermined fixed value, the label-attention-based text representation is computed as:

α_{kj} = exp(m_k · s_j) / Σ_{j'=1}^{n} exp(m_k · s_{j'})
v_k = Σ_{j=1}^{n} α_{kj} s_j

where α_{kj} represents the amount of information the j-th text feature vector carries for the k-th label, and v_k is the text representation based on label attention.
7. The hierarchical multi-label text classification method based on mixed attention according to claim 6, characterized in that: the extraction method of the text representation module based on the self-attention mechanism in step S3 is: given the hidden-layer text representation H output by the Bi-GRU of the text encoding layer, the self-attention-based text representation is computed as:

A = softmax(w_2 tanh(w_1 H^T)), where the softmax is taken over the text positions t and α_{kt} is the (k, t)-th entry of A
u_k = Σ_{t=1}^{n} α_{kt} h_t

where w_1 and w_2 are parameters, H is the text representation, α_{kt} is the weight of the t-th vector in the text representation, and u_k is the text representation based on the self-attention mechanism.
8. The hierarchical multi-label text classification method based on mixed attention according to claim 7, characterized in that: the feature fusion module in step S3 is: the text features based on the label attention mechanism and the text features based on the self-attention mechanism are adaptively fused to obtain the final text feature d_{ik-fusion}, computed as:

β_k = σ(w_1 v_k + w_2 u_k)
d_{ik-fusion} = β_k v_k + (1 - β_k) u_k

where w_1 and w_2 are parameters, v_k is the text representation based on label attention, u_k is the text representation based on self-attention, and β_k is the weight given to v_k.
9. The hierarchical multi-label text classification method based on mixed attention according to claim 8, characterized in that: the relation network module in step S3 further mines the association information between labels: the fused text features d_{ik-fusion} produced by the feature fusion module are fed into a fully connected layer to obtain the logits vector O = {o_1, o_2, ..., o_C} corresponding to the labels; the vector O is input into the relation network module to obtain the prediction vector y = {y_1, y_2, ..., y_C}; finally, the prediction vector y is input into a multilayer perceptron to obtain the label prediction probabilities, the relation network being in essence a residual network.
10. The hierarchical multi-label text classification method based on mixed attention according to claim 1, characterized in that: in the training process of step S4, a cross-entropy loss function is used and an Adam optimizer is used for training, where the cross-entropy loss of multi-label text classification is:

Loss = - Σ_{i=1}^{N} Σ_{j=1}^{L} [ y_{ij} log(ŷ_{ij}) + (1 - y_{ij}) log(1 - ŷ_{ij}) ]

where y_{ij} is the actual probability of the i-th sample for the j-th label, ŷ_{ij} is the predicted probability of the i-th sample for the j-th label, L is the number of label categories and N is the number of text samples; the trained deep learning multi-label text classification model is finally obtained.
CN202210216140.7A 2022-03-07 2022-03-07 Hierarchical multi-label text classification method based on mixed attention Pending CN114896388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210216140.7A CN114896388A (en) 2022-03-07 2022-03-07 Hierarchical multi-label text classification method based on mixed attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210216140.7A CN114896388A (en) 2022-03-07 2022-03-07 Hierarchical multi-label text classification method based on mixed attention

Publications (1)

Publication Number Publication Date
CN114896388A true CN114896388A (en) 2022-08-12

Family

ID=82714905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210216140.7A Pending CN114896388A (en) 2022-03-07 2022-03-07 Hierarchical multi-label text classification method based on mixed attention

Country Status (1)

Country Link
CN (1) CN114896388A (en)


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115374285A (en) * 2022-10-26 2022-11-22 思创数码科技股份有限公司 Government affair resource catalog theme classification method and system
CN115374285B (en) * 2022-10-26 2023-02-07 思创数码科技股份有限公司 Government affair resource catalog theme classification method and system
CN115757823A (en) * 2022-11-10 2023-03-07 魔方医药科技(苏州)有限公司 Data processing method and device, electronic equipment and storage medium
CN115757823B (en) * 2022-11-10 2024-03-05 魔方医药科技(苏州)有限公司 Data processing method, device, electronic equipment and storage medium
CN116089618A (en) * 2023-04-04 2023-05-09 江西师范大学 Drawing meaning network text classification model integrating ternary loss and label embedding
CN116089618B (en) * 2023-04-04 2023-06-27 江西师范大学 Drawing meaning network text classification model integrating ternary loss and label embedding
CN116187419A (en) * 2023-04-25 2023-05-30 中国科学技术大学 Automatic hierarchical system construction method based on text chunks
CN116187419B (en) * 2023-04-25 2023-08-29 中国科学技术大学 Automatic hierarchical system construction method based on text chunks
CN116304845B (en) * 2023-05-23 2023-08-18 云筑信息科技(成都)有限公司 Hierarchical classification and identification method for building materials
CN116304845A (en) * 2023-05-23 2023-06-23 云筑信息科技(成都)有限公司 Hierarchical classification and identification method for building materials
CN116542252A (en) * 2023-07-07 2023-08-04 北京营加品牌管理有限公司 Financial text checking method and system
CN116542252B (en) * 2023-07-07 2023-09-29 北京营加品牌管理有限公司 Financial text checking method and system
CN116932765A (en) * 2023-09-15 2023-10-24 中汽信息科技(天津)有限公司 Patent text multi-stage classification method and equipment based on graphic neural network
CN116932765B (en) * 2023-09-15 2023-12-08 中汽信息科技(天津)有限公司 Patent text multi-stage classification method and equipment based on graphic neural network
CN117453921A (en) * 2023-12-22 2024-01-26 南京华飞数据技术有限公司 Data information label processing method of large language model
CN117453921B (en) * 2023-12-22 2024-02-23 南京华飞数据技术有限公司 Data information label processing method of large language model

Similar Documents

Publication Publication Date Title
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN111914558B (en) Course knowledge relation extraction method and system based on sentence bag attention remote supervision
CN110020438B (en) Sequence identification based enterprise or organization Chinese name entity disambiguation method and device
CN109783818B (en) Enterprise industry classification method
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
Zhang et al. Aspect-based sentiment analysis for user reviews
CN113516198B (en) Cultural resource text classification method based on memory network and graphic neural network
CN113806547B (en) Deep learning multi-label text classification method based on graph model
CN112749274A (en) Chinese text classification method based on attention mechanism and interference word deletion
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN112163089B (en) High-technology text classification method and system integrating named entity recognition
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN114925205B (en) GCN-GRU text classification method based on contrast learning
CN116304066A (en) Heterogeneous information network node classification method based on prompt learning
CN115952794A (en) Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph
CN112732872A (en) Biomedical text-oriented multi-label classification method based on subject attention mechanism
CN111651597A (en) Multi-source heterogeneous commodity information classification method based on Doc2Vec and convolutional neural network
CN115292490A (en) Analysis algorithm for policy interpretation semantics
CN113590827B (en) Scientific research project text classification device and method based on multiple angles
CN111709225A (en) Event cause and effect relationship judging method and device and computer readable storage medium
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN117787283A (en) Small sample fine granularity text named entity classification method based on prototype comparison learning
CN115795037B (en) Multi-label text classification method based on label perception
CN116756605A (en) ERNIE-CN-GRU-based automatic speech step recognition method, system, equipment and medium
CN116956228A (en) Text mining method for technical transaction platform

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination