CN116069931A - Hierarchical label text classification method, system, device and storage medium

Hierarchical label text classification method, system, device and storage medium

Info

Publication number: CN116069931A
Application number: CN202310108445.0A
Authority: CN (China)
Prior art keywords: vector; label; hierarchical; sequence generation; generation model
Legal status: Pending
Other languages: Chinese (zh)
Inventor: name withheld at the inventor's request
Current Assignee: Chengdu Shuzhilian Technology Co Ltd
Original Assignee: Chengdu Shuzhilian Technology Co Ltd
Application filed by Chengdu Shuzhilian Technology Co Ltd on 2023-02-14; priority date 2023-02-14
Publication date: 2023-05-05

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a hierarchical label text classification method, system, device, and storage medium, and relates to the field of natural language processing. The method comprises the following steps: constructing a sequence generation model based on the hierarchical label classification task and pre-training the sequence generation model; inputting text data into the pre-trained sequence generation model and extracting multi-granularity text feature vectors in a sequence generation manner; encoding the multi-granularity text feature vectors and the predicted previous-level label with an attention mechanism to obtain an encoded vector containing previous-level label information; decoding the encoded vector containing the previous-level label information with a time-series network to predict the next-level label vector; and controlling the generation of all hierarchy labels through a mask operation. The method effectively solves the poor hierarchical label classification performance of traditional hierarchical label classification tasks, which fail to make full use of the relevance between hierarchical labels and of the constraints of coarse-grained labels on fine-grained labels.

Description

Hierarchical label text classification method, system, device and storage medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a hierarchical label text classification method, system, device, and storage medium based on multi-granularity feature extraction and label sequence generation.
Background
Text classification is the most common application scenario in the field of natural language processing. According to the number of labels corresponding to the data, text classification can be divided into single-label classification tasks and multi-label classification tasks. Multi-label classification tasks can further be divided into parallel label classification tasks and hierarchical label classification tasks according to whether the labels have a hierarchical relationship. In practical application scenarios, the labels in most text classification tasks have hierarchical relationships; for example, in document classification tasks, the labels of each document usually take the form of a "catalog" comprising coarse-grained labels and the fine-grained labels corresponding to them. How to improve the performance of hierarchical label text classification is therefore an urgent problem in practical application scenarios.
The conventional hierarchical label text classification task is generally treated as a multi-label classification task and is handled in two main ways. The first is based on the idea of task conversion: a parent label, or a combination of a parent label and its child label, is regarded as a single category, which converts the hierarchical label classification task into a parallel label classification task. This takes the relevance between hierarchical labels into account, but when the number of labels is large it leads to a dimension disaster in the label space. The second is based on the idea of algorithm adaptation: the algorithm is modified to fit the multi-label classification task, the output layer of the neural network performs a binary classification for each label, and all predicted labels are finally combined as the output; reducing hierarchical label classification to parallel label classification in this way completely ignores the relevance between hierarchical labels. In summary, the traditional hierarchical label text classification task does not take full advantage of the relevance between hierarchical labels or of the constraints that potential coarse-grained labels impose on fine-grained labels.
Disclosure of Invention
The embodiments of the invention provide a hierarchical label text classification method, system, device, and storage medium, which effectively solve the poor hierarchical label classification performance of traditional hierarchical label classification tasks caused by insufficient use of the relevance between hierarchical labels and of the constraints of potential coarse-grained labels on fine-grained labels.
In a first aspect, an embodiment of the present invention provides a hierarchical label text classification method, including the steps of:
(1) Constructing a sequence generation model based on the hierarchical label classification task, and pre-training the sequence generation model;
(2) Inputting the text data into a pre-trained sequence generation model, wherein the sequence generation model performs the following processing on the text data:
(2.1) extracting multi-granularity text feature vectors by adopting a sequence generation mode;
(2.2) encoding the multi-granularity text feature vector and the predicted previous-level label by adopting an attention mechanism to obtain an encoded vector containing information of the previous-level label;
(2.3) decoding the encoded vector containing the previous-level tag information using the time-series network to predict the next-level tag vector and iteratively updating the previous-level tag;
(2.4) controlling generation of all hierarchical labels using a masking operation.
In this embodiment, the relevance between labels and the constraints of potential coarse-grained labels on fine-grained labels are fully considered: the hierarchical labels are generated in a sequence generation manner with their label information correlated, so that hierarchical label text classification can be performed efficiently and accurately, and the uncontrollability of the hierarchical label generation task is resolved by controlling label generation through masking.
As some optional embodiments of the present application, the sequence generation model adopts the seq2seq structure and mainly includes an encoder and a decoder, where the encoder converts the text data into a hidden vector containing the features of the text data, and the decoder converts that hidden vector into the corresponding labels.
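For orientation, the end-to-end flow described above can be sketched as a PyTorch-style pseudomodel: a stand-in embedding encoder produces hidden vectors H, cross-attention conditions on the previous-level label, and an LSTM cell predicts each next-level label. All class names, dimensions, and the simple embedding encoder are illustrative assumptions rather than the patent's exact architecture; the masking step of (2.4) is omitted here and illustrated later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierLabelSeq2Seq(nn.Module):
    """Minimal sketch: encoder -> label-conditioned attention -> LSTM decoder."""
    def __init__(self, vocab_size, num_labels, dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)    # stand-in encoder
        self.label_emb = nn.Embedding(num_labels, dim)  # g(y): label vectors
        self.Wk = nn.Linear(dim, dim, bias=False)       # K = W_k H
        self.Wv = nn.Linear(dim, dim, bias=False)       # V = W_v H
        self.Wq = nn.Linear(dim, dim, bias=False)       # q = W_q g(y_{t-1})
        self.decoder = nn.LSTMCell(dim, dim)            # time-series network
        self.Wd = nn.Linear(dim, num_labels)            # o_t = f(W_d l_t)

    def forward(self, token_ids, num_levels):
        H = self.tok_emb(token_ids)                     # (m, dim) keyword features
        K, V = self.Wk(H), self.Wv(H)
        y_prev = torch.zeros(1, dtype=torch.long)       # y_0: initial default label
        h = c = torch.zeros(1, H.size(-1))
        preds = []
        for _ in range(num_levels):
            q = self.Wq(self.label_emb(y_prev))         # index vector, (1, dim)
            a = (q @ K.T) / K.size(-1) ** 0.5           # cross-attention weights
            r = F.softmax(a, dim=-1) @ V                # r_t with previous-label info
            h, c = self.decoder(r, (h, c))              # intermediate vector l_t
            o = torch.tanh(self.Wd(h))                  # prediction vector o_t
            y_prev = o.argmax(-1)                       # next-level label
            preds.append(int(y_prev))
        return preds

model = HierLabelSeq2Seq(vocab_size=1000, num_labels=20)
print(model(torch.randint(0, 1000, (12,)), num_levels=3))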
As some optional embodiments of the present application, the process of pre-training the sequence generation model is as follows:
(1.1) acquiring training data and constructing a tag vocabulary based on the training data;
(1.2) inputting a tag vocabulary into a sequence generation model, the sequence generation model assigning an initial default vector for each tag in the tag vocabulary;
(1.3) inputting the training data into a sequence generation model that converts sentences in the training data into index vectors of keywords.
In the above embodiment, the sequence generation model is pre-trained, that is, the model is configured correspondingly, so that the sequence generation model can quickly and accurately classify the hierarchical labels according to the rule of the hierarchical label classification task.
As some optional embodiments of the present application, the procedure of extracting multi-granularity text feature vectors by adopting a sequence generation manner is as follows:
(2.11) inputting the text data into a plurality of coding layers of an encoder, and coding the text data by the coding layers based on a sequence generation mode so as to obtain text feature vectors of keywords;
and (2.12) summing the text feature vectors of each coding layer to obtain the final coding vector corresponding to each keyword.
As some optional embodiments of the present application, the following is a procedure for encoding a multi-granularity text feature vector and a predicted previous-level label by using an attention mechanism to obtain an encoded vector containing information of the previous-level label:
(2.21) vectorizing the previous-level label obtained by sequence generation model prediction and performing iterative update to obtain a label vector;
(2.22) mapping the character-level encoding vector to obtain a value vector and a key vector, and mapping the label vector to obtain a corresponding index vector;
(2.23) performing cross attention computation based on the index vector and the key vector to obtain an attention weight vector;
(2.24) performing normalization calculation based on the attention weight vector to obtain a normalized weight vector;
(2.25) weighting and summing the value vectors based on the normalized weight vectors to obtain a coded vector containing the tag information of the previous level.
In the above embodiment, features extracted from the multi-granularity information of the text data are fused, and the attention mechanism is used to obtain the characteristic information of the keywords, which enriches the information contained in the model's encoding vector and improves the model's classification effect.
As some optional embodiments of the present application, the process of decoding the encoded vector containing the information of the label of the previous layer by using the time series network to predict the label of the next layer, and performing iterative update on the label of the previous layer is as follows:
(2.31) inputting the encoded vector containing the tag information of the previous hierarchy to a decoder for decoding to obtain an intermediate vector;
(2.32) performing linear transformation on the intermediate vector to obtain a prediction vector of the next-level tag, and iteratively updating the previous-level tag based on the predicted next-level tag until all-level tag predictions are finished.
As some optional embodiments of the present application, the flow of controlling the generation of all hierarchy tags using a masking operation is as follows:
(2.41) masking the partial prediction vector to obtain an indication vector;
(2.42) calculating probability values for all tags based on the prediction vector and the indication vector.
In the above embodiment, to resolve the uncontrollability of the hierarchical label classification task, mask processing is used to control label generation so that hierarchical label classification can be performed accurately.
In a second aspect, the present invention provides a hierarchical label text classification system, comprising: a sequence generation model construction unit that constructs a sequence generation model based on the hierarchical label classification task;
the sequence generation model pre-training unit is used for pre-training the sequence generation model;
the sequence generation model unit is used for inputting the text data into a pre-trained sequence generation model and predicting the hierarchical label of the text data;
wherein the sequence generation model unit includes:
the text feature vector extraction module is used for extracting multi-granularity text feature vectors in a sequence generation mode;
the attention mechanism module adopts an attention mechanism to encode the text feature vector with multiple granularities and the predicted last-level label so as to obtain an encoded vector containing information of the last-level label;
a hierarchical label prediction module that decodes a coded vector containing information of a previous hierarchical label using a time-series network to predict a next hierarchical label and iteratively updates the previous hierarchical label;
and a masking operation control module which controls generation of all hierarchy tags by using masking operation.
In a third aspect, the invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the hierarchical label text classification method.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the hierarchical label text classification method.
The beneficial effects of the invention are as follows:
1. The relevance between labels and the constraints of potential coarse-grained labels on fine-grained labels are fully considered: hierarchical labels are generated in a sequence generation manner with label information correlated, so that hierarchical label text classification can be performed efficiently and accurately, and the uncontrollability of the hierarchical label classification task is resolved by controlling label generation through mask processing.
2. Features extracted from the multi-granularity information of the text data are fused, and the attention mechanism is used to obtain the characteristic information of the keywords, which enriches the information contained in the encoding vector of the sequence generation model and thereby improves its classification effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be regarded as limiting the scope, and that a person skilled in the art may obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a step diagram of a hierarchical label text classification method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the structure of a hierarchical tag according to an embodiment of the present invention;
FIG. 3 is a block diagram of a hierarchical tag sequence generation model according to an embodiment of the present invention;
fig. 4 is a block diagram of a hierarchical tag text classification system according to an embodiment of the invention.
Detailed Description
In order to better understand the above technical solutions, the technical solutions of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the embodiments and the specific features within them are detailed descriptions of the technical solutions of the present invention and not limitations thereof, and the technical features of the embodiments may be combined with each other without conflict.
It should also be appreciated that, in the foregoing description of embodiments of the invention, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding the understanding of one or more embodiments. This method of disclosure, however, is not to be interpreted as implying that the claimed subject matter requires more features than are expressly recited; rather, claimed subject matter may lie in less than all features of a single disclosed embodiment.
Example 1
The invention provides a hierarchical label text classification method that realizes hierarchical label text classification through multi-granularity feature extraction and label sequence generation. Referring to fig. 1, the method comprises the following steps:
step (1): and constructing a sequence generation model based on the hierarchical label classification task, and pre-training the sequence generation model.
In the embodiment of the invention, a sequence generation model is constructed based on the hierarchical label classification task, namely the hierarchical label classification task is converted into a sequence generation task, and the sequence generation task is defined as follows:
P(y | x) = ∏_{t=1}^{n} P(y_t | y_{<t}, x)

where x is text data containing m keywords and y is the set of n labels corresponding to x. Referring to fig. 2, three levels of labels are shown (the number of levels is not limited to three and is set according to the hierarchical label classification task): y_0 denotes the top-level label, y_1 the first-level label, y_2 the second-level label, and so on, so that y_{i-1} denotes the (i-1)-th level label.
Specifically, the hierarchical label classification model based on label sequence generation adopts the seq2seq structure. Referring to fig. 3, the left side is the encoder, which encodes the text data in a sequence generation manner; the right side is the decoder, which predicts each next-level label from the previous-level label (starting from the top-level label) through a time-series network.
Specifically, the process of pre-training the sequence generation model is as follows:
(1.1) acquiring training data, and constructing a tag vocabulary based on the training data.
(1.2) inputting the tag vocabulary into a sequence generation model that assigns an initial default vector to each tag in the tag vocabulary.
(1.3) inputting the training data into a sequence generation model that converts sentences in the training data into index vectors of keywords.
In the implementation of the invention, the sequence generation model is correspondingly configured by pre-training the sequence generation model, so that the sequence generation model can quickly and accurately classify the hierarchical labels according to the rule of the hierarchical label classification task.
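As an illustration of steps (1.1) to (1.3), the following is a minimal sketch of how a label vocabulary and keyword index vectors could be built. The function names and toy data are assumptions made for illustration, not part of the patent.

```python
def build_label_vocab(training_data):
    # (1.1) collect every label appearing in the training data
    labels = sorted({lab for _, label_seq in training_data for lab in label_seq})
    return {lab: idx for idx, lab in enumerate(labels)}

def build_token_vocab(training_data):
    # keyword vocabulary so that sentences can be mapped to index vectors
    tokens = sorted({tok for sent, _ in training_data for tok in sent.split()})
    return {tok: i + 1 for i, tok in enumerate(tokens)}   # 0 = unknown/padding

def to_index_vector(sentence, token_vocab):
    # (1.3) convert a sentence into the index vector of its keywords
    return [token_vocab.get(tok, 0) for tok in sentence.split()]

# (1.2) in a neural model, an embedding table (e.g. nn.Embedding) would then
# assign each label in the vocabulary an initial default vector.
data = [("machine learning text", ["science", "ai"]),
        ("stock market report", ["finance", "stocks"])]
label_vocab = build_label_vocab(data)
token_vocab = build_token_vocab(data)
print(label_vocab)
print(to_index_vector("machine learning report", token_vocab))
```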
Step (2): inputting text data into a pre-trained sequence generation model to perform hierarchical label classification processing, namely generating hierarchical labels in a sequence generation mode by fully considering the relevance among labels and the limitation of potential coarse-granularity labels on fine-granularity labels, and correlating label information, so that hierarchical label text classification can be performed efficiently and accurately; and the uncontrollability in the hierarchy label generating task is solved by controlling the generation of the hierarchy label in a mask mode.
Specifically, the sequence generation model performs hierarchical label classification processing on text data as follows:
(2.1) extracting multi-granularity text feature vectors by adopting a sequence generation mode.
In the embodiment of the invention, the process of extracting the text feature vectors with multiple granularities by adopting a sequence generation mode is as follows:
(2.11) Referring to the Encoder on the left of fig. 3, the text data (Text) is input into the encoder, which encodes it in a sequence generation manner to obtain the text feature vector of each keyword. Specifically, different encodings are produced by several embedding layers (a token embedding layer, a segment embedding layer and a position embedding layer), and the text feature vector of each keyword is formed by adding three parts, namely the character vector, the segment vector and the position vector, defined as follows:

e = e_token + e_segment + e_position

where e_token is the character vector encoded by the sequence generation model and represents character-level semantic information, e_segment is the segment vector encoded by the sequence generation model and is used to distinguish sentence pairs, and e_position is the position vector encoded by the sequence generation model and is used to introduce position information.
(2.12) The text feature vectors of each encoding layer are summed to obtain the final encoding vector corresponding to each keyword.
Specifically, the encoder of the sequence generation model is formed by stacking a plurality of encoding layers (layer_1, layer_2 to layer_n), semantic information contained in each Layer is different, and a final encoding vector is formed by summing text feature vectors of each Layer, wherein the summation formula is as follows:
h_i = Σ_{k=1}^{L} h_{ik}

where L is the number of layers contained in the sequence generation model, h_{ik} is the text feature vector of the i-th keyword at the k-th layer of the sequence generation model, and h_i is the final encoding vector of the i-th keyword.
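The encoding in (2.11) and (2.12) can be sketched as follows, assuming a BERT-style embedding stack and Transformer encoder layers; the layer type, dimensions, and class names are illustrative assumptions, since the patent only specifies the three embeddings and the per-layer summation.

```python
import torch
import torch.nn as nn

class MultiGranularityEncoder(nn.Module):
    def __init__(self, vocab_size, dim=128, num_layers=4, max_len=512):
        super().__init__()
        self.e_token = nn.Embedding(vocab_size, dim)     # character-level semantics
        self.e_segment = nn.Embedding(2, dim)            # distinguishes sentence pairs
        self.e_position = nn.Embedding(max_len, dim)     # introduces position info
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, token_ids, segment_ids):
        # e = e_token + e_segment + e_position, per keyword
        pos = torch.arange(token_ids.size(1)).unsqueeze(0)
        out = self.e_token(token_ids) + self.e_segment(segment_ids) + self.e_position(pos)
        h_sum = torch.zeros_like(out)
        for layer in self.layers:            # each layer carries different semantics
            out = layer(out)
            h_sum = h_sum + out              # h_i = sum over layers of h_{ik}
        return h_sum                         # H: final encoding vectors

enc = MultiGranularityEncoder(vocab_size=1000)
tokens = torch.randint(0, 1000, (1, 12))     # one sentence of 12 keyword indices
segments = torch.zeros(1, 12, dtype=torch.long)
H = enc(tokens, segments)                    # shape (1, 12, 128)
```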
(2.2) The multi-granularity text feature vectors and the predicted previous-level label are encoded with an attention mechanism (Attention) to obtain the encoded vector containing previous-level label information. That is, features extracted from the multi-granularity information of the text data are fused, and the attention mechanism is used to obtain the characteristic information of the keywords, enriching the information contained in the model's encoding vector and improving the model's classification effect.
Specifically, the process of obtaining the encoded vector containing the tag information of the previous hierarchy is as follows:
(2.21) The previous-level label predicted by the sequence generation model is vectorized and iteratively updated to obtain the label vector; that is, the label obtained by the previous prediction is converted into its corresponding vector representation g(y_{t-1}), where g(y_0) is the initial default vector when t = 1.
(2.22) The character-level encoding vector H is mapped to obtain the value vector V and the key vector K, and the label vector is mapped to obtain the index vector q, as follows:

K = W_k H,  V = W_v H,  q = W_q g(y_{t-1})

where W_k, W_v and W_q are preset weight parameters, H = [h_1, h_2, …, h_m], K = [k_1, k_2, …, k_m], and V = [v_1, v_2, …, v_m].
(2.23) Cross-attention is computed based on the index vector q and the key vector K to obtain the attention weight vector a_t:

a_t = (q^T K) / √d

where d is the dimension of the index vector q.
(2.24) A Softmax function is applied to the attention weight vector a_t for normalization, giving the normalized weight vector s_t:

s_t = softmax(a_t)
(2.25) The value vectors V are weighted and summed according to the normalized weight vector s_t to obtain the encoded vector r_t containing the previous-level label information:

r_t = Σ_{i=1}^{m} s_{t,i} · v_i
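The attention computation in (2.21) to (2.25) maps directly onto a few tensor operations. The sketch below implements the formulas above; the function name and dimensions are assumptions, and randomly initialized matrices stand in for the preset weight parameters W_k, W_v and W_q.

```python
import torch
import torch.nn.functional as F

def label_aware_encoding(H, g_y_prev, W_k, W_v, W_q):
    """H: (m, d) keyword encodings; g_y_prev: (d,) previous-level label vector."""
    K = H @ W_k.T                          # key vectors,   (m, d)
    V = H @ W_v.T                          # value vectors, (m, d)
    q = W_q @ g_y_prev                     # index (query) vector, (d,)
    a_t = (K @ q) / q.size(0) ** 0.5       # cross-attention weights a_t, (m,)
    s_t = F.softmax(a_t, dim=0)            # normalized weights s_t
    r_t = s_t @ V                          # r_t = sum_i s_{t,i} * v_i, (d,)
    return r_t                             # encoding with previous-level label info

m, d = 6, 16
H = torch.randn(m, d)
W_k, W_v, W_q = (torch.randn(d, d) for _ in range(3))
r = label_aware_encoding(H, torch.randn(d), W_k, W_v, W_q)
print(r.shape)                             # torch.Size([16])
```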
(2.3) The encoded vector containing the previous-level label information is decoded using the time-series network to predict the next-level label, and the previous-level label is iteratively updated; see the Decoder on the right of fig. 3.
In the embodiment of the present invention, the process of decoding the encoded vector containing the tag information of the previous hierarchy using the time-series network is specifically as follows:
(2.31) The encoded vector r_t containing the previous-level label information is input into the decoder (an LSTM model) for decoding, yielding the intermediate vector l_t:

l_t = LSTM(r_t)
(2.32) The intermediate vector l_t is linearly transformed to obtain the prediction vector o_t of the next-level label:

o_t = f(W_d l_t)

where W_d is a weight parameter and f is a nonlinear activation function. The previous-level label is iteratively updated in this way until the labels of all levels have been predicted.
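A minimal sketch of the decoding loop in (2.31) and (2.32) follows. The LSTM cell, the tanh activation (standing in for the unspecified nonlinearity f), and the toy dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

d, num_labels, num_levels = 16, 10, 3
lstm = nn.LSTMCell(d, d)                 # the time-series network
W_d = nn.Linear(d, num_labels)           # linear transformation to label space

h = c = torch.zeros(1, d)
for t in range(num_levels):
    # in the full model, g(y_prev) would drive the attention that produces r_t;
    # a random vector stands in for that attention output here
    r_t = torch.randn(1, d)
    h, c = lstm(r_t, (h, c))             # intermediate vector l_t
    o_t = torch.tanh(W_d(h))             # prediction vector o_t = f(W_d l_t)
    y_t = o_t.argmax(dim=-1)             # iteratively update the previous-level label
    print(f"level {t + 1} label id:", y_t.item())
```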
(2.4) controlling generation of all hierarchy tags using a MASK operation (MASK).
In the embodiment of the invention, the mask operation control flow is as follows:
(2.41) Part of the labels are masked to obtain the indication vector I_t, according to the following rule:

I_t(j) = 0 if label j lies in Ŷ_t, the admissible label range of the t-th level; I_t(j) = −∞ otherwise

where t ∈ {1, 2, …, N}, N is the number of levels, Ŷ_t denotes the label range of the t-th level, and the label range of the t-th level depends on the value predicted at level t−1.

The calculation of the indication vector I_t is illustrated below for a 3-level label hierarchy.

If the current time step t = 1:

I_1(j) = 0 for j ∈ {Y_1, Y_2, …, Y_{N_1}}, and −∞ otherwise

where Y_{j_1} denotes the first-level label range, j_1 ∈ {1, 2, …, N_1}, and N_1 is the number of first-level labels.

If the current time step t = 2 and the first-level label is Y_2 (i.e. j_1 = 2):

I_2(j) = 0 for j ∈ {Y_{2,1}, Y_{2,2}, …, Y_{2,N_2}}, and −∞ otherwise

where Y_{2,j_2} denotes the second-level label range when the first-level label is Y_2, j_2 ∈ {1, 2, …, N_2}, and N_2 is the number of second-level labels; the number of second-level labels depends on the first-level label.

If the current time step t = 3, the first-level label is Y_2 and the second-level label is Y_{2,1}:

I_3(j) = 0 for j ∈ {Y_{2,1,1}, Y_{2,1,2}, …, Y_{2,1,N_3}}, and −∞ otherwise

where Y_{2,1,j_3} denotes the third-level label range when the first-level label is Y_2 and the second-level label is Y_{2,1}, j_3 ∈ {1, 2, …, N_3}, and N_3 is the number of third-level labels; the number of third-level labels depends on the second-level label.
(2.42) Probability values of all labels are calculated based on the prediction vector and the indication vector, and labels are selected according to these probability values, as follows:

y_t = softmax(o_t + I_t)

That is, a label whose probability value exceeds the threshold is predicted as the next-level label.
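The masking rule above amounts to adding 0 to admissible logits and −∞ to all others before the softmax, so out-of-range labels receive zero probability. The following sketch illustrates this; the parent-to-children table is an assumed toy hierarchy, not data from the patent.

```python
import torch
import torch.nn.functional as F

children = {                 # parent label id -> admissible next-level label ids
    -1: [0, 1],              # -1: virtual root, so level-1 candidates are 0 and 1
    0: [2, 3],
    1: [4, 5],
}

def indication_vector(parent, num_labels):
    I_t = torch.full((num_labels,), float("-inf"))
    I_t[children[parent]] = 0.0          # unmask only labels in the valid range
    return I_t

num_labels = 6
o_t = torch.randn(num_labels)            # prediction vector from the decoder
y_t = F.softmax(o_t + indication_vector(parent=0, num_labels=num_labels), dim=0)
print(y_t)                               # probability mass only on labels 2 and 3
```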
Example 2
The invention provides a hierarchical label text classification system corresponding one-to-one with the method of embodiment 1. Referring to fig. 4, the system includes:
a sequence generation model construction unit which constructs a sequence generation model based on the hierarchical label classification task;
the sequence generation model pre-training unit is used for pre-training the sequence generation model;
the sequence generation model unit is used for inputting the text data into a pre-trained sequence generation model and predicting the hierarchical label of the text data;
wherein the sequence generation model unit includes:
the text feature vector extraction module is used for extracting multi-granularity text feature vectors in a sequence generation mode;
the attention mechanism module adopts an attention mechanism to encode the text feature vector with multiple granularities and the predicted last-level label so as to obtain an encoded vector containing information of the last-level label;
a hierarchical label prediction module that decodes a coded vector containing information of a previous hierarchical label using a time-series network to predict a next hierarchical label and iteratively updates the previous hierarchical label;
and a masking operation control module which controls generation of all hierarchy tags by using masking operation.
Example 3
The invention provides a computer device comprising a memory and a processor, the memory storing a computer program which, when run by the processor, performs a hierarchical label text classification method as described in embodiment 1.
The computer device provided in this embodiment may implement the method described in embodiment 1, and in order to avoid repetition, a description thereof will be omitted.
Example 4
The present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a hierarchical label text classification method as described in embodiment 1.
The computer readable storage medium provided in this embodiment may implement the method described in embodiment 1, and will not be described herein in detail to avoid repetition.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the hierarchical label text classification system of the invention by running or executing the computer program and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart memory card, a secure digital card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid state storage device.
If the hierarchical label text classification system is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer readable storage medium. Based on this understanding, the present invention implements all or part of the flow of the methods of the above embodiments; the steps of each method embodiment described above may also be completed by a computer program stored in a computer readable storage medium which, when executed by a processor, implements those steps. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content of the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction.
Having described the basic concept of the invention, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly stated herein, various modifications, improvements, and adaptations of the present disclosure may occur to those skilled in the art. Such modifications, improvements, and adaptations are suggested within this specification and are therefore intended to fall within the spirit and scope of the exemplary embodiments of the invention.
The computer storage medium may contain a propagated data signal with the computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take on a variety of forms, including electro-magnetic, optical, etc., or any suitable combination thereof. A computer storage medium may be any computer readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated through any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or a combination of any of the foregoing.

Claims (10)

1. A hierarchical label text classification method, characterized in that the method comprises the steps of:
(1) Constructing a sequence generation model based on the hierarchical label classification task, and pre-training the sequence generation model;
(2) Inputting the text data into a pre-trained sequence generation model, wherein the sequence generation model performs the following processing on the text data:
(2.1) extracting multi-granularity text feature vectors by adopting a sequence generation mode;
(2.2) encoding the multi-granularity text feature vector and the predicted previous-level label by adopting an attention mechanism to obtain an encoded vector containing information of the previous-level label;
(2.3) decoding the encoded vector containing the previous-level tag information using the time-series network to predict the next-level tag vector and iteratively updating the previous-level tag;
(2.4) controlling generation of all hierarchical labels using a masking operation.
2. A hierarchical label text classification method in accordance with claim 1, wherein: the sequence generation model adopts the structure of seq2seq, including an encoder and a decoder.
3. A hierarchical label text classification method in accordance with claim 1, wherein: the process of pre-training the sequence generation model is as follows:
(1.1) acquiring training data and constructing a tag vocabulary based on the training data;
(1.2) inputting a tag vocabulary into a sequence generation model, the sequence generation model assigning an initial default vector for each tag in the tag vocabulary;
(1.3) inputting the training data into a sequence generation model that converts sentences in the training data into index vectors of keywords.
4. The hierarchical label text classification method according to claim 2, wherein the process of extracting multi-granularity text feature vectors by means of sequence generation is as follows:
(2.11) inputting the text data into a plurality of coding layers of an encoder, and coding the text data by the coding layers based on a sequence generation mode so as to obtain text feature vectors of keywords;
and (2.12) summing the text feature vectors of each coding layer to obtain the final coding vector corresponding to each keyword.
5. The hierarchical label text classification method according to claim 4, wherein the process of encoding the multi-granularity text feature vector and the predicted previous hierarchical label using the attention mechanism to obtain the encoded vector containing the previous hierarchical label information is as follows:
(2.21) vectorizing the previous-level label obtained by sequence generation model prediction and performing iterative update to obtain a label vector;
(2.22) mapping the character-level encoding vector to obtain a value vector and a key vector, and mapping the label vector to obtain a corresponding index vector;
(2.23) performing cross attention computation based on the index vector and the key vector to obtain an attention weight vector;
(2.24) performing normalization calculation based on the attention weight vector to obtain a normalized weight vector;
(2.25) weighting and summing the value vectors based on the normalized weight vectors to obtain a coded vector containing the tag information of the previous level.
6. The hierarchical label text classification method according to claim 2, wherein the encoding vector containing the information of the previous hierarchical label is decoded by using a time-series network to predict the next hierarchical label, and the iterative update of the previous hierarchical label is performed as follows:
(2.31) inputting the encoded vector containing the tag information of the previous hierarchy to a decoder for decoding to obtain an intermediate vector;
(2.32) performing linear transformation on the intermediate vector to obtain a prediction vector of the label of the next level, and iteratively updating the label of the previous level until all label predictions are finished.
7. The hierarchical label text classification method according to claim 6, wherein the flow of controlling the generation of all hierarchical labels by masking operation is as follows:
(2.41) masking the partial prediction vector to obtain an indication vector;
(2.42) calculating probability values for all tags based on the prediction vector and the indication vector.
8. A hierarchical tag text classification system, the system comprising:
a sequence generation model construction unit which constructs a sequence generation model based on the hierarchical label classification task;
the sequence generation model pre-training unit is used for pre-training the sequence generation model;
the sequence generation model unit is used for inputting the text data into a pre-trained sequence generation model and predicting the hierarchical label of the text data;
wherein the sequence generation model unit includes:
the text feature vector extraction module is used for extracting multi-granularity text feature vectors in a sequence generation mode;
the attention mechanism module adopts an attention mechanism to encode the text feature vector with multiple granularities and the predicted last-level label so as to obtain an encoded vector containing information of the last-level label;
a hierarchical label prediction module that decodes a coded vector containing information of a previous hierarchical label using a time-series network to predict a next hierarchical label and iteratively updates the previous hierarchical label;
and a masking operation control module which controls generation of all hierarchy tags by using masking operation.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized by: the processor, when executing the computer program, implements a hierarchical label text classification method as claimed in any one of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements a hierarchical label text classification method according to any of claims 1-7.
CN202310108445.0A (priority date 2023-02-14; filing date 2023-02-14): Hierarchical label text classification method, system, device and storage medium. Status: Pending. Publication: CN116069931A (en).

Priority Applications (1)

CN202310108445.0A (priority date 2023-02-14; filing date 2023-02-14): Hierarchical label text classification method, system, device and storage medium

Publications (1)

CN116069931A, published 2023-05-05

Family

ID: 86173059

Family Applications (1)

CN202310108445.0A (priority date 2023-02-14; filing date 2023-02-14): Hierarchical label text classification method, system, device and storage medium. Status: Pending.

Country Status (1)

CN: CN116069931A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304845A (en) * 2023-05-23 2023-06-23 云筑信息科技(成都)有限公司 Hierarchical classification and identification method for building materials
CN116304845B (en) * 2023-05-23 2023-08-18 云筑信息科技(成都)有限公司 Hierarchical classification and identification method for building materials
CN116543789A (en) * 2023-07-06 2023-08-04 中国电信股份有限公司 Equipment abnormality identification method, device, equipment and medium
CN116543789B (en) * 2023-07-06 2023-09-29 中国电信股份有限公司 Equipment abnormality identification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination