CN112860889A - BERT-based multi-label classification method


Info

Publication number
CN112860889A
CN112860889A (application CN202110121995.7A)
Authority
CN
China
Prior art keywords
sentence
label
text data
bert
tag
Prior art date
Legal status
Pending
Application number
CN202110121995.7A
Other languages
Chinese (zh)
Inventor
郑文
张和伟
邓丽平
侯凡
Current Assignee
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date
Filing date
Publication date
Application filed by Taiyuan University of Technology
Priority to CN202110121995.7A
Publication of CN112860889A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a BERT-based multi-label classification method, in particular a BERT-based sentence-pair classification task that decides whether a sentence should be marked with a label by judging the contextual relationship between the sentence and the label. The method comprises a data preprocessing module, a BERT fine-tuning module and a classifier module. The invention pairs each sentence in the text with every label to form sentence pairs and, exploiting the fact that the BERT model performs strongly on sentence-pair classification tasks across many fields, obtains sentence vectors for the sentence and the label that carry rich contextual semantic information. Finally, the obtained sentence vector is passed to the classifier module to obtain the semantic relationship between the sentence and the label, so as to predict whether the sentence should be marked with that label. The method can greatly reduce the amount of data required for training while still guaranteeing good results.

Description

BERT-based multi-label classification method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a BERT-based multi-label classification method.
Background
Nowadays, the world is in the third wave of artificial intelligence. Fields of every kind generate all sorts of data and urgently need machine learning methods in order to achieve intelligentization, informatization and industrial upgrading. To extract the rich information contained in this data, the traditional approach of manually summarizing, analysing and classifying data is being replaced by machine learning methods, which are already widely used in the internet field. Meanwhile, many traditional industries are even more eager to accelerate their upgrading through machine learning. To learn the large amount of information in data more efficiently, the various branches of machine learning have developed rapidly in recent years, with ever deeper research content and an ever wider research scope. As one of the important research directions in machine learning, the classification problem has high application value and has received wide attention from a large number of researchers and practitioners in many fields.
In the real world, data accumulation is usually a long-term collection process, and the objects of a classification task often belong to multiple categories at once, i.e. they are associated with multiple labels. Even in the early days of applying machine learning methods, data with multiple labels was already the more common situation. In recent years, research on the multi-label learning problem has attracted wide attention and has become a popular research direction in machine learning. The application scenario of a conventional classification learning method is usually set up as a single-label classification problem, in which each instance is associated with only one label describing its attribute characteristics. In the real world, however, an instance typically has a whole set of labels associated with it. For example, when searching a paper database, a single-label classification can be realized by searching on paper titles alone, but this makes paper retrieval inconvenient. In practice, retrieval is usually performed by keywords, and a paper often contains several keywords. When text is classified according to multiple keywords in this way, traditional single-label supervised learning is no longer fully suitable for the multi-label classification task. This highlights the importance of the multi-label classification problem, which better matches real life.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a BERT-based multi-label classification method aiming at the defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows: constructing a BERT-based multi-label classification method, comprising:
selecting a CAIL-2019 data set as a corpus, combining all text data in the data set with different labels, and marking new labels for sentence combinations according to a label list of sentences;
performing word segmentation on the processed text data, connecting a [CLS] mark at the beginning of the sentence of each text data, and adding an [SEP] mark between the sentence and the label;
vectorizing the text data after word segmentation, and representing each word in the input text data by using a pre-trained word feature vector to obtain a vector of the text data after word segmentation;
extracting the feature word vectors of the text data and the feature word vectors of the labels after word segmentation, and obtaining semantically fused sentence vectors by utilizing self-attention operation;
and inputting the sentence vector into a feedforward neural network model, and predicting the relation of the sentences through the output result of the model.
Wherein, in the step of combining all text data in the data set with different labels, each sentence is combined with each label once to form a sentence pair.
In the step of tokenizing the processed text data, a sequence is obtained by splicing with the predefined symbols [CLS] and [SEP]; the spliced sequence is "[CLS] sentence [SEP] label [SEP]", where [CLS] is the semantic symbol prepended to the input text sequence and [SEP] is the separator between the sentence and the label.
The sentence vector is used to predict the sentence-label relationship through a feedforward neural network, i.e. the probability that the sample y is marked by the label L is computed, where θ represents the model parameters; the network finally outputs a two-dimensional vector V = [v1, v2], in which vi represents the conditional probability under label L.
The obtained two-dimensional vector is normalized, and the final result is obtained by applying an indicator function I to the normalized probabilities, where k1 represents the probability of class 1 and k2 represents the probability of class 2; the sentence pair is assigned the class with the larger probability.
Compared with the prior art, the beneficial effects of the invention mainly include the following aspects:
First, the BERT model used in this method was pre-trained by Google on a large Wikipedia text corpus when it was released, so compared with other models the pre-training step can be omitted and the complex workload it entails is reduced;
Second, the BERT model performs markedly better on sentence-pair tasks than on single-sentence tasks, and taking the semantic information of the labels into account makes full use of this characteristic.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
fig. 1 is a schematic diagram of an overall framework of a BERT-based multi-label classification method provided by the present invention.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
As shown in fig. 1, the present invention designs a BERT-based multi-label classification method, which includes:
selecting a CAIL-2019 data set as a corpus, combining all text data in the data set with different labels, and marking new labels for sentence combinations according to a label list of sentences;
and selecting a CAIL-2019 data set as a corpus, forming a sentence combination for each sentence and all the labels, and marking new labels for the sentence combination according to the label list of the sentences. If a certain label exists in the list, the sentence pair formed by the sentence and the label is marked as 1, and if the label does not exist, the sentence pair is marked as 0;
performing word segmentation on the processed text data, connecting a [CLS] mark at the beginning of the sentence of each text data, and adding an [SEP] mark between the sentence and the label;
sequences spliced by predefined symbols [ CLS ] and [ SEP ]; wherein, the spliced sequence is a [ CLS ] original sentence sequence [ SEP ] auxiliary sentence sequence [ SEP ] ", [ CLS ] is a semantic symbol of an input text sequence, and [ SEP ] is a segmentation symbol of a problem sequence and a text segment sequence;
vectorizing the text data after word segmentation, and representing each word in the input text data by using a pre-trained word feature vector to obtain a vector of the text data after word segmentation;
extracting the feature word vectors of the sentence and the feature word vectors of the label from the vectors of the tokenized text data, and obtaining a semantically fused sentence vector by means of self-attention operations;
and inputting the sentence vector into a feedforward neural network model, and predicting the relation of the sentences through the output result of the model.
Wherein, in the step of combining all text data in the data set with different labels, each sentence is combined with each label once to form a sentence pair.
The sentence vector is used to predict the sentence-label relationship through a feedforward neural network, i.e. the probability that the sample y is marked by the label L is computed, where θ represents the model parameters; the network finally outputs a two-dimensional vector V = [v1, v2], in which vi represents the conditional probability under label L.
The obtained two-dimensional vector is normalized, and the final result is obtained by applying an indicator function I to the normalized probabilities, where k1 represents the probability of class 1 and k2 represents the probability of class 2; the sentence pair is assigned the class with the larger probability.
The loss is calculated using a cross entropy loss function and the parameters are updated.
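A minimal PyTorch-style sketch of the fine-tuned BERT encoder, the feedforward head with its two-dimensional output, and one cross-entropy training step is given below; the hidden layer size, optimizer, learning rate and the use of BertModel are assumptions made for illustration, not details fixed by the patent:

```python
# Sketch: BERT sentence-pair encoder + feedforward head (outputs V = [v1, v2]),
# trained with cross-entropy. Architectural details here are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class SentenceLabelClassifier(nn.Module):
    def __init__(self, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.ffn = nn.Sequential(                        # feedforward head
            nn.Linear(self.bert.config.hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 2),                           # two-dimensional output V = [v1, v2]
        )

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls_vec = out.last_hidden_state[:, 0]            # semantically fused sentence vector ([CLS])
        return self.ffn(cls_vec)

model = SentenceLabelClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

def train_step(batch):
    """One update over a batch of encoded sentence-label pairs with 0/1 pair labels."""
    logits = model(batch["input_ids"], batch["attention_mask"], batch["token_type_ids"])
    loss = criterion(logits, batch["pair_label"])        # cross-entropy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    probs = torch.softmax(logits, dim=-1)                # normalized [k1, k2]
    preds = probs.argmax(dim=-1)                         # 1 -> the sentence is marked by the label
    return loss.item(), preds
```

At inference time the same forward pass is run for a sentence paired with every label, and the labels whose pairs are predicted as 1 together form the sentence's multi-label set.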
The present invention uses the CAIL-2019 data set, selected from the legal judgment documents published on China Judgements Online. Each line of data represents the sentence-segmentation result of a paragraph extracted from one judgment document, together with the element label list of each sentence. The judgment documents cover three fields, marriage and family, labor disputes and loan disputes, and comprise 2740 documents in total: 1269 on marriage and family, 836 on labor disputes and 635 on loan disputes. The data were annotated by professionals with a legal background, and each of the three fields has 20 element labels together with the Chinese semantics they represent.
The problem studied by the invention is a classification problem. Common evaluation indexes for classification include the precision P = TP / (TP + FP), the recall R = TP / (TP + FN) and the F1 value F1 = 2 * P * R / (P + R). The F1 values include the micro-average F1 value (Micro_F1), computed from the TP, FP and FN counts summed over all labels, and the macro-average F1 value (Macro_F1), the average of the per-label F1 values. A confusion matrix is needed for the calculation: True Positive (TP) means a positive class is predicted as a positive class; True Negative (TN) means a negative class is predicted as a negative class; False Positive (FP) means a negative class is predicted as a positive class; False Negative (FN) means a positive class is predicted as a negative class. The performance of the model is evaluated with a Score computed from Micro_F1 and Macro_F1.
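For illustration only, these pair-level metrics could be computed with scikit-learn; the library and the averaging of Micro_F1 and Macro_F1 into a single Score are assumptions, since the patent does not give the exact Score formula:

```python
# Sketch: precision, recall and micro/macro F1 over binary sentence-label pair predictions.
# scikit-learn is an assumed tooling choice; the Score combination shown is also an assumption.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0]   # gold pair labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1]   # predicted pair labels (illustrative)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
score = (micro_f1 + macro_f1) / 2   # one common way to combine the two F1 values
print(precision, recall, micro_f1, macro_f1, score)
```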
the invention carries out comparison experiments on the proposed multi-label classification method, a Support Vector Machine (SVM), a TextCNN algorithm and a BERT-based multi-label classification method. The results of the experiment are shown in table 1:
Figure BDA0002922489020000058
TABLE 1 comparative experimental results of different multi-label classification methods
From experimental results, the model provided by the invention has the best effect among the three models.
According to the invention, the BERT model is used to construct a sentence-pair classification task and is fine-tuned, which improves the classification effect, allows multiple labels to be assigned to a sentence, improves model efficiency and reduces the related redundant workload.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (4)

1. A BERT-based multi-label classification method is characterized by comprising the following steps:
selecting a CAIL-2019 data set as a corpus, combining all text data in the data set with different labels, and marking new labels for sentence combinations according to a label list of sentences;
performing word segmentation on the processed text data, connecting a [CLS] mark at the beginning of the sentence of each text data, and adding an [SEP] mark between the sentence and the label;
vectorizing the text data after word segmentation, and representing each word in the input text data by using a pre-trained word feature vector to obtain a vector of the text data after word segmentation;
extracting the feature word vectors of the text data and the feature word vectors of the labels after word segmentation, and obtaining semantically fused sentence vectors by utilizing self-attention operation;
and inputting the sentence vector into a feedforward neural network model, and predicting the relation of the sentences through the output result of the model.
2. The BERT-based multi-label classification method of claim 1, wherein, in the step of combining all text data in the dataset with different labels, each sentence in the sentence pair is combined with each label once.
3. The BERT-based multi-label classification method of claim 1, wherein, in the step of tokenizing the processed text data, the sequence is spliced with the predefined symbols [CLS] and [SEP]; the spliced sequence is "[CLS] sentence [SEP] label [SEP]", where [CLS] is the semantic symbol of the input text sequence and [SEP] is the separator between the sentence and the label.
4. The BERT-based multi-label classification method of claim 1, wherein the sentence vector is used to predict the sentence-label relationship through a feedforward neural network, i.e. the probability that the sample y is marked by the label L is computed;
where θ represents the model parameters, and the network finally outputs a two-dimensional vector V = [v1, v2], in which vi represents the conditional probability under label L;
the obtained two-dimensional vector is normalized, and the final result is obtained by using an indicator function I, where k1 represents the probability of class 1 and k2 represents the probability of class 2.
CN202110121995.7A 2021-01-29 2021-01-29 BERT-based multi-label classification method Pending CN112860889A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110121995.7A CN112860889A (en) 2021-01-29 2021-01-29 BERT-based multi-label classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110121995.7A CN112860889A (en) 2021-01-29 2021-01-29 BERT-based multi-label classification method

Publications (1)

Publication Number Publication Date
CN112860889A true CN112860889A (en) 2021-05-28

Family

ID=75987942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110121995.7A Pending CN112860889A (en) 2021-01-29 2021-01-29 BERT-based multi-label classification method

Country Status (1)

Country Link
CN (1) CN112860889A (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105404632A (en) * 2014-09-15 2016-03-16 深港产学研基地 Deep neural network based biomedical text serialization labeling system and method
CN104615767A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Searching-ranking model training method and device and search processing method
WO2019052561A1 (en) * 2017-09-18 2019-03-21 同方威视技术股份有限公司 Check method and check device, and computer-readable medium
CN107748783A (en) * 2017-10-24 2018-03-02 天津大学 A kind of multi-tag company based on sentence vector describes file classification method
CN107798624A (en) * 2017-10-30 2018-03-13 北京航空航天大学 A kind of technical label in software Ask-Answer Community recommends method
CN108334499A (en) * 2018-02-08 2018-07-27 海南云江科技有限公司 A kind of text label tagging equipment, method and computing device
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning
CN110263150A (en) * 2019-03-05 2019-09-20 腾讯科技(深圳)有限公司 Document creation method, device, computer equipment and storage medium
CN110096572A (en) * 2019-04-12 2019-08-06 平安普惠企业管理有限公司 A kind of sample generating method, device and computer-readable medium
CN110209822A (en) * 2019-06-11 2019-09-06 中译语通科技股份有限公司 Sphere of learning data dependence prediction technique based on deep learning, computer
CN110210037A (en) * 2019-06-12 2019-09-06 四川大学 Category detection method towards evidence-based medicine EBM field
CN110334080A (en) * 2019-06-26 2019-10-15 广州探迹科技有限公司 A kind of construction of knowledge base method for realizing autonomous learning
CN110413999A (en) * 2019-07-17 2019-11-05 新华三大数据技术有限公司 Entity relation extraction method, model training method and relevant apparatus
CN111104802A (en) * 2019-12-11 2020-05-05 中国平安财产保险股份有限公司 Method for extracting address information text and related equipment
CN111209399A (en) * 2020-01-02 2020-05-29 联想(北京)有限公司 Text classification method and device and electronic equipment
CN111368079A (en) * 2020-02-28 2020-07-03 腾讯科技(深圳)有限公司 Text classification method, model training method, device and storage medium
CN111339260A (en) * 2020-03-02 2020-06-26 北京理工大学 BERT and QA thought-based fine-grained emotion analysis method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
李洋 et al.: "Text sentiment analysis based on feature fusion of CNN and BiLSTM networks", Journal of Computer Applications (《计算机应用》) *
王廷银 et al.: "Emergency communication method for nuclear radiation monitoring based on BeiDou RDSS", Computer Systems & Applications (《计算机系统应用》) *
青晨 et al.: "Research progress on image semantic segmentation with deep convolutional neural networks", Journal of Image and Graphics (《中国图象图形学报》) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091472A (en) * 2022-01-20 2022-02-25 北京零点远景网络科技有限公司 Training method of multi-label classification model
CN115358206A (en) * 2022-10-19 2022-11-18 上海浦东华宇信息技术有限公司 Text typesetting method and system
CN115358206B (en) * 2022-10-19 2023-03-24 上海浦东华宇信息技术有限公司 Text typesetting method and system
CN115470354A (en) * 2022-11-03 2022-12-13 杭州实在智能科技有限公司 Method and system for identifying nested and overlapped risk points based on multi-label classification
CN115470354B (en) * 2022-11-03 2023-08-22 杭州实在智能科技有限公司 Method and system for identifying nested and overlapped risk points based on multi-label classification
CN116304064A (en) * 2023-05-22 2023-06-23 中电云脑(天津)科技有限公司 Text classification method based on extraction

Similar Documents

Publication Publication Date Title
Devika et al. Sentiment analysis: a comparative study on different approaches
CN112860889A (en) BERT-based multi-label classification method
CN112231447B (en) Method and system for extracting Chinese document events
CN110222160A (en) Intelligent semantic document recommendation method, device and computer readable storage medium
CN108090070B (en) Chinese entity attribute extraction method
CN109002473B (en) Emotion analysis method based on word vectors and parts of speech
CN110362819B (en) Text emotion analysis method based on convolutional neural network
CN111104510B (en) Text classification training sample expansion method based on word embedding
CN107168956B (en) Chinese chapter structure analysis method and system based on pipeline
Nasim et al. Sentiment analysis on Urdu tweets using Markov chains
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
CN112199501A (en) Scientific and technological information text classification method
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN114970523B (en) Topic prompting type keyword extraction method based on text semantic enhancement
CN114416979A (en) Text query method, text query equipment and storage medium
CN114936277A (en) Similarity problem matching method and user similarity problem matching system
CN110110087A (en) A kind of Feature Engineering method for Law Text classification based on two classifiers
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN114491024A (en) Small sample-based specific field multi-label text classification method
CN113590827B (en) Scientific research project text classification device and method based on multiple angles
Dhar et al. Bengali news headline categorization using optimized machine learning pipeline
CN113987175A (en) Text multi-label classification method based on enhanced representation of medical topic word list
CN111368532B (en) Topic word embedding disambiguation method and system based on LDA
CN110888983B (en) Positive and negative emotion analysis method, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210528