CN112347766A - Multi-label classification method for processing microblog text cognitive distortion - Google Patents


Info

Publication number
CN112347766A
Authority
CN
China
Prior art keywords
text
model
label
classification method
training
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011351175.9A
Other languages
Chinese (zh)
Inventor
刘丰玮
李娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing University of Technology
Priority to CN202011351175.9A
Publication of CN112347766A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a multi-label classification method for processing microblog text cognitive distortion. The method is a text classification method based on the fusion of BERT (Bidirectional Encoder Representations from Transformers), LSTM (Long Short-Term Memory) and an Attention mechanism: text preprocessing is performed on a plurality of Chinese corpora in a Chinese corpus data set to obtain a plurality of sequences corresponding to the Chinese corpora; the word embeddings of each sequence are extracted with a BERT model; feature extraction is performed on each sequence with LSTM and Attention to obtain the deep semantic features of the text corresponding to each sequence; and the model is trained and tested by classifying the obtained deep semantic features with a softmax classifier, realizing text classification that captures context information in the true sense. The method takes context information into account, avoids the problem of weakened historical memory caused by long sequences, and can effectively improve the classification effect.

Description

Multi-label classification method for processing microblog text cognitive distortion
Technical Field
The present invention belongs to the field of computer natural language processing and cognitive distortion analysis in depression. It mainly involves a multi-label deep learning text classification method based on BERT, LSTM and an attention mechanism.
Background
With the development of social networking platforms, more and more depression patients, especially young people, use microblogs as one of the ways to express negative emotions and suicidal ideation. In recent years, more and more such "tree holes" have appeared on microblog platforms. This vast amount of data contains the thoughts, emotions and even behaviors of depression patients in daily life. Studies have shown that depression patients tend to reveal depressive or suicidal intent in their daily speech and microblog posts. This phenomenon provides a new approach to the cognitive characteristic analysis of people with suspected depression, namely realizing cognitive characteristic analysis by means of microblog tree-hole data. At present, identifying and labeling cognitive distortion in microblog comment texts requires careful analysis, so it is mainly performed manually by experienced professionals. However, manual labeling is easily affected by highly subjective factors such as the annotators' mood and degree of fatigue, so the labeling results of different annotators often deviate from one another to a certain degree; the results lack stability and a large amount of labor is wasted.
So far, research on the causes of depression is not mature. Depression is strongly related to patients' physiology and psychology, and the surrounding environment also has a great influence; researchers in this field have done a great deal of work at home and abroad on depression identification, treatment and related aspects. The precondition and key to the treatment of depression lies in early identification and cognitive analysis: if the cognition of a suspected depression patient can be analyzed quickly and effectively under conditions that are relatively safe and do not intrude much on privacy, corresponding treatment measures can be applied to the patient as soon as possible. At present, however, cognitive analysis and prevention for depression patients in China are still at a relatively early stage. Traditional cognitive analysis mainly relies on authoritative diagnostic criteria and psychological scales. The well-known Hamilton Depression Rating Scale is often used by physicians to diagnose depressive conditions; it scores the patient's mood, suicidal tendency, sleep state and so on, with higher scores indicating more severe depression.
With the development of social networks, microblog text classification has become an important research topic in natural language processing and has attracted close attention at home and abroad. Text classification technology has mainly gone through three stages of development, from statistical language models to shallow machine learning and then to deep learning, and feature selection, an important link in classification, has likewise moved from manual extraction to semi-manual extraction and then to automatic extraction. Research in this field at home and abroad is now relatively mature, and most text classification algorithms are either shallow machine learning algorithms or deep learning algorithms; common examples of the two types are SVM, KNN, RNN and CNN.
For the training of text classifiers, a paper from the Federated Conference on Computer Science and Information Systems converts words into vectors using Word2vec and then classifies sets of wiki articles representing 7 topics with LSTM-based networks, comparing the classification effect with convolutional neural networks (CNNs) and achieving significantly higher accuracy than CNNs. Lichao et al. introduced a new fusion classification learning framework (LSTM-MFCNN) that synthesizes the respective strengths of the CNN and LSTM models, thereby enhancing the expression of word-order semantics and the mining of features. Another study pointed out the defects of the plain recurrent neural network (RNN) model and introduced LSTM and GRU on that basis, constructing a bidirectional RNN model: word vectors are represented with the Skip-gram model of Word2vec, LSTM and GRU are each combined with the constructed bidirectional RNN to extract features, and the features of the text data are finally classified with a softmax classifier. Zhang D et al. proposed a classification method based on Word2vec and SVMperf: first, Word2vec is used to cluster similar features, verifying Word2vec's ability to extract and mine semantic features from specific data in different subject fields and Chinese corpora; then the existing comment texts are vectorized and classified with the word2vec and SVMperf methods. Another method combines the advantages of CNN and RNN, providing C-LSTM, a new unified sentence representation and text classification model. Yang S et al. proposed a multi-layer neural network model (BiLSTM-CNN) built from the LSTM and CNN models, in which Bi-LSTM processes the text data and CNN processes the output of the upper layer to obtain the classification results; experimental comparison shows that this model clearly outperforms single-layer network models and traditional statistical methods. Wang JH et al. studied the effect of word embedding and LSTM on text classification and then combined LSTM and word embedding to improve the short text classification algorithm. Nowak J et al. proposed bidirectional LSTM as an improvement of LSTM and, compared with bag-of-words based deep learning models for multi-label short text classification, obtained clearly better results.
Disclosure of Invention
Based on the above analysis, the invention designs a deep learning multi-label text classification method for performing cognitive distortion analysis on texts. Text semantic representation methods have developed from the initial One-Hot representation to the current mainstream neural-network-based methods such as Word2Vec and GloVe; although these solve the problem of word context relationships to a certain extent, they still do not solve the polysemy problem whereby a word has different meanings in different contexts. The method uses BERT as the language feature extraction and representation method, which not only obtains the rich grammatical and semantic features of microblog comment text, but also overcomes the problem that traditional neural-network-based language feature representations ignore word polysemy.
Firstly, microblog comment content is crawled, relevant experts are asked to label the corresponding cognitive distortions, and the data is then further processed into the format required by the code. The data set is then divided into a training set, a validation set and a test set; the key point is that the corresponding labels are distributed uniformly across the three data sets, after which the three data sets are shuffled.
The data then flows into BERT. BERT, proposed in 2018 by Devlin et al. of the Google team, is applicable to a variety of natural language processing tasks. BERT adopts the Transformer language model; the Transformer is an Encoder-Decoder structure that abandons recursion and uses the attention mechanism to mine the relationship between input and output. The structure of the Transformer model is shown in FIG. 1, and the structure of the BERT model is shown in FIG. 2.
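For illustration, the following is a minimal sketch of extracting token-level word embeddings with a pre-trained BERT model in Python, assuming the Hugging Face transformers library and the public bert-base-chinese checkpoint; the patent does not name a specific library or checkpoint, and the example comment is invented.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

text = "今天又失眠了，感觉一切都没有意义"  # an invented example microblog comment
inputs = tokenizer(text, padding="max_length", truncation=True,
                   max_length=128, return_tensors="pt")  # seq_len = 128
with torch.no_grad():
    outputs = bert(**inputs)

# Token-level embeddings of shape [batch_size, seq_len, bert_dim] (768 for
# BERT-base); the downstream dropout, BiLSTM and attention layers consume this.
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # torch.Size([1, 128, 768])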
After BERT, the data is processed by LSTM. The hidden layer representation of the bidirectional long short-term memory network (BiLSTM) is formed by splicing the forward and backward outputs of the LSTM. Through information flowing in both directions, it adds information from the following text to each LSTM time step. Using LSTM not only filters information through gating, but also enables the network to learn long-distance relationships between phrases.
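The splicing of the forward and backward outputs can be seen directly from the tensor shapes. Below is a short PyTorch sketch; the hidden size of 300 is taken from the detailed description, while the batch size and input dimension are illustrative.

import torch
import torch.nn as nn

rnn_hidden_size = 300
bilstm = nn.LSTM(input_size=768, hidden_size=rnn_hidden_size,
                 batch_first=True, bidirectional=True)

x = torch.randn(2, 128, 768)   # [batch_size, seq_len, bert_dim]
hidden, _ = bilstm(x)
# Each time step holds the forward and backward outputs spliced together,
# so the last dimension is rnn_hidden_size * 2.
print(hidden.shape)            # torch.Size([2, 128, 600])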
A multi-label classification method for processing microblog text cognitive distortion comprises the following steps:
Step 1, crawling the comment data set under a microblog "tree hole" post, roughly cleaning the data set (removing junk information such as advertisements), and then having professionals label the data in the data set.
Step 2, randomly dividing the microblog text data set into a training set, a validation set and a test set, and converting the excel file into tsv files according to the format requirements.
Step 3, constructing the model: pre-training with a BERT model, then training an LSTM model, and adding an Attention layer before the output layer of a normal BiLSTM model. The core of the attention mechanism is that an attention vector is generated and the weight of each dimension is updated through similarity calculation with the input vector, so that the value of the key words in a sentence is raised, the model focuses its attention on the key words, the effect of other irrelevant words is reduced, and the precision of text classification is further improved.
Step 3.1, performing word encoding on the text processed in step 2, namely obtaining the word vectors of the text through the BERT model;
Step 3.2, processing the vector obtained in step 3.1 with dropout, inputting the output vector into the hidden layer, computing the hidden-layer vector sequence in one direction with a standard LSTM, computing the bidirectional LSTM over hidden layers in two different directions, and finally merging and outputting the results of the two directions.
Step 3.3, adding an attention mechanism before the output, assigning different weights and bias terms to each output value of the BiLSTM, and computing the weight score of each word in each text with a torch matrix operation.
Step 3.4, normalizing the results of the previous step with softmax.
Step 3.5, computing the score of each class from the computed feature vector and outputting it.
Step 4, training the model constructed in step 3 and, after training, selecting the model that performs best on the test set as the result.
Compared with the prior art, the invention has the following advantages:
The invention is based on a text classification method fusing BERT, LSTM and the Attention mechanism: text preprocessing is performed on a plurality of Chinese corpora in a Chinese corpus data set to obtain a plurality of sequences corresponding to the Chinese corpora; the word embeddings of each sequence are extracted with a BERT model; feature extraction is performed on each sequence with LSTM and Attention to obtain the deep semantic features of the text corresponding to each sequence; and the model is trained and tested by classifying the obtained deep semantic features with a softmax classifier, realizing text classification. The model also has the following specific advantages:
(1) word embeddings are extracted through the BERT model, replacing the word2vec pre-training process of the prior art; as a bidirectional deep system, the BERT model can capture context information in the true sense;
(2) based on LSTM, both past and future information can be acquired, taking context information into account;
(3) the added Attention mechanism can highlight important features, avoids the problem of weakened historical memory caused by long sequences, and effectively improves the classification effect.
Drawings
FIG. 1 is a diagram of a Transformer model.
FIG. 2 is a diagram of the BERT model.
FIG. 3 is a general structure diagram of the model proposed by the present invention.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments and with reference to the attached drawings.
The invention provides a multi-label classification method for processing microblog text cognitive distortion, which specifically comprises the following steps:
the hardware equipment used by the invention comprises 1 PC and 1 1080 video card;
for convenience of description, the related terms appearing in the detailed description are explained:
(1) BERT (Bidirectional Encoder Representations from Transformers): a bidirectional encoder representation based on the Transformer;
(2) LSTM (Long Short-Term Memory): a long short-term memory network model;
(3) Attention: an attention mechanism;
step 1, crawling a comment data set under a microblog meal carrier, roughly cleaning the data set (removing some junk information such as advertisements) and then carrying out professional labeling on data in the data set.
Step 2, randomly dividing the microblog text data set into a training set, a validation set and a test set, and converting the excel file into tsv files according to the format requirements.
Step 2.1, reading the file content through a java FileReader and generating a tsv text file in which each line has the sentence first and the label after, separated by \t.
Step 2.2, when generating the files, taking 60 percent of the data as the training set, 20 percent as the validation set and 20 percent as the test set, so that the data of each label is distributed as uniformly as possible across the three sets.
Step 2.3, after the three tsv files are generated, shuffling the data with the shuffle method of java's Collections.
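The patent performs steps 2.1 to 2.3 with Java's FileReader and Collections.shuffle; the sketch below reproduces the same 60/20/20 split and tsv format in Python for illustration. The file name labeled_comments.xlsx and the column names sentence and label are assumptions, and a plain shuffle-then-split stands in for the per-label uniform distribution required in step 2.2.

import random
import pandas as pd

df = pd.read_excel("labeled_comments.xlsx")         # assumed columns: sentence, label
rows = list(df.itertuples(index=False))
random.shuffle(rows)                                # rough stand-in for a stratified split

n = len(rows)
splits = {
    "train.tsv": rows[: int(0.6 * n)],              # 60% training set
    "dev.tsv":   rows[int(0.6 * n): int(0.8 * n)],  # 20% validation set
    "test.tsv":  rows[int(0.8 * n):],               # 20% test set
}
for name, subset in splits.items():
    random.shuffle(subset)                          # the patent shuffles each generated file
    with open(name, "w", encoding="utf-8") as f:
        for row in subset:
            # sentence first, label after, separated by \t, as in step 2.1
            f.write(f"{row.sentence}\t{row.label}\n")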
Step 3, constructing the model: pre-training with a BERT model, then training an LSTM model, and adding an Attention layer before the output layer of a normal BiLSTM model. The core of the attention mechanism is that an attention vector is generated and the weight of each dimension is updated through similarity calculation with the input vector, so that the value of the key words in a sentence is raised, the model focuses its attention on the key words, the effect of other irrelevant words is reduced, and the precision of text classification is further improved.
Step 3.1, inputting the original text data set T and preprocessing the text data to obtain a text data set T′, where T = {t1, t2, ..., ti, ..., tlen(T)}, len(T) is the number of texts in T, and ti is the i-th text in T; T′ = {t′1, t′2, ..., t′j, ..., t′len(T′)}, len(T′) is the number of texts in T′, and t′j is the j-th text in T′; each text t′j is unified to a fixed length seq_len;
Step 3.2, vectorizing the text data set T′ with the pre-trained BERT model;
and 3.3, processing the vector obtained in step 3.2 by utilizing dropout. dropout refers to temporarily discarding a neural network unit from a network according to a certain probability in the training process of a deep learning network. The purpose is to prevent overfitting. This procedure yields a three-dimensional matrix with the shape [ batch _ size, seq _ len, bert _ dim ].
Step 3.4, the model inputs the output vector of step 3.3 into the hidden layer, computes the hidden-layer vector sequence in one direction with a standard LSTM, computes the bidirectional LSTM over hidden layers in two different directions, and finally merges and outputs the results of the two directions, producing a three-dimensional matrix of shape [batch_size, seq_len, rnn_hidden_size × 2] (where rnn_hidden_size is the hidden feature dimension, 300 by default; the third dimension is rnn_hidden_size × 2 because the network is bidirectional).
Step 3.5, adding an attention mechanism before the output, assigning different weights and bias terms to each output value of the BiLSTM, and computing the weight score of each word in the text with a torch matrix operation.
Step 3.6, normalizing the previous result with softmax and computing a weight for each time step, obtaining a matrix of shape [batch_size, seq_len, 1].
Step 3.7, combining the hidden states obtained in step 3.4 with the matrix obtained in step 3.6 and applying a torch.sum operation to obtain feat: [batch_size, rnn_hidden_size × 2], namely the extracted feature vector.
Step 3.8, computing the score of each class from the computed feature vector and outputting it.
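Steps 3.2 to 3.8 can be summarized as a single PyTorch module. The sketch below follows the shapes stated above; because the exact torch call for the attention scores is elided in the original text, a learned attention vector combined with torch.bmm is used here as one plausible realization rather than the inventors' exact formula.

import torch
import torch.nn as nn

class BertBiLSTMAttention(nn.Module):
    def __init__(self, bert_dim=768, rnn_hidden_size=300, num_labels=14,
                 dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)                  # step 3.3
        self.bilstm = nn.LSTM(bert_dim, rnn_hidden_size,
                              batch_first=True, bidirectional=True)  # step 3.4
        # Weights, bias and attention vector for the BiLSTM outputs (step 3.5).
        self.att_proj = nn.Linear(rnn_hidden_size * 2, rnn_hidden_size * 2)
        self.att_vector = nn.Parameter(torch.randn(rnn_hidden_size * 2, 1))
        self.classifier = nn.Linear(rnn_hidden_size * 2, num_labels)  # step 3.8

    def forward(self, bert_out):      # [batch_size, seq_len, bert_dim]
        x = self.dropout(bert_out)
        hidden, _ = self.bilstm(x)    # [batch_size, seq_len, rnn_hidden_size * 2]
        scores = torch.bmm(torch.tanh(self.att_proj(hidden)),
                           self.att_vector.expand(hidden.size(0), -1, -1))
        weights = torch.softmax(scores, dim=1)     # [batch_size, seq_len, 1], step 3.6
        feat = torch.sum(hidden * weights, dim=1)  # [batch_size, rnn_hidden_size * 2], step 3.7
        return self.classifier(feat)               # per-class scores (logits)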
Step 4, training the model constructed in step 3 and, after training, selecting the model that performs best on the test set as the result.
Step 4.1, the model's label input is a 14-dimensional vector, where 0 indicates the label is absent and 1 indicates the label is present (14-dimensional because cognitive distortions fall into roughly fourteen types).
Step 4.2, computing the loss from the class scores (logits) obtained in step 3 with the nn.BCEWithLogitsLoss() function.
Step 4.3, if the number of gradient accumulation steps is greater than 1, correcting by dividing the loss by this number.
Step 4.4, outputting the model with the best result on the test set as the training result.
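A minimal sketch of the loss computation of steps 4.1 to 4.3: a 14-dimensional multi-hot target, nn.BCEWithLogitsLoss over the logits, and division of the loss by the gradient-accumulation count. The batch contents and the accumulation value are illustrative.

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
gradient_accumulation_steps = 4          # illustrative value

# Class scores from step 3.8 for a batch; requires_grad so backward() works here.
logits = torch.randn(8, 14, requires_grad=True)
targets = torch.zeros(8, 14)             # 0 = label absent, 1 = label present (step 4.1)
targets[0, [2, 5]] = 1.0                 # e.g. sample 0 carries distortion types 2 and 5

loss = criterion(logits, targets)        # step 4.2
if gradient_accumulation_steps > 1:      # step 4.3
    loss = loss / gradient_accumulation_steps
loss.backward()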
The above embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and the scope of the present invention is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present invention, and such modifications and equivalents should also be considered as falling within the scope of the present invention.

Claims (4)

1. A multi-label classification method for processing microblog text cognitive distortion, characterized by comprising the following steps:
step 1, crawling the comment data set under a microblog "tree hole" post, first roughly cleaning the data set, and then having professionals label the data in the data set;
step 2, randomly dividing the microblog text data set into a training set, a validation set and a test set, and converting the excel file into tsv files according to the format requirements;
step 3, constructing the model: pre-training with a BERT model, then training an LSTM model, and adding an Attention layer before the output layer of a normal BiLSTM model, the core of the attention mechanism being that an attention vector is generated and the weight of each dimension is updated through similarity calculation with the input vector, so that the value of the key words in a sentence is raised, the model focuses its attention on the key words, the effect of other irrelevant words is reduced, and the precision of text classification is further improved;
and step 4, training the model constructed in step 3 and, after training, selecting the model that performs best on the test set as the result.
2. The multi-label classification method for processing microblog text cognitive distortion according to claim 1, characterized in that in step 2:
step 2.1, the file content is read through a java FileReader to generate a tsv text file in which each line has the sentence first and the label after, separated by \t;
step 2.2, when the files are generated, 60 percent of the data is taken as the training set, 20 percent as the validation set and 20 percent as the test set, so that the data of each label is uniformly distributed across the three sets;
and step 2.3, after the three tsv files are generated, the data is shuffled with the shuffle method of java's Collections.
3. The multi-label classification method for processing microblog text cognitive distortion according to claim 1, characterized in that in step 3:
step 3.1, word encoding is performed on the text processed in step 2, namely the word vectors of the text are obtained through the BERT model;
step 3.2, the vector obtained in step 3.1 is processed with dropout; the output vector is input into the hidden layer, the hidden-layer vector sequence is computed in one direction with a standard LSTM, the bidirectional LSTM is computed over hidden layers in two different directions, and the results of the two directions are finally merged and output;
step 3.3, an attention mechanism is added before the output, different weights and bias terms are assigned to each output value of the BiLSTM, and the weight score of each word in each text is computed with a torch matrix operation;
step 3.4, the result of the previous step is normalized with softmax;
and step 3.5, the score of each class is computed from the computed feature vector and output.
4. The multi-label classification method for processing microblog text cognitive distortion according to claim 1, characterized in that in step 4:
step 4.1, the model's label input is a 14-dimensional vector, where 0 indicates the label is absent and 1 indicates the label is present;
step 4.2, the loss is computed from the class scores (logits) obtained in step 3 with the nn.BCEWithLogitsLoss() function;
step 4.3, if the number of gradient accumulation steps is greater than 1, the loss is divided by this number as a correction;
and step 4.4, the model with the best result on the test set is output as the training result.
CN202011351175.9A 2020-11-27 2020-11-27 Multi-label classification method for processing microblog text cognitive distortion Pending CN112347766A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011351175.9A CN112347766A (en) 2020-11-27 2020-11-27 Multi-label classification method for processing microblog text cognitive distortion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011351175.9A CN112347766A (en) 2020-11-27 2020-11-27 Multi-label classification method for processing microblog text cognitive distortion

Publications (1)

Publication Number Publication Date
CN112347766A true CN112347766A (en) 2021-02-09

Family

ID=74364948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011351175.9A Pending CN112347766A (en) 2020-11-27 2020-11-27 Multi-label classification method for processing microblog text cognitive distortion

Country Status (1)

Country Link
CN (1) CN112347766A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309306A (en) * 2019-06-19 2019-10-08 淮阴工学院 A kind of Document Modeling classification method based on WSD level memory network
CN111401061A (en) * 2020-03-19 2020-07-10 昆明理工大学 Method for identifying news opinion involved in case based on BERT and Bi L STM-Attention

Non-Patent Citations (1)

Title
於张闲; 胡孔法: "Research on medical information classification based on the BERT-Att-biLSTM model" (基于BERT-Att-biLSTM模型的医学信息分类研究), 计算机时代 (Computer Era), no. 03, 15 March 2020 (2020-03-15), pages 1 - 4 *

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN113191148A (en) * 2021-04-30 2021-07-30 西安理工大学 Rail transit entity identification method based on semi-supervised learning and clustering
CN113779252A (en) * 2021-09-09 2021-12-10 安徽理工大学 Emotion classification method for Chinese short text based on electra + atten + BilSTM
CN115392218A (en) * 2022-07-15 2022-11-25 哈尔滨工业大学 Method and system for constructing pre-training language model
CN115392218B (en) * 2022-07-15 2023-06-20 哈尔滨工业大学 Method and system for constructing pre-training language model

Similar Documents

Publication Publication Date Title
Abdullah et al. SEDAT: sentiment and emotion detection in Arabic text using CNN-LSTM deep learning
CN110210037B (en) Syndrome-oriented medical field category detection method
Sasidhar et al. Emotion detection in hinglish (hindi+ english) code-mixed social media text
CN111382565B (en) Emotion-reason pair extraction method and system based on multiple labels
CN112347766A (en) Multi-label classification method for processing microblog text cognition distortion
CN108460089A (en) Diverse characteristics based on Attention neural networks merge Chinese Text Categorization
CN110020671B (en) Drug relationship classification model construction and classification method based on dual-channel CNN-LSTM network
CN110287323B (en) Target-oriented emotion classification method
CN107247702A (en) A kind of text emotion analysis and processing method and system
CN110502753A (en) A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
Sheshikala et al. Natural language processing and machine learning classifier used for detecting the author of the sentence
Zhao et al. Multi-level fusion of wav2vec 2.0 and bert for multimodal emotion recognition
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN114065848A (en) Chinese aspect level emotion classification method based on pre-training emotion embedding
Antit et al. TunRoBERTa: a Tunisian robustly optimized BERT approach model for sentiment analysis
Parvin et al. Multi-class textual emotion categorization using ensemble of convolutional and recurrent neural network
CN112597304A (en) Question classification method and application thereof
CN112784601A (en) Key information extraction method and device, electronic equipment and storage medium
Hasnat et al. Understanding sarcasm from reddit texts using supervised algorithms
Marerngsit et al. A two-stage text-to-emotion depressive disorder screening assistance based on contents from online community
AlBatayha Multi-topic labelling classification based on LSTM
KR20200040032A (en) A method ofr classification of korean postings based on bidirectional lstm-attention
CN114519355A (en) Medicine named entity recognition and entity standardization method
Shaw et al. Investigations in psychological stress detection from social media text using deep architectures

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination