CN116796740A

CN116796740A - Bad information identification method based on textCNN-Bert fusion model algorithm

Info

Publication number: CN116796740A
Application number: CN202310832134.9A
Authority: CN
Inventors: 裴卓雄; 杨婧; 殷伟; 杨敏
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2023-07-07
Filing date: 2023-07-07
Publication date: 2023-09-22

Abstract

The invention provides a bad information identification method based on a textCNN-Bert fusion model algorithm, belonging to the technical field of model-based bad information identification. The technical problem to be solved is to provide an improved method for identifying bad information using a textCNN-Bert fusion model algorithm. The technical scheme adopted is as follows: the text to be recognized is preprocessed by word segmentation, part-of-speech tagging and stop-word removal, and the preprocessed text is input as a sequence into the fusion model for recognition; the preprocessed text is first processed by a sensitive field topic identification module in the fusion model: if the sensitive field topic is identified as false, the text is judged to be irrelevant to the sensitive field and is output as general text information; if the sensitive field topic is identified as true, the text is input into an emotion metaphor identification module for further judgment. The invention is applied to bad information identification.

Description

Bad information identification method based on textCNN-Bert fusion model algorithm

Technical Field

The invention provides a bad information identification method based on a textCNN-Bert fusion model algorithm, and belongs to the technical field of bad information identification based on a model algorithm.

Background

With the vigorous development of the Internet industry, the flood of bad information on the Internet causes many social problems. Bad information in the sensitive field in particular is highly confusing and deceptive because it is fabricated, tampered with, invented and forged; it corrodes people's thinking, influences their values and judgment, and jeopardizes social security. Text is the main mode of propagation, and at present the corresponding content is reviewed mainly by hand, so review efficiency is low, the error rate is high, and misjudgment occurs easily. Research on identification technology for bad information in the sensitive field is therefore of profound significance.

For the review and identification of bad content, the natural language processing (NLP) technology adopted at present can analyze and understand text in depth, thereby classifying and identifying it. Kim, Y. proposed TextCNN, a convolutional neural network model for text classification, which can avoid the vanishing gradient problem to some extent and performs well on short text and fixed-length text. Lai, S. proposed the text classification model RCNN, which combines the advantages of convolutional and recurrent neural networks. Wang compared the performance of different recurrent neural network models on text classification tasks and showed the advantages of the LSTM model. Devlin, J. proposed the BERT model, a pre-trained model based on the Transformer network for natural language processing tasks such as text classification and language inference. Zhang proposed an information sentiment analysis method based on bidirectional emoticon embedding and an attention-based LSTM, which uses a bidirectional LSTM to learn context information in sentences, an attention mechanism to focus on important information, and emoticons to improve sentiment classification accuracy. Chen used the BERT model to classify financial news, fine-tuning it to suit the particular characteristics of financial news. Rehman, A.U. proposed a hybrid CNN-LSTM model for improving the accuracy of sentiment analysis of movie reviews, in which the CNN extracts local features and the LSTM learns sequence information, combining the advantages of both models.

Word vector technology represents words or phrases in text as vectors. The first step in implementing text classification based on NLP technology is to represent text using word vectors. Conventional NLP methods are based on discrete symbolic representations, in which each word is represented as a unique identifier or index; this does not take the semantic relationships between words into account and therefore cannot capture the similarity and correlation between words. By representing each word as a vector, word vector technology places semantically similar words closer together in the vector space, so that the semantic relationships among words are captured better. The core idea of the word2vec model is to represent each word as a vector and to measure the similarity between words by calculating the cosine similarity between their vectors. GloVe is a word vector learning method based on global word frequency statistics, which converts the co-occurrence information of words into distance relations in a vector space. The core idea of ELMo is to generate context-dependent word vector representations by training deep bidirectional language models, which has the advantage of capturing the semantic and grammatical information of words in different contexts and thereby improves the performance of natural language processing tasks.
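The cosine similarity measure mentioned above can be illustrated with a minimal sketch. The three-dimensional vectors and the words chosen are made-up values for illustration only and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Toy "word vectors" (hypothetical values, not learned embeddings).
vectors = {
    "king": [0.9, 0.1, 0.3],
    "queen": [0.8, 0.2, 0.35],
    "apple": [0.05, 0.9, 0.1],
}

sim_related = cosine_similarity(vectors["king"], vectors["queen"])
sim_unrelated = cosine_similarity(vectors["king"], vectors["apple"])
```

With these toy values the related pair scores higher than the unrelated pair, which is the property a discrete index representation cannot express.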

Based on the above theoretical research, the sensitive field is a specialized field in which research on bad information identification technology is very limited. General identification technology can be applied to it directly, but with the following problems. The first is the problem of field-specific language and terminology: the sensitive field has a rich set of field-specific language and terms that a generic model may not understand easily, which reduces text recognition accuracy. The second is the problem of background knowledge: the sensitive field involves knowledge of sensitive events, persons and backgrounds that may be unknown to the model and requires special processing to identify and understand. The third is the problem of text complexity: text in the sensitive field is very complex and contains a large number of metaphors, implicit metaphors and extended meanings, all of which the model must be able to recognize and understand. The algorithm models proposed so far do not solve these problems well, and the text recognition scheme for bad information needs further improvement.

Disclosure of Invention

The invention aims to overcome the defects in the prior art; the technical problem to be solved is to provide an improved method for identifying bad information using a textCNN-Bert fusion model algorithm.

In order to solve the technical problems, the invention adopts the following technical scheme: a bad information identification method based on a textCNN-Bert fusion model algorithm comprises the following information identification steps:

step one: preprocessing the text to be recognized by word segmentation, part-of-speech tagging and stop-word removal, and inputting the preprocessed text as a sequence $X = \{x_1, x_2, x_3, \dots, x_n\}$ $(n > 0)$ into the textCNN-Bert fusion model for recognition processing;

step two: inputting the preprocessed text into a sensitive field topic identification module in a textCNN-Bert fusion model for processing:

if the topic of the sensitive field is identified as false, judging that the topic is irrelevant to the sensitive field, and outputting the topic as non-sensitive text information;

if the sensitive field topic is identified as true, inputting the text into the sensitive emotion metaphor identification module for further judgment;

the specific steps of the sensitive field theme identification module for identifying the preprocessed text are as follows:

step 2.1: establishing a sensitive field topic recognition model by fine-tuning the trained Word2Vec model on a lexicon of the sensitive field to obtain word vectors better suited to the field;

step 2.1.1: preparing sensitive domain corpus and public large-scale corpus;

step 2.1.2: inputting the large-scale public corpus into a universal Word2Vec model for training to obtain universal Word vector representation content;

step 2.1.3: constructing a domain word stock based on the technical terms and common words of the acquired sensitive domain;

step 2.1.4: performing fine tuning update on the word vectors related to the field;

step 2.2: inputting the fine-tuned word vectors into a TextCNN convolutional neural network model for further processing:

the first layer of the model is an input layer, which receives the input text sequence and converts it into word embedding vectors, one vector per word; the vectors are arranged in sequence order to form a matrix, which is passed to the second layer;

the second layer is a convolution layer, which performs convolution operations on the input text matrix using a plurality of convolution kernels of different sizes to extract local features of the text, and passes the result to the third layer;

the third layer is a pooling layer, which compresses the dimensions of the feature maps and extracts important features, and passes the result to the fourth layer;

the fourth layer is a fully connected layer: the output of the pooling layer is connected to one or more fully connected layers, which learn the relationships between features and perform the final classification; finally, the output layer produces one of two categories, sensitive field text or non-sensitive field text;

step three: inputting the preprocessed text into an emotion metaphor recognition module in a textCNN-Bert fusion model for processing:

if the emotion metaphor is identified as true, judging that the emotion metaphor is sensitive text information and outputting;

if the emotion metaphor is identified as false, judging that the emotion metaphor is non-sensitive text information and outputting the non-sensitive text information;

step four: and outputting a judging result of the bad information based on the judging made in the second step and the third step.
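The control flow of steps two to four can be sketched as a two-stage cascade. This is a minimal sketch only: the two classifier callables are hypothetical stand-ins for the TextCNN topic module and the Bert metaphor module, and the returned label strings follow the wording of the steps above.

```python
from typing import Callable, Sequence

def identify_bad_information(
    tokens: Sequence[str],
    is_sensitive_topic: Callable[[Sequence[str]], bool],
    has_emotion_metaphor: Callable[[Sequence[str]], bool],
) -> str:
    """Two-stage cascade: topic check first, metaphor check only if needed."""
    if not tokens:
        raise ValueError("the input sequence must be non-empty (n > 0)")
    # Step two: sensitive field topic identification (TextCNN in the source).
    if not is_sensitive_topic(tokens):
        return "non-sensitive text information"
    # Step three: emotion metaphor identification (Bert in the source).
    if has_emotion_metaphor(tokens):
        return "sensitive text information"
    return "non-sensitive text information"

# Hypothetical toy classifiers, used only to exercise the control flow.
def toy_topic(tokens):
    return "topicword" in tokens

def toy_metaphor(tokens):
    return "veiled" in tokens

result = identify_bad_information(["topicword", "veiled"], toy_topic, toy_metaphor)
```

Because the second module runs only when the first returns true, text that is unrelated to the sensitive field never reaches the more expensive Bert stage.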

The specific method for constructing the domain word stock in the step 2.1.3 comprises the following steps:

the domain lexicon is constructed using a sensitive field lexicon construction method based on the TF-POS algorithm, in which domain words are obtained by word frequency statistics and part-of-speech analysis; the frequency with which a word occurs in sensitive field text is calculated and used to judge the correlation between the word and sensitive matters, and the calculation formula for word frequency is as follows:

and the expression specifying the part of speech of the field is:

POS＝{nr,ns,nt,nz,j}；

wherein: nr represents a person name, ns represents a place name, nt represents an organization group name, and j represents an abbreviation.
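The word frequency formula referred to above is not reproduced in this text. A minimal sketch of the TF-POS idea follows, assuming the common definition of term frequency as a word's count divided by the total number of tokens; the function name, the `min_tf` threshold and the sample tokens are hypothetical.

```python
from collections import Counter

# Part-of-speech tags kept by the lexicon filter, as listed in the source.
DOMAIN_POS = {"nr", "ns", "nt", "nz", "j"}

def build_domain_lexicon(tagged_tokens, min_tf=0.0):
    """Return {word: term frequency} for words whose tag is in DOMAIN_POS.

    tagged_tokens: iterable of (word, pos_tag) pairs from the domain corpus.
    min_tf: hypothetical threshold; words below it are dropped.
    """
    tagged_tokens = list(tagged_tokens)
    total = len(tagged_tokens)
    if total == 0:
        return {}
    # Counts cover every occurrence of a word, whatever tag it carried.
    counts = Counter(word for word, _ in tagged_tokens)
    domain_words = {word for word, pos in tagged_tokens if pos in DOMAIN_POS}
    return {
        word: counts[word] / total  # assumed TF: count / total tokens
        for word in domain_words
        if counts[word] / total >= min_tf
    }

# Made-up tagged corpus for illustration.
corpus = [("OrgA", "nt"), ("said", "v"), ("OrgA", "nt"),
          ("CityB", "ns"), ("today", "t")]
lexicon = build_domain_lexicon(corpus)
```

Words tagged as verbs or time expressions are excluded by the part-of-speech filter, while the frequency value allows rare candidates to be dropped through the threshold.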

The specific steps of the emotion metaphor recognition module for recognizing the preprocessing text in the step three are as follows:

step 3.1: identifying semantic metaphors in the sensitive text using the Bert pre-trained language model; specifically, the preprocessed text of the content to be identified is input to the input end of the Bert model, expressed as:

X＝{x1,x2,x3,…xn}(n≥0)；

the identification process of the model is as follows:

step 3.1.1: inputting a sequence word vector of the text;

step 3.1.2: extracting semantic information from the text through a Transformer encoding layer, which consists of a plurality of Transformer blocks, each composed of a multi-head self-attention mechanism and a feedforward neural network;

step 3.1.3: extracting deep semantic information through a Bert pre-training task layer;

step 3.1.4: classifying the text through a softmax function and outputting the judgment result as one of two labels, bad information or general information;

step 3.2: fine-tuning the Bert model, that is, further training the model on the basis of the pre-training stage to adapt it to the specific task; specifically, non-sensitive field text, sensitive field general information and sensitive field bad information are input as the training set and test set, and the model is trained and tuned according to a loss function and evaluation indexes; a cross-entropy loss function is adopted during fine-tuning, with the following calculation formula:

wherein: $y_i$ is the label of sample i, which is 1 for the positive class and 0 for the negative class, and $p_i$ is the probability that sample i is predicted to be the positive class.
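The loss formula itself is not reproduced in this text. Given the label and probability symbols defined above, the standard binary cross-entropy loss would take the following form; the sample count $N$ and the averaging over samples are assumptions, not taken from the source.

```latex
L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
```

Under this form, a sample with label 1 contributes $-\log p_i$ and a sample with label 0 contributes $-\log(1 - p_i)$, so confident wrong predictions are penalized most heavily.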

In step 3.1, the specific steps of pre-training the Bert model include an MLM phase and an NSP phase:

in the MLM phase, the BERT model takes a piece of text as input, and some of the words in it are replaced with [MASK] or with other random words; the objective of the model is to predict the replaced words;

in the NSP phase, the BERT model takes two sentences as input and predicts whether they are consecutive, so that the model learns the relationship between the two sentences.
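The MLM corruption step can be sketched as follows. The source does not state any masking rates; the 15% selection rate and the 80/10/10 split between [MASK], a random word and the unchanged word are the values from the original BERT paper and are used here as assumptions. The function name and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original word at positions the model must predict
    and None at positions that were not selected.
    """
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original word
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # replace with a random word
            else:
                corrupted.append(tok)                # keep, but still predict
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = ["w%d" % i for i in range(50)]
vocab = ["a", "b", "c"]
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the corruption reproducible, which is convenient when inspecting examples by hand.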

Compared with the prior art, the invention has the following beneficial effects: the invention converts the problem of identifying bad information in the sensitive field into a sensitive field topic identification task and an emotion metaphor identification task, and provides a scheme for identifying bad information in the sensitive field based on a textCNN-Bert fusion model. The scheme exploits the sensitivity of the textCNN model to keywords and local features, so that field-specific language and terms can be identified accurately, and it also exploits the pre-training capacity and self-attention mechanism of the Bert model to improve recognition of the metaphors, implicit metaphors and extended meanings in the target sensitive information text. In addition, the scheme takes the vocabulary characteristics of the sensitive field into account and adopts a sensitive field lexicon construction algorithm based on the TF-POS algorithm, which identifies and acquires the specialized vocabulary of the sensitive field by word frequency statistics and part-of-speech analysis, and therefore has clear advantages over existing algorithm models in accuracy, precision, recall and other indexes.

Drawings

The invention is further described below with reference to the accompanying drawings:

FIG. 1 is a flow chart of steps for identifying bad information based on a fusion model algorithm;

FIG. 2 is a flowchart illustrating steps for sensitive domain topic identification in accordance with the present invention;

FIG. 3 is a schematic diagram of a textCNN convolutional neural network model employed in the present invention;

FIG. 4 is a flowchart illustrating steps for emotion metaphor recognition in accordance with the present invention.

Detailed Description

The invention particularly provides a sensitive information identification method based on a textCNN-Bert fusion model algorithm, which adopts a natural language processing technology (Natural Language Processing, NLP) to train a deep learning model, can realize the identification of sensitive text information, and relates to technical means such as corpus construction, word vector representation, language model construction, text classification and the like in the model training process.

The textCNN-Bert fusion model provided by the invention is shown in figure 1. The input of the model is a preprocessed text sequence $X = \{x_1, x_2, x_3, \dots, x_n\}$, where the preprocessing comprises word segmentation, part-of-speech tagging and stop-word removal, and the output is a judgment result for sensitive text information. The recognition model comprises two modules, sensitive field topic recognition and sensitive emotion metaphor recognition:

if the sensitive field topic is identified as false, the text is judged to be irrelevant to the sensitive field and is general text information;

if the sensitive field topic is identified as true, the text is passed on as the input of emotion metaphor identification;

if the emotion metaphor is identified as true, the text is judged to be bad text information;

if the emotion metaphor is identified as false, the text is judged to be general information of the sensitive field.

Establishing the sensitive field topic identification model comprises lexicon fine-tuning and data processing by a neural network:

Word2Vec domain lexicon fine-tuning is adopted, that is, a trained Word2Vec model is fine-tuned on a lexicon of the specific field to obtain word vectors better suited to that field. As shown in fig. 2, the specific process is as follows:

first, a sensitive field corpus and a public large-scale corpus are prepared; second, the public large-scale corpus is input into a general Word2Vec model for training to obtain general word vector representations; then, a domain lexicon is constructed from the acquired technical terms and common vocabulary of the sensitive field; finally, the word vectors related to the field are fine-tuned and updated.
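The final fine-tuning stage can be illustrated with a simplified stand-in. The sketch below is not the Word2Vec implementation of any library and is not taken from the source: it assumes skip-gram training with negative sampling, starts from given general-purpose vectors, and updates only the vectors of words in the domain lexicon while all other word vectors stay frozen. All names, hyperparameters and toy values are hypothetical.

```python
import math
import random

def fine_tune_domain_vectors(vectors, sentences, lexicon, window=2,
                             negatives=2, lr=0.025, epochs=5, seed=0):
    """Skip-gram with negative sampling that updates only lexicon words.

    vectors: {word: list[float]} from the general model (copied, not mutated).
    sentences: tokenized domain corpus; lexicon: set of domain words to update.
    """
    rng = random.Random(seed)
    tuned = {w: list(v) for w, v in vectors.items()}
    dim = len(next(iter(tuned.values())))
    context = {w: [0.0] * dim for w in tuned}  # output-side vectors
    vocab = sorted(tuned)
    for _ in range(epochs):
        for sent in sentences:
            words = [w for w in sent if w in tuned]
            for i, center in enumerate(words):
                if center not in lexicon:
                    continue  # general-purpose vectors stay frozen
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue
                    targets = [(words[j], 1.0)]  # true context word
                    for _ in range(negatives):
                        neg = rng.choice(vocab)
                        if neg != words[j]:
                            targets.append((neg, 0.0))
                    v = tuned[center]
                    grad = [0.0] * dim
                    for target, label in targets:
                        u = context[target]
                        dot = sum(a * b for a, b in zip(v, u))
                        g = lr * (label - 1.0 / (1.0 + math.exp(-dot)))
                        for k in range(dim):
                            grad[k] += g * u[k]
                            u[k] += g * v[k]
                    for k in range(dim):
                        v[k] += grad[k]
    return tuned

general = {"the": [0.1, 0.2], "report": [0.3, -0.1],
           "orga": [0.2, 0.4], "said": [-0.3, 0.1]}
domain_corpus = [["the", "orga", "said", "report"]] * 3
tuned = fine_tune_domain_vectors(general, domain_corpus, {"orga"})
```

Restricting updates to lexicon words is one possible reading of "fine-tuning the word vectors related to the field"; a full implementation would normally also extend the vocabulary with domain words that the general model has never seen.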

Furthermore, the invention takes the vocabulary characteristics of the sensitive field into account and adopts a sensitive field lexicon construction method based on the TF-POS algorithm, which obtains the field vocabulary by word frequency statistics and part-of-speech analysis, as follows:

the frequency with which a word occurs in sensitive field text is calculated; this frequency is an important feature for judging the correlation between the word and sensitive matters, and the calculation formula for word frequency is as follows:

the expression specifying the parts of speech of the field is POS = {nr, ns, nt, nz, j}, wherein: nr represents a person name, ns represents a place name, nt represents an organization name, and j represents an abbreviation; information such as persons, organizations, events, times and places has special meaning in the sensitive field and therefore needs to be specified individually.

The structure of the adopted convolutional neural network textCNN is shown in figure 3. The first layer of the textCNN model is an input layer, which receives the input text sequence and converts it into word embedding vectors, one vector per word; the vectors are arranged in sequence order to form a matrix.

The second layer is a convolution layer, which performs convolution operations on the input text matrix using a plurality of convolution kernels of different sizes to extract local features of the text.

The third layer is a pooling layer, which compresses the dimensions of the feature maps and extracts important features.

The fourth layer is a fully connected layer: the output of the pooling layer is connected to one or more fully connected layers, which learn the relationships between features and perform the final classification.

Finally, the output layer produces one of two categories, sensitive field text or non-sensitive field text.
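The layer sequence described above can be traced with a minimal forward pass in plain Python. This is a sketch under several assumptions that the text does not fix: kernel heights of 2, 3 and 4, one filter per height, ReLU activation, max-over-time pooling, a single fully connected layer, and random untrained weights.

```python
import math
import random

def conv1d_relu(matrix, kernel, bias):
    """Slide one kernel (h x dim) over the token axis; return the ReLU feature map."""
    h = len(kernel)
    out = []
    for start in range(len(matrix) - h + 1):
        s = bias
        for r in range(h):
            s += sum(w * x for w, x in zip(kernel[r], matrix[start + r]))
        out.append(max(0.0, s))
    return out

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def textcnn_forward(matrix, kernels, biases, fc_weights, fc_bias):
    """matrix: n x dim embeddings, with n at least the largest kernel height."""
    # Convolution layer, then max-over-time pooling: one value per kernel.
    pooled = [max(conv1d_relu(matrix, k, b)) for k, b in zip(kernels, biases)]
    # Fully connected layer producing two logits, then softmax output layer.
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(fc_weights, fc_bias)]
    return softmax(logits)

rng = random.Random(0)
dim, n = 4, 6  # embedding size and sentence length (toy values)
matrix = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
kernels = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(h)]
           for h in (2, 3, 4)]
biases = [0.0] * len(kernels)
fc_weights = [[rng.uniform(-1, 1) for _ in kernels] for _ in range(2)]
fc_bias = [0.0, 0.0]
probs = textcnn_forward(matrix, kernels, biases, fc_weights, fc_bias)
```

The two output probabilities correspond to the sensitive field and non-sensitive field categories; kernels of different heights respond to word n-grams of different lengths, which is why the model is sensitive to keywords and short phrases.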

When the emotion metaphor recognition model analyzes sensitive information, part of the bad information is expressed covertly and wrapped in concealing, misleading packaging, which makes it very easy to confuse with normal text content. The key to accurately recognizing bad information is therefore whether the semantic metaphors in the sensitive text can be recognized. BERT (Bidirectional Encoder Representations from Transformers), adopted by the invention, is a pre-trained natural language processing model that is suitable for recognizing and understanding semantic metaphors.

As shown in fig. 4, the preprocessed text of the content to be identified is input to the input end of the BERT model, expressed as:

X＝{x1,x2,x3,…xn}(n≥0)；

the output end of the model is the judging result.

The identification process of the model is as follows:

The first step is to input the word vectors of the text sequence.

The second step is to extract semantic information from the text through a Transformer encoding layer consisting of a plurality of Transformer blocks, each composed of a multi-head self-attention mechanism and a feed-forward neural network.

The third step is to extract deep semantic information through the Bert pre-training task layer.

The fourth step is to classify the text through a softmax function and output one of two labels, bad information or general information.
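The self-attention mechanism inside each Transformer block can be illustrated with a single-head sketch in plain Python. This is scaled dot-product attention in its usual form; the toy token vectors and the identity projection matrices are placeholders, and a real Bert model uses 12 heads with learned projections.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an (n x k) list-of-lists by a (k x m) list-of-lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token vectors x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    weights = []
    for qi in q:
        # Similarity of this token's query to every token's key.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k)
                  for kj in k]
        weights.append(softmax(scores))
    # Each output row is a weighted mix of all value vectors.
    return matmul(weights, v), weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
identity = [[1.0, 0.0], [0.0, 1.0]]       # placeholder projection matrices
output, weights = self_attention(x, identity, identity, identity)
```

Every token's output depends on all other tokens in the sequence, which is what lets the model relate a veiled expression to the context that gives it its meaning.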

The emotion metaphor recognition model adopted by the invention is obtained through two steps of Bert pre-training and Bert model fine tuning.

Pre-training of the BERT model is divided into two phases, Masked Language Model (MLM) and Next Sentence Prediction (NSP):

in the MLM phase, BERT takes a piece of text as input and replaces some of the words in it with [MASK] or with other random words; the objective of the model is to predict the replaced words;

in the NSP phase, BERT takes two sentences as input and predicts whether they are consecutive; the purpose of this task is to let the model learn the relationship between two sentences.

The invention adopts the open-source pre-trained model BERT-base-Chinese released on Hugging Face, which is a BERT model trained on uncased data sets, comprises 12 layers, 768 hidden units and 12 attention heads, and is suitable for tasks such as Chinese text classification.

The invention inputs news data, sensitive field text data and sensitive field bad information as the training set and test set, and trains and tunes the model according to a loss function and evaluation indexes; a cross-entropy loss function is adopted during model tuning, expressed as follows:

According to the experimental data of the embodiment, the textCNN-Bert fusion model provided by the invention for sensitive information identification is superior to currently used models such as TextCNN, LSTM and Bert on every evaluation index; the specific experimental details are as follows:

Three data sets were collected for the experiments:

the first part of the experimental data set is derived from the whole-network news data of Sogou Labs, from which 10 categories were selected (automobiles, science and technology, health, sports, real estate, education, travel, culture, IT and fashion), each category comprising about 2000 texts;

the second part of the experimental data set is sensitive field data;

the third part of the experimental data set is bad information in the sensitive field.

After manual processing and labeling, the data set comprises 820,000 sentences of non-sensitive field data, 730,000 sentences of sensitive field general information data and 780,000 sentences of sensitive field bad information data; the labeled data set is divided into a training set, a verification set and a test set in the ratio 6:2:2.
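A 6:2:2 division can be sketched as below. The shuffling, the fixed seed and the choice to give any rounding remainder to the test set are assumptions; the source does not say whether the split was stratified by category.

```python
import random

def split_dataset(samples, ratios=(6, 2, 2), seed=42):
    """Shuffle and split samples into train/validation/test by integer ratios."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Integers stand in for labeled sentences in this illustration.
train, val, test = split_dataset(range(100))
```

In practice each of the three labeled classes would usually be split separately so that the class proportions are the same in all three subsets.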

To verify the effectiveness of the bad information text recognition method based on the textCNN-Bert fusion model, textCNN, LSTM and BERT are selected as baseline models; the experimental environment and model parameters are set as shown in tables 1 and 2:

Table 1. TextCNN model parameters

Table 2. Bert model parameters

The evaluation indexes adopted in the experiment are Accuracy, Precision, Recall and F1-score;

accuracy refers to the proportion of correctly predicted samples among all samples, and the calculation expression is as follows:

recall refers to the proportion of correctly predicted positive samples among all actual positive samples, and the calculation expression is:

precision refers to the proportion of actual positive samples among the samples predicted to be positive, and the calculation expression is as follows:

the F1 value combines precision and recall, giving Pre and Rec equal weight, and is their harmonic mean; it is generally used as a comprehensive evaluation index, and the higher the F1 value, the better the performance of the model; the calculation expression is as follows:
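The four expressions referred to above are not reproduced in this text. Under the standard confusion-matrix definitions, which are assumed here, with TP, FP, TN and FN denoting true positives, false positives, true negatives and false negatives: Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 × Pre × Rec / (Pre + Rec). A minimal sketch follows; the counts are made up and are not the patent's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for illustration only.
metrics = classification_metrics(tp=80, fp=20, tn=90, fn=10)
```

A model that labels general sensitive field information as bad information raises FP, which lowers precision while leaving recall unchanged.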

based on the above calculation data, comparison data of the recognition effect of each model is shown in table 3:

Table 3. Comparison data of model recognition effect

As can be seen from the above comparison data, the textCNN-Bert fusion model provided by the invention is superior to the classification models TextCNN, LSTM and Bert on the evaluation indexes. The Precision values of textCNN and LSTM are clearly lower than their other indexes because these standalone models cannot understand deep semantics and therefore judge general information of the sensitive field to be bad information. The indexes of the Bert model used alone are lower than those of the fusion model used by the invention because Bert alone is insensitive to the specialized vocabulary of the sensitive field and cannot make correct identification judgments, so that some irrelevant content is also judged to be bad information of the sensitive field.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A bad information identification method based on a textCNN-Bert fusion model algorithm is characterized by comprising the following steps of: the method comprises the following information identification steps:

step two: inputting the preprocessed text into a sensitive field topic identification module in a textCNN-Bert fusion model for processing: if the topic of the sensitive field is identified as false, judging that the topic is irrelevant to the sensitive field, and outputting the topic as non-sensitive text information;

step 2.1.1: preparing sensitive domain corpus and public large-scale corpus;

2. The method for identifying the bad information based on the TextCNN-Bert fusion model algorithm according to claim 1, wherein the method comprises the following steps: the specific method for constructing the domain word stock in the step 2.1.3 comprises the following steps:

and the expression specifying the part of speech of the field is:

POS＝{nr,ns,nt,nz,j}；

3. The method for identifying the bad information based on the TextCNN-Bert fusion model algorithm according to claim 1, wherein the method comprises the following steps: the specific steps of the emotion metaphor recognition module for recognizing the preprocessing text in the step three are as follows:

X＝{x1,x2,x3,…xn}(n≥0)；

the identification process of the model is as follows:

step 3.1.1: inputting a sequence word vector of the text;

4. The method for identifying the bad information based on the TextCNN-Bert fusion model algorithm according to claim 3, wherein the method comprises the following steps of: in step 3.1, the specific steps of pre-training the Bert model include an MLM phase and an NSP phase: