CN116796740A - Bad information identification method based on textCNN-Bert fusion model algorithm - Google Patents

Bad information identification method based on textCNN-Bert fusion model algorithm

Info

Publication number
CN116796740A
Authority
CN
China
Prior art keywords
text
model
sensitive
bert
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310832134.9A
Other languages
Chinese (zh)
Inventor
裴卓雄
杨婧
殷伟
杨敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN202310832134.9A
Publication of CN116796740A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a bad information identification method based on a textCNN-Bert fusion model algorithm, and belongs to the technical field of model-based bad information identification. The technical problem to be solved is to provide an improved method for identifying bad information with a textCNN-Bert fusion model algorithm. The technical scheme is as follows: the text to be recognized is preprocessed by word segmentation, part-of-speech tagging and stop-word removal, and the preprocessed text is input into the fusion model as a sequence for recognition. The preprocessed text is first processed by the sensitive-domain topic identification module of the fusion model: if the sensitive-domain topic is identified as false, the text is judged to be unrelated to the sensitive domain and is output as general text information; if the sensitive-domain topic is identified as true, the text is passed to the emotion metaphor identification module for further judgment. The invention is applied to bad information identification.

Description

Bad information identification method based on textCNN-Bert fusion model algorithm
Technical Field
The invention provides a bad information identification method based on a textCNN-Bert fusion model algorithm, and belongs to the technical field of bad information identification based on a model algorithm.
Background
With the rapid development of the Internet industry, the proliferation of bad information online causes many social problems. Bad information in sensitive fields in particular is highly confusing and deceptive: through fabrication, tampering, impersonation and forgery it corrodes people's thinking, affects their values and judgment, and endangers social security. Text is the main mode of propagation, and review of such content is currently performed mainly by hand, so review efficiency is low, the error rate is high, and misjudgment occurs easily. Research on techniques for identifying bad information in sensitive fields is therefore of great significance.
For the review and identification of bad content, natural language processing (NLP) technology can analyze and understand text in depth and thereby classify and identify it. Kim, Y. proposed TextCNN, a convolutional neural network model for text classification, which can avoid the vanishing-gradient problem to some extent and performs well on short and fixed-length text. Lai, S. proposed the text classification model RCNN, which combines the advantages of convolutional and recurrent neural networks. Wang compared the performance of different recurrent neural network models on text classification tasks and showed the advantages of LSTM models. Devlin, J. proposed the BERT model, a pre-trained model based on the Transformer network for natural language processing tasks such as text classification and language inference. Zhang proposed an attention-based LSTM sentiment analysis method with bidirectional emoticon embedding, which uses a bidirectional LSTM to learn context within sentences, an attention mechanism to emphasize important information, and emoticons to improve sentiment classification accuracy. Chen applied the BERT model to the classification of financial news text, fine-tuning it to suit the particular characteristics of financial news. Rehman, A. U. proposed a hybrid CNN-LSTM model to improve the accuracy of sentiment analysis of film reviews; the model uses a CNN to extract local features and an LSTM to learn sequence information, combining the advantages of both.
Word vector technology represents the words or phrases of a text as vectors, and representing text with word vectors is the first step in NLP-based text classification. Conventional NLP methods rely on discrete symbolic representations, in which each word is a unique identifier or index; such representations ignore the semantic relations between words and therefore cannot capture their similarity or correlation. By representing each word as a vector, word vector technology places semantically similar words closer together in the vector space and so captures the semantic relations between words better. The core idea of the word2vec model is to represent each word as a vector and to measure the similarity between words by the cosine similarity of their vectors. GloVe is a word vector learning method based on global word-frequency statistics that converts word co-occurrence information into distance relations in the vector space. The core idea of ELMo is to generate context-dependent word vector representations by training deep bidirectional language models; its advantage is that it captures the semantic and grammatical information of a word in different contexts, improving performance on natural language processing tasks.
Against this background, the sensitive field is a specialized field in which research on bad information identification is very limited. General identification techniques can be applied directly, but they face the following problems. First, domain-specific language and terminology: sensitive fields have a rich domain-specific vocabulary that a generic model may not readily understand, which reduces recognition accuracy. Second, background knowledge: the sensitive domain involves knowledge of sensitive events, persons and backgrounds that may be unknown to the model and that requires special processing to identify and understand. Third, text complexity: text in the sensitive field is highly complex and contains many metaphors, implicit analogies and extended meanings, all of which the model must be able to recognize and understand. The algorithm models proposed so far do not solve these problems well, so text recognition schemes for bad information need further improvement.
Disclosure of Invention
The invention aims to overcome the deficiencies of the prior art; the technical problem to be solved is to provide an improved method for identifying bad information with a textCNN-Bert fusion model algorithm.
In order to solve the technical problems, the invention adopts the following technical scheme: a bad information identification method based on a textCNN-Bert fusion model algorithm comprises the following information identification steps:
step one: preprocessing the text to be recognized by word segmentation, part-of-speech tagging and stop-word removal, and inputting the preprocessed text as a sequence $X = \{x_1, x_2, x_3, \ldots, x_n\}$ $(n > 0)$ into the textCNN-Bert fusion model for recognition processing;
step two: inputting the preprocessed text into the sensitive-domain topic identification module of the textCNN-Bert fusion model for processing:
if the sensitive-domain topic is identified as false, the text is judged to be unrelated to the sensitive domain and is output as non-sensitive text information;
if the sensitive-domain topic is identified as true, the text is passed to the sensitive emotion metaphor identification module for further judgment;
the sensitive-domain topic identification module identifies the preprocessed text by the following specific steps:
step 2.1: establishing a sensitive-domain topic recognition model, and fine-tuning an already trained Word2Vec model on a sensitive-domain lexicon to obtain word vectors better suited to the domain;
step 2.1.1: preparing sensitive domain corpus and public large-scale corpus;
step 2.1.2: inputting the large-scale public corpus into a universal Word2Vec model for training to obtain universal Word vector representation content;
step 2.1.3: constructing a domain word stock based on the technical terms and common words of the acquired sensitive domain;
step 2.1.4: performing fine tuning update on the word vectors related to the field;
step 2.2: inputting the fine-tuned word vectors into a TextCNN convolutional neural network model for further processing:
the first layer of the model is an input layer, which receives the input text sequence and converts it into word embedding vectors, one vector per word; the vectors are stacked in sequence order to form a matrix, which is passed to the second layer;
the second layer is a convolution layer, which applies convolution kernels of several different sizes to the input text matrix to extract local features of the text, and passes the result to the third layer;
the third layer is a pooling layer, which compresses the dimensions of the feature maps and extracts the important features, and passes the result to the fourth layer;
the fourth layer is a fully connected layer: the output of the pooling layer is connected to one or more fully connected layers, which learn the relations between features and perform the final classification; the result then enters the output layer, whose output is one of two categories, sensitive-domain text or non-sensitive-domain text;
step three: inputting the preprocessed text into the emotion metaphor recognition module of the textCNN-Bert fusion model for processing:
if the emotion metaphor is identified as true, the text is judged to be sensitive text information and is output;
if the emotion metaphor is identified as false, the text is judged to be non-sensitive text information and is output;
step four: outputting the bad information judgment result based on the judgments made in step two and step three.
The specific method for constructing the domain lexicon in step 2.1.3 is as follows:
a sensitive-domain lexicon construction method based on the TF-POS algorithm is adopted, in which domain words are obtained by word-frequency statistics and part-of-speech analysis: the frequency with which a word occurs in sensitive-domain text is computed and used to judge the relevance of the word to sensitive matters, and the word frequency is calculated as follows:
and the parts of speech specified for the domain are expressed as:
POS = {nr, ns, nt, nz, j};
wherein: nr denotes a person name, ns a place name, nt an organization or group name, and j an abbreviation.
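The word-frequency expression referred to above is not reproduced in this text. A common term-frequency definition, given here only as an assumption and not as the patent's own formula, divides the number of occurrences of a word by the total number of word occurrences in the sensitive-domain corpus. The symbols $n_w$ (occurrences of word $w$) and $V$ (the corpus vocabulary) are introduced for illustration:

```latex
\mathrm{TF}(w) = \frac{n_w}{\sum_{v \in V} n_v}
```

Under this reading, a word is a lexicon candidate when its part of speech belongs to POS and its TF value is high enough; the cut-off value is not stated in the text.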
The emotion metaphor recognition module in step three recognizes the preprocessed text by the following specific steps:
step 3.1: identifying semantic metaphors in the sensitive text with a Bert pre-trained language model, specifically by inputting the preprocessed text of the content to be identified to the Bert model, expressed as:
$X = \{x_1, x_2, x_3, \ldots, x_n\}$ $(n \geq 0)$;
the identification process of the model is as follows:
step 3.1.1: inputting the word vectors of the text sequence;
step 3.1.2: extracting the semantic information of the text through a Transformer encoding layer, which consists of multiple Transformer blocks, each composed of a multi-head self-attention mechanism and a feed-forward neural network;
step 3.1.3: extracting deep semantic information through the Bert pre-training task layer;
step 3.1.4: classifying the text with a softmax function and outputting one of two labels, bad information or general information;
step 3.2: fine-tuning the Bert model, that is, further training the model on top of the pre-training stage to adapt it to the specific task: non-sensitive-domain text, sensitive-domain general information and sensitive-domain bad information are used as the training set and test set, the model is trained and tuned according to a loss function and evaluation indexes, and a cross-entropy loss function is used during fine-tuning, calculated as follows:
wherein: $y_i$ denotes the label of sample $i$ (1 for the positive class, 0 for the negative class), and $p_i$ denotes the probability that sample $i$ is predicted to be positive.
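The loss expression itself is not reproduced in this text. With the label and probability symbols defined above, the standard binary cross-entropy averaged over $N$ samples would read as follows; $N$, the number of samples, is notation assumed here, and the averaging convention is likewise an assumption:

```latex
L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
```

The first term penalizes a low predicted probability for positive samples, and the second penalizes a high predicted probability for negative samples.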
In step 3.1, the specific steps of pre-training the Bert model include an MLM phase and an NSP phase:
in the MLM phase, the BERT model takes a piece of text as input, and some of its words are replaced with [MASK] or with other random words; the objective of the model is to predict the replaced words;
in the NSP phase, the BERT model takes two sentences as input and predicts whether they are consecutive, so that the model learns the relationship between two sentences.
Compared with the prior art, the invention has the following beneficial effects. The invention converts the problem of identifying bad information in the sensitive field into a sensitive-domain topic identification task and an emotion metaphor identification task, and provides a scheme for identifying bad information in the sensitive field based on a textCNN-Bert fusion model. The scheme exploits the sensitivity of the textCNN model to keywords and local features, so that domain-specific language and terms of the sensitive field are identified accurately, and it uses the pre-training and self-attention mechanism of the Bert model to improve recognition of the metaphors, implicit analogies and extended meanings in the target text. In addition, the scheme takes the vocabulary characteristics of the sensitive field into account: it adopts a sensitive-domain lexicon construction algorithm based on the TF-POS algorithm, which identifies and collects the specialized vocabulary of the sensitive field through word-frequency statistics and part-of-speech analysis. The scheme therefore has clear advantages over existing algorithm models in accuracy, precision, recall and related indexes.
Drawings
The invention is further described below with reference to the accompanying drawings:
FIG. 1 is a flow chart of steps for identifying bad information based on a fusion model algorithm;
FIG. 2 is a flowchart illustrating steps for sensitive domain topic identification in accordance with the present invention;
FIG. 3 is a schematic diagram of a textCNN convolutional neural network model employed in the present invention;
FIG. 4 is a flowchart illustrating steps for emotion metaphor recognition in accordance with the present invention.
Detailed Description
The invention particularly provides a sensitive information identification method based on a textCNN-Bert fusion model algorithm, which adopts a natural language processing technology (Natural Language Processing, NLP) to train a deep learning model, can realize the identification of sensitive text information, and relates to technical means such as corpus construction, word vector representation, language model construction, text classification and the like in the model training process.
The textCNN-Bert fusion model provided by the invention is shown in FIG. 1. The model input is a preprocessed text sequence $X = \{x_1, x_2, x_3, \ldots, x_n\}$, where preprocessing comprises word segmentation, part-of-speech tagging and stop-word removal, and the output is the judgment of whether the text is sensitive text information. The recognition model comprises two modules, sensitive-domain topic recognition and sensitive emotion metaphor recognition:
if the sensitive-domain topic is identified as false, the text is judged to be unrelated to the sensitive domain and is general text information;
if the sensitive-domain topic is identified as true, the text is passed as input to emotion metaphor identification;
if the emotion metaphor is identified as true, the text is judged to be bad text information;
if the emotion metaphor is identified as false, the text is judged to be general information of the sensitive domain.
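The four rules above amount to a two-stage cascade. A minimal sketch of that control flow follows; the function `identify`, the label constants and the two toy predicates are hypothetical stand-ins for the trained textCNN and Bert modules, not part of the source:

```python
from typing import Callable, Sequence

# Hypothetical label names mirroring the three outcomes described in the text.
GENERAL_TEXT = "general text information"
SENSITIVE_GENERAL = "general information of the sensitive domain"
BAD_INFO = "bad text information"

Predicate = Callable[[Sequence[str]], bool]


def identify(tokens: Sequence[str],
             is_sensitive_topic: Predicate,
             is_bad_metaphor: Predicate) -> str:
    """Two-stage cascade: topic gate first, metaphor check second."""
    if not is_sensitive_topic(tokens):
        return GENERAL_TEXT          # stage 1 false: stop here
    if is_bad_metaphor(tokens):
        return BAD_INFO              # stage 1 true, stage 2 true
    return SENSITIVE_GENERAL         # stage 1 true, stage 2 false


# Toy predicates standing in for the two trained modules (assumption).
def toy_topic(tokens: Sequence[str]) -> bool:
    return "topicword" in tokens


def toy_metaphor(tokens: Sequence[str]) -> bool:
    return "codeword" in tokens


print(identify(["topicword", "codeword"], toy_topic, toy_metaphor))
```

Because the metaphor module only runs on text that passes the topic gate, text outside the sensitive domain is never labeled as bad information, whatever the second module would have said.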
Establishing the sensitive-domain topic identification model comprises lexicon-based fine-tuning and processing by a neural network:
Word2Vec fine-tuning on the domain lexicon is adopted, that is, an already trained Word2Vec model is fine-tuned on the lexicon of the specific domain to obtain word vectors better suited to that domain. As shown in FIG. 2, the specific process is as follows:
first, a sensitive-domain corpus and a large-scale public corpus are prepared; second, the large-scale public corpus is used to train a general Word2Vec model, yielding general word vector representations; then a domain lexicon is constructed from the collected technical terms and common vocabulary of the sensitive domain; finally, the word vectors related to the domain are fine-tuned and updated.
Furthermore, the invention takes the characteristics of sensitive-domain vocabulary into account and adopts a sensitive-domain lexicon construction method based on the TF-POS algorithm, in which domain vocabulary is obtained through word-frequency statistics and part-of-speech analysis, as follows:
the frequency with which a word occurs in sensitive-domain text is computed; this frequency is an important feature for judging the relevance of the word to sensitive matters, and the word frequency is calculated as follows:
the parts of speech specified for the domain are POS = {nr, ns, nt, nz, j}, wherein nr denotes a person name, ns a place name, nt an organization or group name, and j an abbreviation. Information such as persons, organizations, events, times and places carries special meaning in the sensitive field and therefore needs to be specified separately.
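The TF-POS selection described above can be sketched as a frequency count restricted to the listed part-of-speech tags. In this sketch the function `build_lexicon`, the threshold `min_tf` and the definition of TF as occurrences divided by total tokens are all assumptions, since the text preserves neither the formula nor a cut-off value:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Part-of-speech tags kept for the domain lexicon (from the source).
# "nz" is not defined in the text; in ICTCLAS-style tag sets it usually
# marks other proper nouns (assumption).
DOMAIN_POS = {"nr", "ns", "nt", "nz", "j"}


def build_lexicon(tagged: Iterable[Tuple[str, str]],
                  min_tf: float) -> Dict[str, float]:
    """Return {word: TF} for words tagged with a domain POS and TF >= min_tf.

    TF is taken as occurrences of the word divided by the total number of
    tokens, which is an assumed definition.
    """
    pairs = list(tagged)
    total = len(pairs)
    if total == 0:
        return {}
    # Count only occurrences that carry one of the domain POS tags.
    counts = Counter(word for word, pos in pairs if pos in DOMAIN_POS)
    return {w: c / total for w, c in counts.items() if c / total >= min_tf}


# Toy (word, tag) pairs, as a segmenter with POS tagging might emit.
corpus = [("OrgA", "nt"), ("said", "v"), ("OrgA", "nt"), ("CityB", "ns"),
          ("today", "t"), ("is", "v"), ("fine", "a"), ("OrgA", "nt")]
lexicon = build_lexicon(corpus, min_tf=0.2)
print(lexicon)
```

The input is expected to be already segmented and tagged, so any Chinese segmenter that emits (word, tag) pairs could feed this function.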
The structure of the textCNN convolutional neural network adopted is shown in FIG. 3. The first layer of the textCNN model is an input layer, which receives the input text sequence and converts it into word embedding vectors, one vector per word; the vectors are stacked in sequence order to form a matrix.
The second layer is a convolution layer, which applies convolution kernels of several different sizes to the input text matrix to extract local features of the text.
The third layer is a pooling layer, which compresses the dimensions of the feature maps and extracts the important features.
The fourth layer is a fully connected layer: the output of the pooling layer is connected to one or more fully connected layers, which learn the relations between features and perform the final classification.
Finally, the output layer produces one of two categories: sensitive-domain text or non-sensitive-domain text.
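To make the layer sequence concrete, the following is a minimal pure-Python sketch of one forward pass: kernels of different heights slide over the text matrix, each feature map is reduced by max pooling, and a single linear layer scores the pooled features. All numeric values, the use of ReLU, the choice of max pooling and the function names are illustrative assumptions; a real model would learn its parameters in a deep learning framework.

```python
from typing import List

Matrix = List[List[float]]


def conv_max_pool(text: Matrix, kernel: Matrix, bias: float = 0.0) -> float:
    """Slide an (h x d) kernel over the (n x d) text matrix, apply ReLU,
    and keep the maximum activation over all positions."""
    h, n = len(kernel), len(text)
    if n < h:
        raise ValueError("text has fewer rows than the kernel")
    best = 0.0  # starting at 0 folds the ReLU into the running maximum
    for start in range(n - h + 1):
        act = bias
        for i in range(h):
            act += sum(x * w for x, w in zip(text[start + i], kernel[i]))
        best = max(best, act)
    return best


def textcnn_score(text: Matrix, kernels: List[Matrix],
                  fc_weights: List[float], fc_bias: float) -> float:
    """Pool one feature per kernel, then apply a single linear layer."""
    features = [conv_max_pool(text, k) for k in kernels]
    return sum(f * w for f, w in zip(features, fc_weights)) + fc_bias


# Four words, embedding dimension 2 (made-up values).
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],              # kernel covering 2 words
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # kernel covering 3 words
]
features = [conv_max_pool(text, k) for k in kernels]
score = textcnn_score(text, kernels, fc_weights=[1.0, -0.25], fc_bias=0.0)
label = "sensitive-domain text" if score > 0 else "non-sensitive-domain text"
print(features, score, label)
```

Kernels of different heights respond to word groups of different lengths, which is why the text calls for several kernel sizes rather than one.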
When the emotion metaphor recognition model analyzes sensitive information, part of the bad information is expressed covertly, wrapped in concealing and misleading packaging that makes it easy to confuse with normal text. The key to accurately recognizing bad information is therefore whether the semantic metaphors in the sensitive text can be recognized. BERT (Bidirectional Encoder Representations from Transformers), adopted by the invention, is a pre-trained natural language processing model suited to recognizing and understanding semantic metaphors.
As shown in FIG. 4, the preprocessed text of the content to be identified is input to the BERT model, expressed as:
$X = \{x_1, x_2, x_3, \ldots, x_n\}$ $(n \geq 0)$;
the output of the model is the judgment result.
The identification process of the model is as follows:
The first step is to input the word vectors of the text sequence.
The second step is to extract the semantic information of the text through a Transformer encoding layer, which consists of multiple Transformer blocks, each composed of a multi-head self-attention mechanism and a feed-forward neural network.
The third step is to extract deep semantic information through the Bert pre-training task layer.
The fourth step is to classify the text with a softmax function and output one of two labels: bad information or general information.
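The final classification step can be illustrated with a small softmax over two scores. In this sketch the two logits stand for the outputs of a classification head on top of the encoder; the label order in `LABELS` and the function names are assumptions:

```python
import math
from typing import List, Tuple

# Assumed label order: index 0 = general information, index 1 = bad information.
LABELS = ["general information", "bad information"]


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: List[float]) -> Tuple[str, List[float]]:
    """Return the most probable label together with all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


label, probs = classify([-1.0, 2.0])
print(label, probs)
```

With only two labels the softmax is equivalent to a sigmoid over the difference of the two logits, so either formulation yields the same decision.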
The emotion metaphor recognition model adopted by the invention is obtained in two steps: Bert pre-training and Bert fine-tuning.
Pre-training of the BERT model is divided into two phases, Masked Language Model (MLM) and Next Sentence Prediction (NSP):
in the MLM phase, BERT takes a piece of text as input and some of its words are replaced with [MASK] or with other random words; the goal of the model is to predict the replaced words;
in the NSP phase, BERT takes two sentences as input and predicts whether they are consecutive; the purpose of this task is to let the model understand the relationship between two sentences.
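The MLM corruption step can be sketched as below. The text does not give selection or replacement rates; the 15% selection rate and the 80/10/10 split between [MASK], a random word and the unchanged word follow the original BERT recipe and are assumptions here, as are the function and parameter names:

```python
import random
from typing import List, Sequence, Tuple

MASK = "[MASK]"


def mask_tokens(tokens: Sequence[str], vocab: Sequence[str],
                rng: random.Random,
                select_prob: float = 0.15) -> Tuple[List[str], List[int]]:
    """Return (corrupted tokens, positions the model must predict).

    Rates are assumed from the original BERT recipe: each position is
    selected with probability select_prob; a selected token becomes [MASK]
    80% of the time, a random vocabulary word 10% of the time, and stays
    unchanged 10% of the time.
    """
    corrupted = list(tokens)
    targets: List[int] = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        targets.append(i)
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)
        # otherwise the original token is kept but still predicted
    return corrupted, targets


tokens = ["a", "b", "c", "d", "e", "f"]
vocab = ["x", "y", "z"]
corrupted, targets = mask_tokens(tokens, vocab, random.Random(0))
print(corrupted, targets)
```

Training then asks the model to recover the original token at every position in `targets`, including those left unchanged.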
The invention adopts the open-source pre-trained model BERT-base-Chinese released on Hugging Face, a BERT model trained on an uncased data set; it has 12 layers, 768 hidden units and 12 attention heads, and is suitable for tasks such as Chinese text classification.
The invention uses news data, sensitive-domain text data and sensitive-domain bad information as the training set and test set, and trains and tunes the model according to a loss function and evaluation indexes. A cross-entropy loss function is used during model tuning, expressed as follows:
wherein: $y_i$ denotes the label of sample $i$ (1 for the positive class, 0 for the negative class), and $p_i$ denotes the probability that sample $i$ is predicted to be positive.
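A numerical sketch of the loss, assuming the standard binary cross-entropy averaged over the samples (the expression itself is not preserved in this text). The clipping constant `eps` and the function name are additions for illustration:

```python
import math
from typing import Sequence


def binary_cross_entropy(labels: Sequence[int], probs: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over labels y_i in {0, 1} and
    predicted positive-class probabilities p_i."""
    if len(labels) != len(probs) or not labels:
        raise ValueError("labels and probs must be non-empty and equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


confident = binary_cross_entropy([1, 0], [0.9, 0.1])
hesitant = binary_cross_entropy([1, 0], [0.6, 0.4])
print(confident, hesitant)
```

Confident correct predictions give a smaller loss than hesitant ones, which is the signal fine-tuning uses to adjust the model.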
According to the experimental data of the embodiment, the textCNN-Bert fusion model proposed by the invention for sensitive information identification outperforms currently used models such as TextCNN, LSTM and Bert on all evaluation indexes. The experimental details are as follows:
three data sets were collected for the experiments:
the first data set comes from the whole-network news data of Sogou Labs, from which 10 categories were selected (automobiles, science and technology, health, sports, real estate, education, travel, culture, IT and fashion), each containing about 2000 texts;
the second data set is sensitive-domain data;
the third data set is bad information in the sensitive domain.
After manual processing and labeling, the data set comprises 820,000 sentences of non-sensitive-domain data, 730,000 sentences of sensitive-domain general information and 780,000 sentences of sensitive-domain bad information; the labeled data set is divided into a training set, a validation set and a test set in the ratio 6:2:2.
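A 6:2:2 split can be sketched as a shuffle followed by slicing. The text does not say whether the split was shuffled or stratified by class; the seeded shuffle and the function name below are assumptions:

```python
import random
from typing import Iterable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Iterable[T],
                  ratios: Sequence[int] = (6, 2, 2),
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split into (train, validation, test) by integer ratios."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(10))
print(len(train), len(val), len(test))
```

For a data set with three labeled groups, splitting each group separately and concatenating the parts would keep the class proportions equal across the three sets.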
To verify the effectiveness of the bad information text recognition method based on the textCNN-Bert fusion model, TextCNN, LSTM and BERT are selected as baseline models; the experimental environment and model parameters are set as shown in Tables 1 and 2:
Table 1. TextCNN model parameters
Table 2. Bert model parameters
The evaluation indexes used in the experiment are Accuracy, Precision, Recall and F1-score;
accuracy is the proportion of all correctly predicted samples to the total number of samples, and is calculated as:
recall is the proportion of correctly predicted positive samples to all actual positive samples, and is calculated as:
precision is the proportion of actually positive samples among the samples predicted to be positive, and is calculated as:
the F1 value combines precision and recall: it gives Pre and Rec equal weight and is their harmonic mean. It is generally used as a comprehensive evaluation index, and a higher F1 value indicates better model performance. It is calculated as:
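The four expressions are not reproduced in this text. Under the usual confusion-matrix definitions they can be computed as in the sketch below, where TP, FP, TN and FN (true and false positives and negatives) are notation assumed here, and returning 0.0 for an empty denominator is a convention chosen for the sketch:

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0  # avoid division by zero

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    # Harmonic mean of precision and recall.
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = evaluate(y_true, y_pred)
print(scores)
```

Precision falls when non-bad text is flagged as bad information, while recall falls when bad information is missed, so the two indexes capture different kinds of error.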
Based on the above calculations, the recognition performance of each model is compared in Table 3:
Table 3. Comparison of model recognition performance
As the comparison shows, the textCNN-Bert fusion model proposed by the invention outperforms the classification models TextCNN, LSTM and Bert on every evaluation index. For TextCNN and LSTM, Precision is clearly lower than the other indexes, because these standalone models cannot understand deep semantics and therefore judge general information of the sensitive field to be bad information. The Bert model used alone also scores lower than the fusion model, because it is insensitive to the specialized vocabulary of the sensitive field and cannot make correct judgments, so that some irrelevant content is also judged to be bad information of the sensitive field.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (4)

1. A bad information identification method based on a textCNN-Bert fusion model algorithm, characterized in that the method comprises the following information identification steps:
step one: preprocessing the text to be recognized by word segmentation, part-of-speech tagging and stop-word removal, and inputting the preprocessed text as a sequence $X = \{x_1, x_2, x_3, \ldots, x_n\}$ $(n > 0)$ into the textCNN-Bert fusion model for recognition processing;
step two: inputting the preprocessed text into the sensitive-domain topic identification module of the textCNN-Bert fusion model for processing: if the sensitive-domain topic is identified as false, the text is judged to be unrelated to the sensitive domain and is output as non-sensitive text information;
if the sensitive-domain topic is identified as true, the text is passed to the sensitive emotion metaphor identification module for further judgment;
the sensitive-domain topic identification module identifies the preprocessed text by the following specific steps:
step 2.1: establishing a sensitive-domain topic recognition model, and fine-tuning an already trained Word2Vec model on a sensitive-domain lexicon to obtain word vectors better suited to the domain;
step 2.1.1: preparing sensitive domain corpus and public large-scale corpus;
step 2.1.2: inputting the large-scale public corpus into a general Word2Vec model for training to obtain general word vector representations;
step 2.1.3: constructing a domain word stock based on the technical terms and common words of the acquired sensitive domain;
step 2.1.4: performing fine tuning update on the word vectors related to the field;
step 2.2: inputting the fine-tuned word vectors into a TextCNN convolutional neural network model for further processing:
the word vectors are input into the first layer of the model, the input layer, which receives the input text sequence and converts it into word embedding vectors, one vector per word; the vectors are arranged in sequence order to form a matrix, which is passed to the second layer;
the second layer is a convolution layer, in which the input text matrix is convolved with a plurality of convolution kernels of different sizes to extract local features of the text, and the result is passed to the third layer;
the third layer is a pooling layer, which compresses the dimension of the feature maps and extracts the important features, and the result is passed to the fourth layer;
the fourth layer is a fully connected layer: the output of the pooling layer is connected to one or more fully connected layers, which learn the relations between features and perform the final classification; finally, the output layer outputs one of two categories, sensitive field text or non-sensitive field text;
step three: inputting the preprocessed text into an emotion metaphor recognition module in a textCNN-Bert fusion model for processing:
if emotion metaphor identification returns true, the text is judged to be sensitive text information and is output;
if emotion metaphor identification returns false, the text is judged to be non-sensitive text information and is output;
step four: outputting the bad information judgment result based on the judgments made in step two and step three.
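The control flow of steps two to four can be sketched as a two-stage cascade in which the emotion metaphor module is consulted only when the topic module reports a sensitive field topic. This is a minimal sketch, not the implementation of the invention: the function name `identify_bad_information` and the two callables standing in for the TextCNN topic module and the Bert emotion metaphor module are hypothetical.

```python
from typing import Callable

SENSITIVE = "sensitive text information"
NON_SENSITIVE = "non-sensitive text information"


def identify_bad_information(
    text: str,
    is_sensitive_topic: Callable[[str], bool],   # stand-in for the TextCNN topic module
    is_emotion_metaphor: Callable[[str], bool],  # stand-in for the Bert metaphor module
) -> str:
    """Two-stage cascade following steps two to four of claim 1."""
    # Step two: a text outside the sensitive field is output immediately.
    if not is_sensitive_topic(text):
        return NON_SENSITIVE
    # Step three: only sensitive field text reaches the second module.
    if is_emotion_metaphor(text):
        return SENSITIVE
    # Sensitive field topic but no emotion metaphor: general information.
    return NON_SENSITIVE
```

Because the first stage short-circuits, text that is unrelated to the sensitive field never reaches the second, more expensive model in this sketch.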
2. The bad information identification method based on the textCNN-Bert fusion model algorithm according to claim 1, characterized in that the specific method for constructing the domain word stock in step 2.1.3 is as follows:
a sensitive domain word stock is constructed using the TF-POS algorithm: domain words are obtained by combining word-frequency statistics with part-of-speech analysis; the frequency with which a word occurs in sensitive domain text is calculated to judge the correlation between the word and sensitive transaction, and the calculation formula of the word-frequency statistic is as follows:
and the expression specifying the parts of speech of the field is:
POS={nr,ns,nt,nz,j};
wherein: nr represents a person name, ns represents a place name, nt represents an organization group name, and j represents an abbreviation.
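The TF-POS selection can be sketched as follows. The word-frequency formula itself is not reproduced in this text, so the sketch assumes the common definition, occurrences of a word divided by the total number of tokens in the sensitive domain corpus. The function name `build_domain_lexicon`, the `min_tf` threshold and the pre-tagged `(word, pos)` input format are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Part-of-speech tags listed in the claim.
DOMAIN_POS = {"nr", "ns", "nt", "nz", "j"}


def build_domain_lexicon(
    tagged_tokens: Iterable[Tuple[str, str]],
    min_tf: float,
) -> Dict[str, float]:
    """Keep words whose tag is in DOMAIN_POS and whose term frequency >= min_tf.

    Assumption: TF(word) = occurrences of word / total tokens in the corpus.
    """
    tokens = list(tagged_tokens)
    total = len(tokens)
    if total == 0:
        return {}
    # Count only occurrences that carry one of the domain part-of-speech tags.
    counts = Counter(word for word, pos in tokens if pos in DOMAIN_POS)
    return {
        word: count / total
        for word, count in counts.items()
        if count / total >= min_tf
    }
```

For example, with the tagged tokens `[("alpha", "nz"), ("alpha", "nz"), ("run", "v"), ("beijing", "ns")]` and `min_tf=0.3`, only `alpha` is kept: `run` fails the part-of-speech filter and `beijing` falls below the frequency threshold.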
3. The bad information identification method based on the textCNN-Bert fusion model algorithm according to claim 1, characterized in that the specific steps by which the emotion metaphor recognition module recognizes the preprocessed text in step three are as follows:
step 3.1: identifying semantic metaphors in the sensitive text with a Bert pre-trained language model, specifically by inputting the preprocessed text of the content to be identified to the input end of the Bert model, wherein the expression is as follows:
X = {x_1, x_2, x_3, …, x_n} (n ≥ 0);
the identification process of the model is as follows:
step 3.1.1: inputting a sequence word vector of the text;
step 3.1.2: extracting semantic information from the text through a Transformer encoding layer, wherein the layer consists of a plurality of Transformer blocks, and each block consists of a multi-head self-attention mechanism and a feedforward neural network;
step 3.1.3: extracting deep semantic information through a Bert pre-training task layer;
step 3.1.4: classifying the text through a Softmax function, and outputting the judgment result as one of two labels, bad information or general information;
step 3.2: fine-tuning the Bert model, that is, further training the model on the basis of the pre-training stage to adapt it to the specific task; specifically, non-sensitive field text, general information of the sensitive field and bad information of the sensitive field are input as the training set and the test set, and the model is trained and tuned according to a loss function and evaluation indexes; a cross-entropy loss function is adopted for model fine-tuning, and its calculation formula is as follows:
wherein: y_i represents the label of sample i, which is 1 for the positive class and 0 for the negative class, and p_i represents the probability that sample i is predicted to be the positive class.
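The loss formula is not reproduced in this text. The definitions of y_i and p_i are consistent with the standard binary cross-entropy, L = -(1/N) * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ], which the sketch below computes. Averaging over the N samples and clamping probabilities with `eps` are assumptions of this sketch, and the function name is hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(
    labels: Sequence[int],
    probs: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Mean binary cross-entropy over N samples.

    labels[i] is y_i (1 = positive class, 0 = negative class);
    probs[i] is p_i, the predicted probability of the positive class.
    """
    if len(labels) != len(probs) or len(labels) == 0:
        raise ValueError("labels and probs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(labels, probs):
        # Clamp p away from 0 and 1 so that log() stays finite (sketch assumption).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

A model that outputs 0.5 for every sample incurs a loss of log 2 per sample, and the loss approaches zero as the predicted probabilities approach the true labels.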
4. The bad information identification method based on the textCNN-Bert fusion model algorithm according to claim 3, characterized in that the pre-training of the Bert model used in step 3.1 comprises an MLM stage and an NSP stage:
in the MLM stage, a piece of text is input into the BERT model with some of its words replaced by [MASK] or by other random words, and the objective of the model is to predict the replaced words;
in the NSP stage, two sentences are input into the BERT model, which predicts whether the two sentences are consecutive, so that the model learns the relationship between the two sentences.
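The MLM corruption step can be sketched as below. The claim only states that some words are replaced by [MASK] or by random words; the selection rate of 15% and the 80% / 10% / 10% split between [MASK], a random word and the unchanged word follow the original BERT recipe and are assumptions here. The function name `mask_tokens` and the `vocab` argument are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    select_prob: float = 0.15,
    seed: int = 0,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted tokens, labels); labels[i] is None where no prediction is needed."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original word here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word unchanged
    return corrupted, labels
```

The model is trained to predict `labels[i]` at every position where it is not `None`; all other positions are excluded from the MLM loss.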
CN202310832134.9A 2023-07-07 2023-07-07 Bad information identification method based on textCNN-Bert fusion model algorithm Pending CN116796740A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310832134.9A CN116796740A (en) 2023-07-07 2023-07-07 Bad information identification method based on textCNN-Bert fusion model algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310832134.9A CN116796740A (en) 2023-07-07 2023-07-07 Bad information identification method based on textCNN-Bert fusion model algorithm

Publications (1)

Publication Number Publication Date
CN116796740A true CN116796740A (en) 2023-09-22

Family

ID=88036534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310832134.9A Pending CN116796740A (en) 2023-07-07 2023-07-07 Bad information identification method based on textCNN-Bert fusion model algorithm

Country Status (1)

Country Link
CN (1) CN116796740A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118094634A (en) * 2024-04-17 2024-05-28 数据空间研究院 Privacy compliance method for unstructured text data
CN118193720A (en) * 2024-05-16 2024-06-14 四川易景智能终端有限公司 Sensitive text filtering method based on device-edge-cloud collaboration

Similar Documents

Publication Publication Date Title
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN112231472B (en) Judicial public opinion sensitive information identification method integrated with domain term dictionary
CN116796740A (en) Bad information identification method based on textCNN-Bert fusion model algorithm
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
CN113223509B (en) Fuzzy statement identification method and system applied to multi-person mixed scene
CN112733533A (en) Multi-mode named entity recognition method based on BERT model and text-image relation propagation
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN109670050A (en) A kind of entity relationship prediction technique and device
CN113191148A (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN113220890A (en) Deep learning method combining news headlines and news long text contents based on pre-training
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114691864A (en) Text classification model training method and device and text classification method and device
CN112270187A (en) Bert-LSTM-based rumor detection model
CN110008699A (en) A kind of software vulnerability detection method neural network based and device
Suyanto Synonyms-based augmentation to improve fake news detection using bidirectional LSTM
CN112417132A (en) New intention recognition method for screening negative samples by utilizing predicate guest information
CN115238697A (en) Judicial named entity recognition method based on natural language processing
CN113569553A (en) Sentence similarity judgment method based on improved Adaboost algorithm
CN114117041B (en) Attribute-level emotion analysis method based on specific attribute word context modeling
CN117591648A (en) Power grid customer service co-emotion dialogue reply generation method based on emotion fine perception
Wu et al. One improved model of named entity recognition by combining BERT and BiLSTM-CNN for domain of Chinese railway construction
Haque et al. Hadith authenticity prediction using sentiment analysis and machine learning
CN115293133A (en) Vehicle insurance fraud behavior identification method based on extracted text factor enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination