CN113065348A - Method for monitoring negative information of internet based on Bert model - Google Patents

Method for monitoring negative information of internet based on Bert model

Info

Publication number
CN113065348A
Authority
CN
China
Prior art keywords
model
bert
text
lstm
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110257490.3A
Other languages
Chinese (zh)
Other versions
CN113065348B (en)
Inventor
张涛
曲昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110257490.3A priority Critical patent/CN113065348B/en
Publication of CN113065348A publication Critical patent/CN113065348A/en
Application granted granted Critical
Publication of CN113065348B publication Critical patent/CN113065348B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24317 Piecewise classification, i.e. whereby each classification requires several discriminant rules
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Bert-model-based method for monitoring negative information on the internet. Crawler technology is used to collect data from sources such as Tieba (post bars), forums and microblogs, and data preprocessing is completed. A Bert environment is then built and feature extraction with the Bert model is completed. A preliminary judgment is that, because the test-set and training-set corpora are close in domain and topic, the contextual word vectors learned on the training set transfer well to the test set. The training-set and test-set data are drawn from the same time frame, so the test set should contain few unencoded neologisms; the word2vec + LSTM model therefore also performs well. In terms of prediction execution efficiency, the Bert feature extraction method requires building a bert-as-service server environment and obtaining the encoding of the webpage text payload through web-service calls, which adds interaction steps and complexity; this is a weakness of the method.

Description

Method for monitoring negative information of internet based on Bert model
Technical Field
The invention belongs to the technical field of internet public opinion monitoring, and particularly relates to the BERT natural language processing algorithm, the TF-IDF bag-of-words model, and word vectors constructed with word2vec.
Background
The internet is an important medium through which people obtain information. It allows information to be exchanged without spatial limits, expanding how people communicate, broadening their horizons, and enriching their knowledge. However, the internet also carries undesirable content, most commonly pornography, gambling, and drug-related material. On one hand, such content acts like spiritual opium: it poisons and erodes the growth of teenagers and leads many ordinary people to sink into vulgarity. On the other hand, websites carrying such content are often hosted on overseas cloud hosts or servers; when domestic users visit them, a large volume of cross-border traffic passes through the gateway bureaus, occupying outbound bandwidth resources and causing substantial settlement costs for operators.
In the conventional approach, a so-called blacklist library is maintained through user crowdsourcing (for example, rewarded reporting) together with a large amount of manual review, such as dedicated reviewers of pornographic websites. The operator then intercepts the URLs in the blacklist library to block the undesirable content. However, such websites frequently change domain names, replace webpage background images, or modify part of the text so as to evade blacklist review and filtering, and doing so is fast and cheap.
Therefore, operators want to automatically identify, by means of intelligent algorithms, new URLs in network traffic that contain undesirable content. In general a website contains text, images, and video/audio; different intelligent algorithms can perform identification based on these three types of information source. This document mainly discusses an intelligent detection method based on the text content of the website.
Disclosure of Invention
The method detects undesirable website information mainly by relying on the key technical characteristics of the Bert model and, of its two different application modes, by using the feature extraction method.
The technical scheme provided by the invention is as follows. The Bert-model-based internet negative information monitoring method first uses crawler technology to collect data from sources such as Tieba (post bars), forums and microblogs and completes data preprocessing, and then builds a Bert environment and completes Bert-model feature extraction. The method specifically comprises the following steps:
Step a: detect a given website URL to determine whether it belongs to a webpage involving gambling-related material. Following general data-mining practice, the problem is abstracted as a binary classification problem: based on the input features, a model classifies each webpage as either gambling-related or not.
Step b: two more conventional text classification methods are selected as the control group: the TF-IDF model and the word2vec + LSTM model.
Step c: for comparison, three evaluation metrics commonly used in binary classification are selected.
Precision: P = TP / (TP + FP).
Recall: R = TP / (TP + FN).
F1 score: F1 = 2PR / (P + R).
The three metrics mean, respectively: among the samples predicted positive, how many are truly positive (with respect to the prediction result); among the truly positive samples, how many are predicted correctly (with respect to the original samples); and the harmonic mean of precision and recall, a combined evaluation of the two.
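As an illustration of step c, a minimal Python sketch of computing the three metrics (assuming scikit-learn is available; the patent does not specify the evaluation tooling, and the label vectors below are hypothetical):

    from sklearn.metrics import precision_score, recall_score, f1_score

    def evaluate(y_true, y_pred):
        # The three binary-classification metrics used for the comparison.
        p = precision_score(y_true, y_pred)   # P = TP / (TP + FP)
        r = recall_score(y_true, y_pred)      # R = TP / (TP + FN)
        f1 = f1_score(y_true, y_pred)         # F1 = 2PR / (P + R)
        return p, r, f1

    # Hypothetical labels: 1 = page with undesirable content, 0 = normal page
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
    print(evaluate(y_true, y_pred))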
Compared with the TF-IDF model method and the word2vec + LSTM model method, the Bert model has clear advantages:
(1) Compared with the word2vec + LSTM model, which also preserves contextual semantics using plain word vectors, the Bert feature extraction method yields slightly better, though not markedly better, evaluation metrics. The preliminary judgment is that this is mainly because the test-set and training-set corpora are close in domain and topic, so the contextual word vectors learned on the training set suit the test set well. The training-set and test-set data are drawn from the same time frame, so the test set should contain few unencoded neologisms. For these reasons the word2vec + LSTM model also performs well.
(2) The Bert feature extraction method does not need to train embeddings itself, so its training efficiency is high. In terms of prediction execution efficiency, however, it requires building a bert-as-service server environment and obtaining the encoding of the webpage text payload through web-service calls, which adds interaction steps and complexity; this is a weakness of the method.
Drawings
FIG. 1 is a diagram of a Bert model network architecture;
FIG. 2 is an input representation of the Bert model.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
Step 1: according to the provided gambling website list, a blacklist website sample with nearly thousand Chinese content pages as main parts is obtained. Selecting Chinese page websites from the websites, and sampling to obtain white list website samples.
Use a crawler tool to crawl the home-page HTML content of the black and white sample websites. In a Python environment, use the BeautifulSoup webpage parsing toolkit to filter out HTML tags, JavaScript scripts and other information irrelevant to the actual subject content of the page, keeping only Chinese characters and punctuation as the text payload.
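A minimal sketch of this preprocessing step, assuming requests is used to fetch the page (the helper name, regular expression and URL are illustrative; the patent only specifies BeautifulSoup and keeping Chinese characters and punctuation):

    import re
    import requests
    from bs4 import BeautifulSoup

    def extract_text_payload(url):
        # Fetch the home page and keep only Chinese characters and punctuation.
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):   # drop content-free script/style nodes
            tag.decompose()
        visible = soup.get_text(separator=" ")
        # CJK characters plus CJK and fullwidth punctuation ranges.
        kept = re.findall(r"[\u4e00-\u9fff\u3000-\u303f\uff00-\uffef]+", visible)
        return "".join(kept)

    payload = extract_text_payload("http://example.com")   # hypothetical sample URL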
Step 2: building a Bert environment, wherein the requirement of the Bert as service operation environment is as follows: python > -3.5 and Tensorflow > -1.10.
Deploy the server-side and client-side tools of bert-as-service. After installation is finished, start the bert-as-service service.
Step 3: the client calls the bert-as-service method, taking a webpage text payload as a sentence unit. After receiving the text sentence, the server performs fixed-length encoding on the sentence and returns it to the client, thereby extracting the text sequence features of the black and white samples.
The bert-as-service sentence encoding has a fixed length, 768 dimensions by default.
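A minimal sketch of steps 2 and 3 with the bert-as-service client API (the pretrained Chinese model directory and server options are assumptions; the patent only states the Python/TensorFlow requirements and the default 768-dimensional sentence codes):

    # Server side, once per host (shell):
    #   pip install bert-serving-server bert-serving-client
    #   bert-serving-start -model_dir /path/to/chinese_L-12_H-768_A-12 -num_worker=1
    from bert_serving.client import BertClient

    bc = BertClient()                 # connects to the local bert-as-service server
    payloads = ["某网页的中文文本载荷", "另一条网页文本载荷"]   # webpage text payloads as sentence units
    features = bc.encode(payloads)    # ndarray of shape (n_samples, 768)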
Step 4: a certain proportion of the samples (set to 80% here) is randomly selected as the model training set, and the remainder is used as an independent test set.
Step 5: in a Python environment, select the XGBoost ensemble classification algorithm, take the 768-dimensional sentence codes as input features, and train the classification model on the training-set data.
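A minimal sketch of steps 4 and 5 (the XGBoost hyperparameters and random seed are illustrative assumptions; the patent only specifies an 80/20 split and XGBoost on the 768-dimensional sentence codes):

    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # features: (n_samples, 768) sentence codes from bert-as-service; labels: 1 = blacklist, 0 = whitelist
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42)

    clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)      # evaluated with the precision/recall/F1 metrics above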
Step 6: design a control-group experiment, where the control group consists of the TF-IDF model method and the word2vec + LSTM model method.
Step 7: segment the webpage text payload with the jieba word segmentation tool in a Python environment.
Extract the TF-IDF statistical features of the webpage text payload with the TF-IDF algorithm packaged in the gensim toolkit, truncating the dictionary appropriately to avoid high-dimensional features. Use the statistical features as classifier input and build a classification model with the XGBoost ensemble classification algorithm.
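A minimal sketch of step 7 (the truncation value keep_n=3000 is an illustrative assumption; the patent only calls for jieba segmentation, gensim TF-IDF and an appropriately truncated dictionary):

    import jieba
    from gensim import corpora, models
    from gensim.matutils import corpus2dense

    # texts: list of webpage text payloads (strings)
    tokenized = [list(jieba.cut(t)) for t in texts]

    dictionary = corpora.Dictionary(tokenized)
    dictionary.filter_extremes(keep_n=3000)            # truncate the dictionary to limit feature dimension
    bow = [dictionary.doc2bow(doc) for doc in tokenized]

    tfidf = models.TfidfModel(bow)
    tfidf_corpus = [tfidf[doc] for doc in bow]
    X = corpus2dense(tfidf_corpus, num_terms=len(dictionary)).T   # dense matrix for the XGBoost classifier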
Step 8: likewise, first segment the webpage text payload using the same tools and methods as above.
Next comes word2vec word-vector embedding learning and construction of the LSTM neural network. To simplify implementation, the modular neural-network library Keras is used. Keras makes full use of TensorFlow's general computing power and encapsulates word-vector embeddings and the various neural-network units, including LSTM, reducing programming overhead so that attention can stay on the deep learning model itself.
The segmented webpage text payload is vectorized by an embedding layer and passed into an LSTM layer for context learning; a fully connected Dense layer then reduces the LSTM output to the number of target classes (2), and using sigmoid as the activation function yields the probability distribution of the input text payload over the two classes, completing the classification model.
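A minimal Keras sketch of this control model (the vocabulary size, embedding dimension and LSTM width are illustrative assumptions; the patent specifies embedding, LSTM, a Dense layer mapped to the two classes and a sigmoid activation, for which the single sigmoid output unit below is the equivalent binary formulation):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    VOCAB_SIZE, EMBED_DIM = 50000, 128    # assumed values

    model = Sequential([
        Embedding(VOCAB_SIZE, EMBED_DIM),   # word-vector embedding of the segmented payload
        LSTM(64),                           # context learning over the token sequence
        Dense(1, activation="sigmoid"),     # probability of the positive (undesirable) class
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(X_seq_train, y_train, validation_split=0.1, epochs=5, batch_size=32)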
To verify the effectiveness and feasibility of the invention, a simulation experiment was performed on approximately 2000 sets of crawled data; the experimental results are as follows. FIG. 1 shows the classification model input data.
FIG. 1 Classification model input data (table image not reproduced)
After the models are trained, predictions and evaluations are made on the same independent test set for the Bert feature extraction method and the two text classification methods in the control group.
The model-effect evaluation metrics on the independent test set are shown in FIG. 2.
FIG. 2 Comparison of model-effect evaluation on the test set (table image not reproduced)
The comparison experiment shows that, with the Bert feature extraction method, all model-effect metrics improve substantially over the traditional TF-IDF model, which is probably closely related to the fact that the Bert model preserves the contextual information of the text well. It also shows that the feature extraction approach is effective for extracting feature information from webpage text content.

Claims (2)

1. The internet negative information monitoring method based on the Bert model, characterized in that it comprises the following steps:
step 1: obtaining blacklist website samples, consisting mainly of nearly a thousand Chinese content pages, according to the provided gambling website list; selecting Chinese-language page websites and sampling them to obtain whitelist website samples;
crawling the home-page HTML content of the black and white sample websites with a crawler tool; using the BeautifulSoup webpage parsing toolkit in a Python environment to filter out HTML tags, JavaScript scripts and other information irrelevant to the actual subject content of the webpage, keeping only Chinese characters and punctuation as the text payload;
step 2: building the Bert environment, wherein the bert-as-service runtime requires Python >= 3.5 and TensorFlow >= 1.10;
deploying the server-side and client-side tools of bert-as-service; after installation is finished, starting the bert-as-service service;
step 3: the client calls the bert-as-service method, taking a webpage text payload as a sentence unit; after receiving the text sentence, the server performs fixed-length encoding on the sentence and returns it to the client, thereby extracting the text sequence features of the black and white samples;
the bert-as-service sentence encoding having a fixed length, 768 dimensions by default;
step 4: randomly selecting 80% of the samples as the model training set and using the remainder as an independent test set;
step 5: in a Python environment, selecting the XGBoost ensemble classification algorithm, taking the 768-dimensional sentence codes as input features, and training a classification model on the training-set data;
step 6: designing a control-group experiment, wherein the control group consists of the TF-IDF model method and the word2vec + LSTM model method;
step 7: segmenting the webpage text payload with the jieba word segmentation tool in a Python environment;
extracting the TF-IDF statistical features of the webpage text payload with the TF-IDF algorithm packaged in the gensim toolkit, and truncating the dictionary appropriately to avoid high-dimensional features; using the statistical features as classifier input and building a classification model with the XGBoost ensemble classification algorithm;
step 8: segmenting the webpage text payload, then performing word2vec word-vector embedding learning and building the LSTM neural network.
2. The internet negative information monitoring method based on the Bert model according to claim 1, wherein: in step 8, the modular neural-network library Keras framework is used; Keras makes full use of TensorFlow's general computing capability, encapsulates word-vector embeddings and various neural-network units including LSTM, and reduces programming overhead;
the segmented webpage text payload is vectorized by an embedding layer and enters an LSTM layer for context learning; a fully connected Dense layer then reduces the LSTM output to the number of target classes, and using sigmoid as the activation function yields the probability distribution of the input text payload over the two classes, completing the classification model.
CN202110257490.3A 2021-03-09 2021-03-09 Internet negative information monitoring method based on Bert model Active CN113065348B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110257490.3A CN113065348B (en) 2021-03-09 2021-03-09 Internet negative information monitoring method based on Bert model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110257490.3A CN113065348B (en) 2021-03-09 2021-03-09 Internet negative information monitoring method based on Bert model

Publications (2)

Publication Number Publication Date
CN113065348A true CN113065348A (en) 2021-07-02
CN113065348B CN113065348B (en) 2024-04-16

Family

ID=76560023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110257490.3A Active CN113065348B (en) 2021-03-09 2021-03-09 Internet negative information monitoring method based on Bert model

Country Status (1)

Country Link
CN (1) CN113065348B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728153A (en) * 2019-10-15 2020-01-24 天津理工大学 Multi-category emotion classification method based on model fusion
CN111046941A (en) * 2019-12-09 2020-04-21 腾讯科技(深圳)有限公司 Target comment detection method and device, electronic equipment and storage medium
CN111209401A (en) * 2020-01-03 2020-05-29 西安电子科技大学 System and method for classifying and processing sentiment polarity of online public opinion text information
US20210034812A1 (en) * 2019-07-30 2021-02-04 Imrsv Data Labs Inc. Methods and systems for multi-label classification of text data
CN112445913A (en) * 2020-11-25 2021-03-05 重庆邮电大学 Financial information negative main body judgment and classification method based on big data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210034812A1 (en) * 2019-07-30 2021-02-04 Imrsv Data Labs Inc. Methods and systems for multi-label classification of text data
CN110728153A (en) * 2019-10-15 2020-01-24 天津理工大学 Multi-category emotion classification method based on model fusion
CN111046941A (en) * 2019-12-09 2020-04-21 腾讯科技(深圳)有限公司 Target comment detection method and device, electronic equipment and storage medium
CN111209401A (en) * 2020-01-03 2020-05-29 西安电子科技大学 System and method for classifying and processing sentiment polarity of online public opinion text information
CN112445913A (en) * 2020-11-25 2021-03-05 重庆邮电大学 Financial information negative main body judgment and classification method based on big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
史振杰; 董兆伟; 庞超逸; 张百灵; 孙立辉: "Sentiment analysis of e-commerce reviews based on BERT-CNN", Intelligent Computer and Applications, no. 02, 1 February 2020 (2020-02-01) *
吴俊; 程垚; 郝瀚; 艾力亚尔・艾则孜; 刘菲雪; 苏亦坡: "Research on Chinese technical term extraction based on a BERT-embedded BiLSTM-CRF model", Journal of the China Society for Scientific and Technical Information, no. 04, 24 April 2020 (2020-04-24) *

Also Published As

Publication number Publication date
CN113065348B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN107516041B (en) WebShell detection method and system based on deep neural network
CN110012122B (en) Domain name similarity analysis method based on word embedding technology
CN107341399A (en) Assess the method and device of code file security
CN114422211B (en) HTTP malicious traffic detection method and device based on graph attention network
CN111401063B (en) Text processing method and device based on multi-pool network and related equipment
CN101470752A (en) Search engine method based on keyword resolution scheduling
CN114422271B (en) Data processing method, device, equipment and readable storage medium
Zhu et al. CCBLA: a lightweight phishing detection model based on CNN, BiLSTM, and attention mechanism
Pham et al. Exploring efficiency of GAN-based generated URLs for phishing URL detection
CN114154043A (en) Website fingerprint calculation method, system, storage medium and terminal
CN117112814A (en) False media content mining and identification system and identification method thereof
KR102483004B1 (en) Method for detecting harmful url
CN113065348B (en) Internet negative information monitoring method based on Bert model
CN108897739A (en) A kind of intelligentized application traffic identification feature automatic mining method and system
CN110149810B (en) Transmission system and method for limiting manipulation of content in a network environment and digital assistant device
CN114328818A (en) Text corpus processing method and device, storage medium and electronic equipment
CN113946823A (en) SQL injection detection method and device based on URL baseline deviation analysis
CN113688346A (en) Illegal website identification method, device, equipment and storage medium
Wan et al. Generation of malicious webpage samples based on GAN
CN113343142B (en) News click rate prediction method based on user behavior sequence filling and screening
Li et al. Research on Integrated Detection of SQL Injection Behavior Based on Text Features and Traffic Features
CN104484414A (en) Processing method and device of favourite information
CN108197142B (en) Method, device, storage medium and equipment for determining relevance of network transaction
CN117118760B (en) Threat perception method, device and storage medium for traffic forwarding based on pseudo network
CN117714130A (en) Network message detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant