CN113065348A - Method for monitoring negative information of internet based on Bert model - Google Patents

Method for monitoring negative information of internet based on Bert model

Info

Publication number
CN113065348A
Authority
CN
China
Prior art keywords
model
bert
text
lstm
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110257490.3A
Other languages
Chinese (zh)
Other versions
CN113065348B (en)
Inventor
张涛
曲昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110257490.3A priority Critical patent/CN113065348B/en
Publication of CN113065348A publication Critical patent/CN113065348A/en
Application granted granted Critical
Publication of CN113065348B publication Critical patent/CN113065348B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24317 Piecewise classification, i.e. whereby each classification requires several discriminant rules
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Bert-model-based method for monitoring negative information on the internet. Crawler technology is used to collect data from sources such as Tieba (post bars), forums and microblogs, and data preprocessing is completed. A Bert environment is then built and feature extraction with the Bert model is completed. A preliminary judgment is that, because the test-set and training-set corpora are close in domain and topic, the contextual word vectors learned on the training set transfer well to the test set. The training-set and test-set data are drawn from the same time frame, so the test set should contain few unencoded neologisms; the word2vec + LSTM model therefore also performs well. In terms of prediction execution efficiency, the Bert feature extraction method requires building a bert-as-service server environment and obtaining the encoding of the webpage text payload through web-service calls, which adds interaction steps and complexity; this is a weakness of the method.

Description

Method for monitoring negative information of internet based on Bert model
Technical Field
The invention belongs to the technical field of internet public opinion monitoring, and particularly relates to the BERT natural language processing algorithm, the TF-IDF bag-of-words model, and word vectors constructed with word2vec.
Background
The internet is an important medium through which people obtain information. It allows information to be exchanged without spatial limits, expanding how people communicate, broadening their horizons, and enriching their knowledge. However, the internet also carries undesirable content, most commonly pornography, gambling, and drug-related material. On one hand, such content acts like spiritual opium: it poisons and erodes the growth of teenagers and leads many ordinary people to sink into vulgarity. On the other hand, websites carrying such content are often hosted on overseas cloud hosts or servers; when domestic users visit them, a large volume of cross-border traffic passes through the gateway bureaus, occupying outbound bandwidth resources and causing substantial settlement costs for operators.
In the conventional approach, a so-called blacklist library is maintained through user crowdsourcing (for example, rewarded reporting) together with a large amount of manual review, such as dedicated reviewers of pornographic websites. The operator then intercepts the URLs in the blacklist library to block the undesirable content. However, such websites frequently change domain names, replace webpage background images, or modify part of the text so as to evade blacklist review and filtering, and doing so is fast and cheap.
Therefore, operators want to automatically identify, by means of intelligent algorithms, new URLs in network traffic that contain undesirable content. In general a website contains text, images, and video/audio; different intelligent algorithms can perform identification based on these three types of information source. This document mainly discusses an intelligent detection method based on the text content of the website.
Disclosure of Invention
The method detects undesirable website information mainly by relying on the key technical characteristics of the Bert model and, of its two different application modes, by using the feature extraction method.
The technical scheme provided by the invention is as follows. The Bert-model-based internet negative information monitoring method first uses crawler technology to collect data from sources such as Tieba (post bars), forums and microblogs and completes data preprocessing, and then builds a Bert environment and completes Bert-model feature extraction. The method specifically comprises the following steps:
Step a: detect a given website URL to determine whether it belongs to a webpage involving gambling-related material. Following general data-mining practice, the problem is abstracted as a binary classification problem: based on the input features, a model classifies each webpage as either gambling-related or not.
Step b: two more conventional text classification methods are selected as the control group: the TF-IDF model and the word2vec + LSTM model.
Step c: for comparison, three evaluation metrics commonly used in binary classification are selected.
Precision: P = TP / (TP + FP).
Recall: R = TP / (TP + FN).
F1 score: F1 = 2PR / (P + R).
The three metrics mean, respectively: among the samples predicted positive, how many are truly positive (with respect to the prediction result); among the truly positive samples, how many are predicted correctly (with respect to the original samples); and the harmonic mean of precision and recall, a combined evaluation of the two.
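As an illustration of step c, a minimal Python sketch of computing the three metrics (assuming scikit-learn is available; the patent does not specify the evaluation tooling, and the label vectors below are hypothetical):

    from sklearn.metrics import precision_score, recall_score, f1_score

    def evaluate(y_true, y_pred):
        # The three binary-classification metrics used for the comparison.
        p = precision_score(y_true, y_pred)   # P = TP / (TP + FP)
        r = recall_score(y_true, y_pred)      # R = TP / (TP + FN)
        f1 = f1_score(y_true, y_pred)         # F1 = 2PR / (P + R)
        return p, r, f1

    # Hypothetical labels: 1 = page with undesirable content, 0 = normal page
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
    print(evaluate(y_true, y_pred))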
Compared with the TF-IDF model method and the word2vec + LSTM model method, the Bert model has clear advantages:
(1) Compared with the word2vec + LSTM model, which also preserves contextual semantics using plain word vectors, the Bert feature extraction method yields slightly better, though not markedly better, evaluation metrics. The preliminary judgment is that this is mainly because the test-set and training-set corpora are close in domain and topic, so the contextual word vectors learned on the training set suit the test set well. The training-set and test-set data are drawn from the same time frame, so the test set should contain few unencoded neologisms. For these reasons the word2vec + LSTM model also performs well.
(2) The Bert feature extraction method does not need to train embeddings itself, so its training efficiency is high. In terms of prediction execution efficiency, however, it requires building a bert-as-service server environment and obtaining the encoding of the webpage text payload through web-service calls, which adds interaction steps and complexity; this is a weakness of the method.
Drawings
FIG. 1 is a diagram of a Bert model network architecture;
FIG. 2 is an input representation of the Bert model.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
Step 1: according to the provided gambling website list, a blacklist website sample with nearly thousand Chinese content pages as main parts is obtained. Selecting Chinese page websites from the websites, and sampling to obtain white list website samples.
Use a crawler tool to crawl the home-page HTML content of the black and white sample websites. In a Python environment, use the BeautifulSoup webpage parsing toolkit to filter out HTML tags, JavaScript scripts and other information irrelevant to the actual subject content of the page, keeping only Chinese characters and punctuation as the text payload.
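A minimal sketch of this preprocessing step, assuming requests is used to fetch the page (the helper name, regular expression and URL are illustrative; the patent only specifies BeautifulSoup and keeping Chinese characters and punctuation):

    import re
    import requests
    from bs4 import BeautifulSoup

    def extract_text_payload(url):
        # Fetch the home page and keep only Chinese characters and punctuation.
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):   # drop content-free script/style nodes
            tag.decompose()
        visible = soup.get_text(separator=" ")
        # CJK characters plus CJK and fullwidth punctuation ranges.
        kept = re.findall(r"[\u4e00-\u9fff\u3000-\u303f\uff00-\uffef]+", visible)
        return "".join(kept)

    payload = extract_text_payload("http://example.com")   # hypothetical sample URL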
Step 2: building a Bert environment, wherein the requirement of the Bert as service operation environment is as follows: python > -3.5 and Tensorflow > -1.10.
Deploy the server-side and client-side tools of bert-as-service. After installation is finished, start the bert-as-service service.
Step 3: the client calls the bert-as-service method, taking a webpage text payload as a sentence unit. After receiving the text sentence, the server performs fixed-length encoding on the sentence and returns it to the client, thereby extracting the text sequence features of the black and white samples.
The bert-as-service sentence encoding has a fixed length, 768 dimensions by default.
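A minimal sketch of steps 2 and 3 with the bert-as-service client API (the pretrained Chinese model directory and server options are assumptions; the patent only states the Python/TensorFlow requirements and the default 768-dimensional sentence codes):

    # Server side, once per host (shell):
    #   pip install bert-serving-server bert-serving-client
    #   bert-serving-start -model_dir /path/to/chinese_L-12_H-768_A-12 -num_worker=1
    from bert_serving.client import BertClient

    bc = BertClient()                 # connects to the local bert-as-service server
    payloads = ["某网页的中文文本载荷", "另一条网页文本载荷"]   # webpage text payloads as sentence units
    features = bc.encode(payloads)    # ndarray of shape (n_samples, 768)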
Step 4: a certain proportion of the samples (set to 80% here) is randomly selected as the model training set, and the remainder is used as an independent test set.
Step 5: in a Python environment, select the XGBoost ensemble classification algorithm, take the 768-dimensional sentence codes as input features, and train the classification model on the training-set data.
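A minimal sketch of steps 4 and 5 (the XGBoost hyperparameters and random seed are illustrative assumptions; the patent only specifies an 80/20 split and XGBoost on the 768-dimensional sentence codes):

    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # features: (n_samples, 768) sentence codes from bert-as-service; labels: 1 = blacklist, 0 = whitelist
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42)

    clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)      # evaluated with the precision/recall/F1 metrics above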
Step 6: design a control-group experiment, where the control group consists of the TF-IDF model method and the word2vec + LSTM model method.
Step 7: segment the webpage text payload with the jieba word segmentation tool in a Python environment.
Extract the TF-IDF statistical features of the webpage text payload with the TF-IDF algorithm packaged in the gensim toolkit, truncating the dictionary appropriately to avoid high-dimensional features. Use the statistical features as classifier input and build a classification model with the XGBoost ensemble classification algorithm.
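A minimal sketch of step 7 (the truncation value keep_n=3000 is an illustrative assumption; the patent only calls for jieba segmentation, gensim TF-IDF and an appropriately truncated dictionary):

    import jieba
    from gensim import corpora, models
    from gensim.matutils import corpus2dense

    # texts: list of webpage text payloads (strings)
    tokenized = [list(jieba.cut(t)) for t in texts]

    dictionary = corpora.Dictionary(tokenized)
    dictionary.filter_extremes(keep_n=3000)            # truncate the dictionary to limit feature dimension
    bow = [dictionary.doc2bow(doc) for doc in tokenized]

    tfidf = models.TfidfModel(bow)
    tfidf_corpus = [tfidf[doc] for doc in bow]
    X = corpus2dense(tfidf_corpus, num_terms=len(dictionary)).T   # dense matrix for the XGBoost classifier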
Step 8: likewise, first segment the webpage text payload using the same tools and methods as above.
Next comes word2vec word-vector embedding learning and construction of the LSTM neural network. To simplify implementation, the modular neural-network library Keras is used. Keras makes full use of TensorFlow's general computing power and encapsulates word-vector embeddings and the various neural-network units, including LSTM, reducing programming overhead so that attention can stay on the deep learning model itself.
The segmented webpage text payload is vectorized by an embedding layer and passed into an LSTM layer for context learning; a fully connected Dense layer then reduces the LSTM output to the number of target classes (2), and using sigmoid as the activation function yields the probability distribution of the input text payload over the two classes, completing the classification model.
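A minimal Keras sketch of this control model (the vocabulary size, embedding dimension and LSTM width are illustrative assumptions; the patent specifies embedding, LSTM, a Dense layer mapped to the two classes and a sigmoid activation, for which the single sigmoid output unit below is the equivalent binary formulation):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    VOCAB_SIZE, EMBED_DIM = 50000, 128    # assumed values

    model = Sequential([
        Embedding(VOCAB_SIZE, EMBED_DIM),   # word-vector embedding of the segmented payload
        LSTM(64),                           # context learning over the token sequence
        Dense(1, activation="sigmoid"),     # probability of the positive (undesirable) class
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(X_seq_train, y_train, validation_split=0.1, epochs=5, batch_size=32)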
To verify the effectiveness and feasibility of the invention, a simulation experiment was performed on approximately 2000 sets of crawled data; the experimental results are as follows. FIG. 1 shows the classification model input data.
FIG. 1 Classification model input data (table image not reproduced)
After the models are trained, predictions and evaluations are made on the same independent test set for the Bert feature extraction method and the two text classification methods in the control group.
The model-effect evaluation metrics on the independent test set are shown in FIG. 2.
FIG. 2 Comparison of model-effect evaluation on the test set (table image not reproduced)
The comparison experiment shows that, with the Bert feature extraction method, all model-effect metrics improve substantially over the traditional TF-IDF model, which is probably closely related to the fact that the Bert model preserves the contextual information of the text well. It also shows that the feature extraction approach is effective for extracting feature information from webpage text content.

Claims (2)

1. The internet negative information monitoring method based on the Bert model, characterized in that it comprises the following steps:
step 1: obtaining blacklist website samples, consisting mainly of nearly a thousand Chinese content pages, according to the provided gambling website list; selecting Chinese-language page websites and sampling them to obtain whitelist website samples;
crawling the home-page HTML content of the black and white sample websites with a crawler tool; using the BeautifulSoup webpage parsing toolkit in a Python environment to filter out HTML tags, JavaScript scripts and other information irrelevant to the actual subject content of the webpage, keeping only Chinese characters and punctuation as the text payload;
step 2: building the Bert environment, wherein the bert-as-service runtime requires Python >= 3.5 and TensorFlow >= 1.10;
deploying the server-side and client-side tools of bert-as-service; after installation is finished, starting the bert-as-service service;
step 3: the client calls the bert-as-service method, taking a webpage text payload as a sentence unit; after receiving the text sentence, the server performs fixed-length encoding on the sentence and returns it to the client, thereby extracting the text sequence features of the black and white samples;
the bert-as-service sentence encoding having a fixed length, 768 dimensions by default;
step 4: randomly selecting 80% of the samples as the model training set and using the remainder as an independent test set;
step 5: in a Python environment, selecting the XGBoost ensemble classification algorithm, taking the 768-dimensional sentence codes as input features, and training a classification model on the training-set data;
step 6: designing a control-group experiment, wherein the control group consists of the TF-IDF model method and the word2vec + LSTM model method;
step 7: segmenting the webpage text payload with the jieba word segmentation tool in a Python environment;
extracting the TF-IDF statistical features of the webpage text payload with the TF-IDF algorithm packaged in the gensim toolkit, and truncating the dictionary appropriately to avoid high-dimensional features; using the statistical features as classifier input and building a classification model with the XGBoost ensemble classification algorithm;
step 8: segmenting the webpage text payload, then performing word2vec word-vector embedding learning and building the LSTM neural network.
2. The internet negative information monitoring method based on the Bert model according to claim 1, wherein: in step 8, the modular neural-network library Keras framework is used; Keras makes full use of TensorFlow's general computing capability, encapsulates word-vector embeddings and various neural-network units including LSTM, and reduces programming overhead;
the segmented webpage text payload is vectorized by an embedding layer and enters an LSTM layer for context learning; a fully connected Dense layer then reduces the LSTM output to the number of target classes, and using sigmoid as the activation function yields the probability distribution of the input text payload over the two classes, completing the classification model.
CN202110257490.3A 2021-03-09 2021-03-09 Internet negative information monitoring method based on Bert model Active CN113065348B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110257490.3A CN113065348B (en) 2021-03-09 2021-03-09 Internet negative information monitoring method based on Bert model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110257490.3A CN113065348B (en) 2021-03-09 2021-03-09 Internet negative information monitoring method based on Bert model

Publications (2)

Publication Number Publication Date
CN113065348A true CN113065348A (en) 2021-07-02
CN113065348B CN113065348B (en) 2024-04-16

Family

ID=76560023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110257490.3A Active CN113065348B (en) 2021-03-09 2021-03-09 Internet negative information monitoring method based on Bert model

Country Status (1)

Country Link
CN (1) CN113065348B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728153A (en) * 2019-10-15 2020-01-24 天津理工大学 Multi-category emotion classification method based on model fusion
CN111046941A (en) * 2019-12-09 2020-04-21 腾讯科技(深圳)有限公司 Target comment detection method and device, electronic equipment and storage medium
CN111209401A (en) * 2020-01-03 2020-05-29 西安电子科技大学 System and method for classifying and processing sentiment polarity of online public opinion text information
US20210034812A1 (en) * 2019-07-30 2021-02-04 Imrsv Data Labs Inc. Methods and systems for multi-label classification of text data
CN112445913A (en) * 2020-11-25 2021-03-05 重庆邮电大学 Financial information negative main body judgment and classification method based on big data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210034812A1 (en) * 2019-07-30 2021-02-04 Imrsv Data Labs Inc. Methods and systems for multi-label classification of text data
CN110728153A (en) * 2019-10-15 2020-01-24 天津理工大学 Multi-category emotion classification method based on model fusion
CN111046941A (en) * 2019-12-09 2020-04-21 腾讯科技(深圳)有限公司 Target comment detection method and device, electronic equipment and storage medium
CN111209401A (en) * 2020-01-03 2020-05-29 西安电子科技大学 System and method for classifying and processing sentiment polarity of online public opinion text information
CN112445913A (en) * 2020-11-25 2021-03-05 重庆邮电大学 Financial information negative main body judgment and classification method based on big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
史振杰; 董兆伟; 庞超逸; 张百灵; 孙立辉: "Sentiment analysis of e-commerce reviews based on BERT-CNN", Intelligent Computer and Applications, no. 02, 1 February 2020 (2020-02-01) *
吴俊; 程垚; 郝瀚; 艾力亚尔・艾则孜; 刘菲雪; 苏亦坡: "Research on Chinese technical term extraction based on a BERT-embedded BiLSTM-CRF model", Journal of the China Society for Scientific and Technical Information, no. 04, 24 April 2020 (2020-04-24) *

Also Published As

Publication number Publication date
CN113065348B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN107516041B (en) WebShell detection method and system based on deep neural network
CN110012122B (en) Domain name similarity analysis method based on word embedding technology
CN107341399A (en) Assess the method and device of code file security
CN114422211B (en) HTTP malicious traffic detection method and device based on graph attention network
CN111401063B (en) Text processing method and device based on multi-pool network and related equipment
CN101470752A (en) Search engine method based on keyword resolution scheduling
CN114422271B (en) Data processing method, device, equipment and readable storage medium
Zhu et al. CCBLA: a lightweight phishing detection model based on CNN, BiLSTM, and attention mechanism
Pham et al. Exploring efficiency of GAN-based generated URLs for phishing URL detection
CN114154043A (en) Website fingerprint calculation method, system, storage medium and terminal
CN117112814A (en) False media content mining and identification system and identification method thereof
KR102483004B1 (en) Method for detecting harmful url
CN113065348B (en) Internet negative information monitoring method based on Bert model
CN108897739A (en) A kind of intelligentized application traffic identification feature automatic mining method and system
CN110149810B (en) Transmission system and method for limiting manipulation of content in a network environment and digital assistant device
CN114328818A (en) Text corpus processing method and device, storage medium and electronic equipment
CN113946823A (en) SQL injection detection method and device based on URL baseline deviation analysis
CN113688346A (en) Illegal website identification method, device, equipment and storage medium
Wan et al. Generation of malicious webpage samples based on GAN
CN113343142B (en) News click rate prediction method based on user behavior sequence filling and screening
Li et al. Research on Integrated Detection of SQL Injection Behavior Based on Text Features and Traffic Features
CN104484414A (en) Processing method and device of favourite information
CN108197142B (en) Method, device, storage medium and equipment for determining relevance of network transaction
CN117118760B (en) Threat perception method, device and storage medium for traffic forwarding based on pseudo network
CN117714130A (en) Network message detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant