CN109710825A

CN109710825A - Webpage harmful information identification method based on machine learning

Info

Publication number: CN109710825A
Application number: CN201811302974.XA
Authority: CN
Inventors: 张家亮; 卢江波; 张明亮; 贾宇
Original assignee: Chengdu 30kaitian Communication Industry Co ltd
Current assignee: Shenzhen Wanglian Anrui Network Technology Co ltd
Priority date: 2018-11-02
Filing date: 2018-11-02
Publication date: 2019-05-03

Abstract

The invention discloses a machine-learning-based method for identifying harmful information in webpages, comprising the following steps: S1: using a web crawler to crawl websites of known classification and collect the corpus used for machine learning training; S2: preprocessing the crawled corpus to generate a training set and a test set; S3: performing model training and model verification with a machine learning algorithm; S4: inputting a webpage to be screened, classifying its text with the model, and returning the classification result and accuracy. Building on machine learning, model training, and text classification technology, the method classifies and identifies crawled webpages, determines from the category of each identification result whether a webpage contains harmful information, and further determines whether the website contains harmful information.

Description

A webpage harmful information identification method based on machine learning

Technical field

The present invention relates to the technical field of webpage content identification, and in particular to a webpage harmful information identification method based on machine learning.

Background art

With the continued build-out of China's Internet infrastructure, the variety of website application services keeps growing. According to statistics, by the end of 2017 the number of websites in China had reached 5.2606 million, and the number of webpages is beyond counting. Webpages and application services have become an important channel through which people obtain news and information every day. Because of the particular nature of cyberspace, the content stored on a website is not easily known before it is accessed, so the hundreds of millions of webpages on network servers include no shortage of harmful content such as pornography, gambling, violence, and terrorism. The form and keywords of such harmful content also change frequently. If harmful content is left to spread unchecked, it is bound to have a severe social impact. How to screen webpage content for harmfulness effectively, while also meeting the performance requirements of massive data processing, has therefore become a problem in urgent need of a solution.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art and provide a webpage harmful information identification method based on machine learning, which obtains a corpus by crawler, trains an identification model, and then determines whether the webpage content to be screened contains harmful content.

The object of the present invention is achieved through the following technical solution: a webpage harmful information identification method based on machine learning, comprising the following steps:

S1: using a web crawler to crawl websites of known classification for the corpus used in machine learning training;

S2: preprocessing the crawled corpus to generate a training set and a test set;

S3: performing model training and model verification with a machine learning algorithm;

S4: inputting a webpage to be screened, classifying its text with the model, and returning the category result and accuracy.

Step S2 includes the following sub-steps:

S201: stripping the HTML and extracting the text information;

S202: rejecting low-quality text using a preset keyword set;

S203: generating the training set and the test set.

Step S2 further includes the following sub-steps:

S101: applying stop-word removal to remove words that are useless for training and occur frequently;

S102: filtering weakly related corpora out of the corpus using preset category keywords.

In step S3, corpus features are stored in a sparse matrix for model training and model verification.

The beneficial effects of the present invention are:

1) The machine-learning-based algorithm of the present invention provides strong support for the computation. Crawling known harmful information yields direct training material and ensures its authenticity and validity. Training the model provides a technical means of screening harmful information: through feature matching, text outside the training set can also be detected, and a machine-learning-based classification algorithm classifies the text to determine whether it is harmful.

2) Based on machine learning, model training, and text classification technology, the present invention classifies and identifies crawled webpages and, according to the category of each webpage identification result, screens whether a webpage contains harmful information and further judges whether the website contains harmful information.

3) Using a machine-learning-based text classification algorithm, webpage content to be identified can be classified quickly, so that harmful content is identified with high performance and high scalability.

Brief description of the drawings

Fig. 1 is a flow chart of the steps of the present invention;

Fig. 2 is a flow chart of webpage screening according to the present invention.

Specific embodiment

The technical solution of the present invention is described clearly and completely below with reference to the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Referring to Figs. 1-2, the present invention provides a technical solution: a webpage harmful information identification method based on machine learning, comprising the following steps:

S1: using a web crawler to crawl websites of known classification for the corpus used in machine learning training, that is, acquiring webpage content known to contain harmful information, where harmful information includes, for example, pornography, gambling, violence, and terrorism. The web crawler crawls the content of known harmful-information sites. With a machine-learning-based text classification algorithm, webpage content to be identified can then be classified quickly, so that harmful content is identified with high performance and high scalability.

Exploiting the characteristics of machine learning, feature values are computed for the webpage content to be identified instead of comparing text verbatim, which improves accuracy.
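By way of illustration only, the corpus collection of step S1 might be sketched in Python as follows. The function names crawl_labeled_sites and fetch_url, the seed-URL layout, and the UTF-8 decoding are assumptions made for this sketch; the source does not specify a crawler implementation.

```python
from typing import Callable, Dict, Iterable, List, Tuple
from urllib.request import urlopen


def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Download one page and decode it leniently (UTF-8 is an assumption)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_labeled_sites(
    seeds: Dict[str, Iterable[str]],
    fetch: Callable[[str], str] = fetch_url,
) -> List[Tuple[str, str, str]]:
    """Return (label, url, raw_html) records for every reachable seed URL.

    `seeds` maps a known category label (for example "gambling") to page
    URLs of websites already known to belong to that category.
    """
    corpus = []
    for label, urls in seeds.items():
        for url in urls:
            try:
                corpus.append((label, url, fetch(url)))
            except (OSError, ValueError):
                # Unreachable or malformed URLs are skipped here;
                # a production crawler would log and retry them.
                continue
    return corpus
```

Passing the fetch function as a parameter keeps the labelling logic separate from network access, so the same routine can be exercised with a stub during testing.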

S2: preprocessing the crawled corpus to generate a training set and a test set;

Step S2 includes the following sub-steps:

S201: stripping the HTML and extracting the text information, which simplifies the content and saves storage space;

S202: rejecting low-quality text using a preset keyword set, which makes the data more accurate;

S203: generating the training set and the test set.
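Sub-steps S201 to S203 can be sketched, under stated assumptions, with the Python standard library. The source does not define what makes a text low quality; this sketch assumes a text is rejected when it is very short or contains one of the preset keywords (for example an error-page phrase). The names strip_html, is_low_quality, and split_train_test, the 20-character minimum, and the 80/20 split are all hypothetical.

```python
import random
from html.parser import HTMLParser
from typing import Iterable, List, Sequence, Tuple


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_html(raw_html: str) -> str:
    """S201: drop the markup and keep the text content."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.parts)


def is_low_quality(text: str, preset_keywords: Iterable[str], min_chars: int = 20) -> bool:
    """S202 (assumed rule): too short, or contains a preset keyword."""
    lowered = text.lower()
    return len(text) < min_chars or any(k in lowered for k in preset_keywords)


def split_train_test(samples: Sequence, test_ratio: float = 0.2, seed: int = 0) -> Tuple[list, list]:
    """S203: shuffle reproducibly, then split into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

A fixed shuffle seed is used so that the same corpus always yields the same split, which makes verification results comparable between training runs.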

Step S2 further includes the following sub-steps:

S101: applying stop-word removal to remove words that are useless for training and occur frequently. In information retrieval, certain characters or words are filtered out automatically before or after natural language data (or text) is processed, which saves storage space and improves search efficiency.

S102: filtering weakly related corpora out of the corpus using preset category keywords, which improves subsequent accuracy.
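Sub-steps S101 and S102 might look as follows in Python. The sketch assumes the text has already been segmented into tokens (Chinese text would need a word segmenter, which is outside this example) and assumes that a sample counts as weakly related when it contains fewer than min_hits keywords of its own category. The function names and the min_hits parameter are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Sample = Tuple[str, List[str]]  # (category label, tokens)


def remove_stop_words(tokens: List[str], stop_words: Set[str]) -> List[str]:
    """S101: drop frequent words that carry no signal for training."""
    return [t for t in tokens if t not in stop_words]


def is_strongly_related(tokens: List[str], category_keywords: Set[str], min_hits: int = 1) -> bool:
    """S102 (assumed rule): enough tokens match the category's keywords."""
    hits = sum(1 for t in tokens if t in category_keywords)
    return hits >= min_hits


def filter_corpus(
    samples: List[Sample],
    stop_words: Set[str],
    keywords_by_label: Dict[str, Set[str]],
    min_hits: int = 1,
) -> List[Sample]:
    """Apply S101 to every sample, then keep only strongly related ones."""
    kept = []
    for label, tokens in samples:
        tokens = remove_stop_words(tokens, stop_words)
        if is_strongly_related(tokens, keywords_by_label.get(label, set()), min_hits):
            kept.append((label, tokens))
    return kept
```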

S3: performing model training and model verification with a machine learning algorithm. A classification model is trained on the training set with a machine learning algorithm, and the test set is used to verify the accuracy of the model, yielding a model of acceptable accuracy together with the features of each category. When a new identification category is added, only the identification model needs to be retrained, which is convenient and fast. With a machine learning algorithm, identification results can also be used to train the model incrementally, which improves its subsequent identification accuracy. Harmful content outside the training material can be discovered as well, which in turn makes it possible to crawl a wider range of harmful information, so that webpages can be further optimized.
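The source does not name the machine learning algorithm. As one possible instance, the sketch below uses a multinomial naive Bayes classifier with Laplace smoothing, chosen here only because its count-based state makes both verification and incremental training easy to show. The class name NaiveBayesTextClassifier, the partial_fit and predict methods, and the accuracy helper are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Iterable, List, Tuple

Sample = Tuple[str, List[str]]  # (category label, tokens)


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over token counts (an assumed algorithm)."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha                        # Laplace smoothing constant
        self.doc_counts = Counter()               # documents seen per label
        self.token_counts = defaultdict(Counter)  # token counts per label
        self.vocabulary = set()

    def partial_fit(self, samples: Iterable[Sample]) -> "NaiveBayesTextClassifier":
        """Add samples to the counts; calling it again trains incrementally."""
        for label, tokens in samples:
            self.doc_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens: List[str]) -> Tuple[str, float]:
        """Return (best label, posterior probability of that label)."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocabulary:  # tokens never seen in training are ignored
                    score += math.log((counts[tok] + self.alpha) / denom)
            log_scores[label] = score
        best = max(log_scores, key=log_scores.get)
        # Normalise in log space to avoid underflow on long documents.
        norm = sum(math.exp(s - log_scores[best]) for s in log_scores.values())
        return best, 1.0 / norm


def accuracy(model: NaiveBayesTextClassifier, test_set: List[Sample]) -> float:
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(1 for label, tokens in test_set if model.predict(tokens)[0] == label)
    return correct / len(test_set)
```

Because the model state is nothing more than counts, feeding newly confirmed identification results back through partial_fit updates the model without revisiting the original corpus.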

In step S3, corpus features are stored in a sparse matrix for model training and model verification, which increases computation speed and saves storage space.
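To illustrate why sparse storage helps, the sketch below builds a term-count matrix in a compressed sparse row style layout using plain Python lists. The source does not say which sparse format is used; the layout, the function names build_csr and row_dot, and the example weights are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def build_csr(docs: Sequence[Sequence[str]]) -> Tuple[Dict[str, int], List[int], List[int], List[int]]:
    """Store term counts row by row, keeping only the non-zero entries.

    Returns (vocab, data, indices, indptr): row i occupies the slice
    data[indptr[i]:indptr[i + 1]], and indices holds the column of each value.
    """
    vocab: Dict[str, int] = {}
    data: List[int] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for tokens in docs:
        # setdefault assigns the next free column to a token seen for the first time.
        counts = Counter(vocab.setdefault(t, len(vocab)) for t in tokens)
        for col in sorted(counts):
            indices.append(col)
            data.append(counts[col])
        indptr.append(len(indices))
    return vocab, data, indices, indptr


def row_dot(row: int, weights: Sequence[float], data, indices, indptr) -> float:
    """Dot product of one sparse row with a dense weight vector."""
    start, end = indptr[row], indptr[row + 1]
    return sum(data[k] * weights[indices[k]] for k in range(start, end))
```

A webpage typically contains only a small fraction of the vocabulary, so storing and iterating over non-zero counts alone keeps both memory use and per-document computation proportional to the document length rather than to the vocabulary size.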

S4: inputting a webpage to be screened, classifying its text with the model, returning the category result and accuracy, and recording harmful webpages and harmful websites. The trained model screens the input webpage content to obtain a classification result and accuracy; if the preset threshold is met, the result and its source are saved, and whether the website is a harmful website is judged according to preset rules.
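The recording and site-level judgment of step S4 might be sketched as follows. The source leaves the preset threshold and the preset rules unspecified; this sketch assumes the returned accuracy is a per-page confidence, a threshold of 0.8, and a rule that a website is harmful once at least two of its pages are recorded as harmful. The label names and the function names record_harmful_pages and judge_sites are likewise hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple
from urllib.parse import urlparse

PageResult = Tuple[str, str, float]  # (url, predicted label, confidence)

# Assumed label names for the harmful categories listed in the description.
HARMFUL_LABELS = {"pornography", "gambling", "violence", "terrorism"}


def record_harmful_pages(results: Iterable[PageResult], threshold: float = 0.8) -> List[PageResult]:
    """Keep results with a harmful label whose confidence meets the threshold."""
    return [
        (url, label, conf)
        for url, label, conf in results
        if label in HARMFUL_LABELS and conf >= threshold
    ]


def judge_sites(
    results: Iterable[PageResult],
    threshold: float = 0.8,
    min_harmful_pages: int = 2,
) -> Set[str]:
    """Assumed rule: a site is harmful if enough of its pages were recorded."""
    per_site = defaultdict(int)
    for url, _label, _conf in record_harmful_pages(results, threshold):
        per_site[urlparse(url).netloc] += 1
    return {site for site, n in per_site.items() if n >= min_harmful_pages}
```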

The machine-learning-based algorithm of the present invention provides strong support for the computation. Crawling known harmful information yields direct training material and ensures its authenticity and validity. Training the model provides a technical means of screening harmful information: through feature matching, text outside the training set can also be detected, and a machine-learning-based classification algorithm classifies the text to determine whether it is harmful.

The above is only a preferred embodiment of the present invention. It should be understood that the present invention is not limited to the form disclosed herein, which should not be regarded as excluding other embodiments; it can be used in various other combinations, modifications, and environments, and can be modified within the scope contemplated herein through the above teachings or the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the scope of protection of the appended claims.

Claims

1. A webpage harmful information identification method based on machine learning, characterized by comprising the following steps:

S2: preprocessing the crawled corpus to generate a training set and a test set;

2. The webpage harmful information identification method based on machine learning according to claim 1, characterized in that step S2 includes the following sub-steps:

S201: stripping the HTML and extracting the text information;

S202: rejecting low-quality text using a preset keyword set;

S203: generating the training set and the test set.

3. The webpage harmful information identification method based on machine learning according to claim 1 or 2, characterized in that step S2 further includes the following sub-steps:

4. The webpage harmful information identification method based on machine learning according to claim 1, characterized in that in step S3, corpus features are stored in a sparse matrix for model training and model verification.