A method for identifying harmful web page information based on machine learning
Technical field
The present invention relates to the technical field of web page content identification, and more particularly to a method for identifying harmful web page information based on machine learning.
Background art
With the continuous development of Internet infrastructure in China, the variety of website application services keeps growing. According to statistics, by the end of 2017 the number of websites in China had reached 5.2606 million, and the number of web pages is far larger still. Web pages and application services have become an important channel through which people obtain news and information every day. Because of the particular nature of cyberspace, the content stored on a website is not easily known before it is accessed, so the hundreds of millions of web pages held on network servers include no shortage of harmful content such as pornography, gambling, violence and terrorism, and the form and keywords of such content change frequently. If harmful content is left to spread unchecked, it is bound to have a severe social impact. How to screen web page content for harmfulness effectively, while also meeting the performance requirements of massive data processing, has therefore become a problem in urgent need of a solution.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a method for identifying harmful web page information based on machine learning, in which a corpus is obtained by a crawler, an identification model is trained, and the model is then used to determine whether web page content to be screened contains harmful content.
The object of the present invention is achieved through the following technical solution: a method for identifying harmful web page information based on machine learning, comprising the following steps:
S1: crawling websites of known categories with a web crawler to obtain the training corpus used for machine learning;
S2: preprocessing the crawled corpus to generate a training set and a test set;
S3: performing model training and model verification with a machine learning algorithm;
S4: inputting a web page to be screened, classifying its text with the model, and returning the category result and accuracy rate.
Step S2 comprises the following sub-steps:
S201: removing the HTML and extracting the text information;
S202: rejecting low-quality text using a preset keyword set;
S203: generating the training set and the test set.
Step S2 further comprises the following sub-steps:
S101: removing stop words, i.e. words that are useless for training and occur frequently;
S102: filtering weakly related corpus entries out of the corpus using preset category keywords.
In step S3, model training and model verification are carried out with the corpus features stored in a sparse matrix.
The beneficial effects of the present invention are:
1) The present invention relies on machine learning algorithms, which provide strong computational support. Known harmful information is crawled to obtain training material directly, which ensures that the material is authentic and valid. Training the model provides a technical means of screening harmful information: through feature matching, harmful text that is not in the training material can also be found, and the machine-learning-based classification algorithm classifies the text to determine whether it is harmful.
2) Based on machine learning, model training and text classification technology, the present invention classifies and identifies crawled web pages. According to the category assigned to each web page, it screens whether the web page contains harmful information and further judges whether the website contains harmful information.
3) The text classification algorithm based on machine learning can classify web page content to be identified quickly, so that harmful content is identified with high performance and high scalability.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the present invention;
Fig. 2 is a flow chart of web page screening according to the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described clearly and completely below with reference to the embodiments. Obviously, the described embodiments are only some, and not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to Figs. 1-2, the present invention provides a technical solution: a method for identifying harmful web page information based on machine learning, comprising the following steps:
S1: Websites of known categories are crawled with a web crawler to obtain the training corpus used for machine learning, that is, the web page content of known harmful information is collected. The harmful information includes, for example, pornography, gambling, violence and terrorism. Content is crawled from known harmful-information websites with the web crawler. With the text classification algorithm based on machine learning, web page content to be identified can then be classified quickly, so that harmful content is identified with high performance and high scalability.
Taking advantage of the characteristics of machine learning, feature values are computed for the web page content to be identified instead of comparing text verbatim, which improves accuracy.
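As an illustration only, the crawling of step S1 might be sketched as follows. The description does not specify how the crawler is built, so the breadth-first strategy, the names crawl_labeled_site, LinkExtractor and FAKE_PAGES, and the max_pages limit are all assumptions. The fetch function is passed in as a parameter so that the sketch runs without network access; a real crawler would supply an HTTP client there.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_labeled_site(seed_url, label, fetch, max_pages=50):
    """Breadth-first crawl of one site of known category.

    fetch(url) must return the page HTML as a string, or None on failure.
    Returns a list of (url, label, html) tuples, i.e. a labeled raw corpus.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    corpus = []
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        corpus.append((url, label, html))
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            # Stay on the known site so the category label remains valid.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus


# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_PAGES = {
    "http://example.test/": '<a href="/a">a</a><a href="http://other.test/x">x</a>',
    "http://example.test/a": '<a href="/">home</a>',
}
raw_corpus = crawl_labeled_site("http://example.test/", "gambling", FAKE_PAGES.get)
```

Restricting the crawl to the host of the seed URL is a design choice of this sketch: every page inherits the category of the known site, so links leading off the site are not followed.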
S2: The crawled corpus is preprocessed to generate a training set and a test set.
Step S2 comprises the following sub-steps:
S201: The HTML is removed and the text information is extracted, which simplifies the content and saves storage space;
S202: Low-quality text is rejected using a preset keyword set, which makes the data more accurate;
S203: The training set and the test set are generated.
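Sub-steps S201 to S203 can be sketched as below, under stated assumptions. The description does not say how the preset keyword set is applied, so this sketch assumes that a text is low quality when it contains any marker phrase from a hypothetical LOW_QUALITY_KEYWORDS set. The 80/20 split ratio and the fixed random seed are likewise assumptions.

```python
import random
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """S201: keeps visible text and drops tags, scripts and styles."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical marker phrases for pages that carry no usable content.
LOW_QUALITY_KEYWORDS = {"404 not found", "page does not exist", "please log in"}


def is_low_quality(text, keywords=LOW_QUALITY_KEYWORDS):
    """S202 (assumed rule): reject text containing any preset marker phrase."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def split_train_test(samples, test_ratio=0.2, seed=0):
    """S203: shuffle reproducibly, then split into (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


page = ("<html><head><style>p{}</style></head><body>"
        "<p>Hello <b>world</b></p><script>var x=1;</script></body></html>")
text = html_to_text(page)  # "Hello world"
```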
Step S2 further comprises the following sub-steps:
S101: Stop words, i.e. words that are useless for training and occur frequently, are removed. In information retrieval, certain characters or words are filtered out automatically before or after natural language data (or text) is processed, which saves storage space and improves search efficiency.
S102: Weakly related corpus entries are filtered out of the corpus using preset category keywords, which improves later accuracy.
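A minimal sketch of sub-steps S101 and S102 follows. The stop-word list, the category keyword table and the min_hits threshold are hypothetical values chosen for illustration. The regular-expression tokenizer is a simplification that suits space-delimited text; Chinese text would need a word segmenter instead, which the description does not name.

```python
import re

# Hypothetical stop-word list; a real one would be far longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "at"}

# Hypothetical preset category keywords.
CATEGORY_KEYWORDS = {
    "gambling": {"casino", "bet", "jackpot"},
    "violence": {"weapon", "assault"},
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """S101: drop frequent words that carry no signal for training."""
    return [token for token in tokens if token not in stop_words]


def is_strongly_related(tokens, label, min_hits=1):
    """S102 (assumed rule): keep a sample only if it contains at least
    min_hits keywords of its own category."""
    keywords = CATEGORY_KEYWORDS.get(label, set())
    hits = sum(1 for token in tokens if token in keywords)
    return hits >= min_hits


def filter_corpus(samples):
    """Apply S101 and S102 to (text, label) pairs; returns (tokens, label)."""
    kept = []
    for text, label in samples:
        tokens = remove_stop_words(tokenize(text))
        if is_strongly_related(tokens, label):
            kept.append((tokens, label))
    return kept


samples = [
    ("Bet at the casino and win the jackpot", "gambling"),
    ("Contact us about the site", "gambling"),  # weakly related, dropped
]
cleaned = filter_corpus(samples)
```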
S3: Model training and model verification are performed with a machine learning algorithm. A classification model is trained on the training set with the machine learning algorithm, and its accuracy is verified on the test set, yielding a model of acceptable accuracy together with the features of each category. The identification model only needs to be retrained when a new identification category is added, which is convenient and efficient. With the machine learning algorithm, identification results can be used to train the model incrementally, improving its subsequent identification accuracy. The model can also discover harmful content beyond the training material, which in turn makes it possible to crawl harmful information over a wider scope and to further optimize web pages.
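The description does not name a particular machine learning algorithm. Purely as an assumption, the sketch below uses a multinomial naive Bayes classifier with Laplace smoothing, because its count-based parameters make the incremental training mentioned above straightforward: calling update again simply adds more counts. The class name, the toy training data and the category labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over token lists (an assumed algorithm)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.class_docs = Counter()              # documents seen per category
        self.class_tokens = Counter()            # tokens seen per category
        self.word_counts = defaultdict(Counter)  # per-category word counts
        self.vocab = set()

    def update(self, samples):
        """Train on (tokens, label) pairs; call again to train incrementally."""
        for tokens, label in samples:
            self.class_docs[label] += 1
            self.class_tokens[label] += len(tokens)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict_proba(self, tokens):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        log_scores = {}
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)
            denom = self.class_tokens[label] + self.alpha * vocab_size
            for token in tokens:
                if token in self.vocab:  # unseen words are ignored
                    count = self.word_counts[label][token]
                    score += math.log((count + self.alpha) / denom)
            log_scores[label] = score
        # Normalize in log space to avoid underflow on long texts.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: e / z for label, e in exp.items()}

    def predict(self, tokens):
        proba = self.predict_proba(tokens)
        label = max(proba, key=proba.get)
        return label, proba[label]


def accuracy(model, test_set):
    """Fraction of test samples whose predicted category is correct."""
    correct = sum(1 for tokens, label in test_set
                  if model.predict(tokens)[0] == label)
    return correct / len(test_set)


train_set = [
    (["casino", "bet", "jackpot"], "gambling"),
    (["bet", "odds", "win"], "gambling"),
    (["weather", "news", "today"], "normal"),
    (["sports", "news", "score"], "normal"),
]
test_set = [(["casino", "win"], "gambling"), (["news", "today"], "normal")]

model = NaiveBayesTextClassifier()
model.update(train_set)
test_accuracy = accuracy(model, test_set)

# Incremental training: feed newly confirmed results back into the model.
model.update([(["poker", "chips"], "gambling")])
```

In this sketch a new category can be introduced by passing samples with a new label to update; whether that matches the retraining procedure intended by the description is not stated in the source.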
In step S3, model training and model verification are carried out with the corpus features stored in a sparse matrix, which increases computation speed and saves storage space.
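To illustrate why sparse storage helps, the sketch below stores bag-of-words counts in compressed sparse row (CSR) form using plain Python lists. The source does not specify the sparse format or the features, so CSR and word counts are assumptions, and build_csr and row_dot are hypothetical helper names. The three arrays data, indices and indptr follow the same layout that scipy.sparse.csr_matrix uses.

```python
from collections import Counter


def build_csr(token_lists):
    """Build a CSR term-count matrix from a list of token lists.

    Returns (vocab, data, indices, indptr): row i holds the counts
    data[indptr[i]:indptr[i + 1]] at columns indices[indptr[i]:indptr[i + 1]].
    """
    vocab = {}
    data, indices, indptr = [], [], [0]
    for tokens in token_lists:
        # Assign each new word the next free column index.
        counts = Counter(vocab.setdefault(t, len(vocab)) for t in tokens)
        for col in sorted(counts):
            indices.append(col)
            data.append(counts[col])
        indptr.append(len(indices))
    return vocab, data, indices, indptr


def row_dot(data, indices, indptr, row, weights):
    """Dot product of one sparse row with a dense weight vector.

    Only the non-zero entries of the row are visited.
    """
    start, end = indptr[row], indptr[row + 1]
    return sum(data[k] * weights[indices[k]] for k in range(start, end))


docs = [["bet", "casino", "bet"], ["news", "today"], ["casino", "news"]]
vocab, data, indices, indptr = build_csr(docs)

stored = len(data)                 # 6 non-zero counts
dense = len(docs) * len(vocab)     # 12 cells in the equivalent dense matrix
score = row_dot(data, indices, indptr, 0, [1.0, 0.5, -1.0, -1.0])  # 2.5
```

With a realistic vocabulary of tens of thousands of words and only a few hundred distinct words per page, the gap between stored and dense cells becomes far larger than in this toy example.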
S4: A web page to be screened is input and its text is classified by the model, the category result and accuracy rate are returned, and harmful web pages and harmful websites are recorded. The trained model screens the content of the input web page to obtain a classification result and an accuracy rate. If the preset threshold is met, the result and its source are saved, and whether the website is a harmful website is judged according to preset rules.
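The screening of step S4 might look like the following sketch. The source gives neither the threshold value nor the preset rules for judging a website, so SCORE_THRESHOLD, SITE_MIN_HARMFUL_PAGES and the "at least N harmful pages per host" rule are assumptions, and the "accuracy rate" returned per page is treated here as the model's confidence in the predicted category. toy_classify is a stand-in for the trained model.

```python
from collections import defaultdict
from urllib.parse import urlparse

HARMFUL_LABELS = {"pornography", "gambling", "violence", "terrorism"}
SCORE_THRESHOLD = 0.8        # hypothetical preset threshold
SITE_MIN_HARMFUL_PAGES = 2   # hypothetical preset rule for websites


def screen_pages(pages, classify):
    """Classify (url, text) pairs and record the harmful ones.

    classify(text) must return (label, score). A page is recorded, with
    its source URL, when the label is harmful and the score meets the
    threshold.
    """
    records = []
    for url, text in pages:
        label, score = classify(text)
        if label in HARMFUL_LABELS and score >= SCORE_THRESHOLD:
            records.append({"url": url, "label": label, "score": score})
    return records


def harmful_sites(records):
    """Assumed rule: a site is harmful once enough of its pages are recorded."""
    per_site = defaultdict(int)
    for record in records:
        per_site[urlparse(record["url"]).netloc] += 1
    return {site for site, n in per_site.items()
            if n >= SITE_MIN_HARMFUL_PAGES}


def toy_classify(text):
    """Stand-in for the trained model of step S3 (illustration only)."""
    return ("gambling", 0.95) if "casino" in text else ("normal", 0.90)


pages = [
    ("http://a.test/1", "casino night"),
    ("http://a.test/2", "casino bonus"),
    ("http://b.test/1", "casino"),
    ("http://b.test/2", "weather"),
]
records = screen_pages(pages, toy_classify)
sites = harmful_sites(records)
```

Recording the source URL with each result is what allows the page-level findings to be aggregated per website afterwards.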
The present invention relies on machine learning algorithms, which provide strong computational support. Known harmful information is crawled to obtain training material directly, which ensures that the material is authentic and valid. Training the model provides a technical means of screening harmful information: through feature matching, harmful text that is not in the training material can also be found, and the machine-learning-based classification algorithm classifies the text to determine whether it is harmful.
The above is only a preferred embodiment of the present invention. It should be understood that the present invention is not limited to the form disclosed herein, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications and environments, and can be modified within the scope contemplated herein through the above teachings or through the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the scope of protection of the appended claims.