CN109710825A - Webpage harmful information identification method based on machine learning - Google Patents

Webpage harmful information identification method based on machine learning

Info

Publication number
CN109710825A
Authority
CN
China
Prior art keywords
machine learning
webpage
training
model
harmful information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811302974.XA
Other languages
Chinese (zh)
Inventor
张家亮
卢江波
张明亮
贾宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wanglian Anrui Network Technology Co ltd
Original Assignee
Chengdu 30kaitian Communication Industry Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu 30kaitian Communication Industry Co ltd
Priority to CN201811302974.XA
Publication of CN109710825A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a machine-learning-based method for identifying harmful information on webpages, comprising the following steps: S1: using a web crawler to crawl websites of known classification to obtain a corpus for machine learning training; S2: preprocessing the crawled corpus to generate a training set and a test set; S3: performing model training and model verification with a machine learning algorithm; S4: inputting a webpage to be screened, classifying its text with the model, and returning the classification result and accuracy. Based on machine learning, model training, and text classification techniques, the method classifies and identifies captured webpages and, from the category assigned to each webpage, determines whether the webpage contains harmful information and, further, whether the website contains harmful information.

Description

A webpage harmful information identification method based on machine learning
Technical field
The present invention relates to the technical field of webpage content identification, and in particular to a webpage harmful information identification method based on machine learning.
Background art
With the continuing development of China's Internet infrastructure, the variety of website application services keeps growing. According to statistics, by the end of 2017 the number of websites in China had reached 5.2606 million, and the number of webpages is beyond counting. Webpages and application services have become an important channel through which people obtain news and information every day. Because of the particular nature of network space, the content stored on a website is not easily known before it is accessed, so the hundreds of millions of webpages on network servers include no shortage of harmful content such as pornography, gambling, violence, and terrorism. The form and keywords of this harmful content also change frequently. If harmful content is left to spread unchecked, it is bound to have a severe social impact. How to screen webpage content for harmfulness effectively while also meeting the performance requirements of massive data processing is therefore a problem in urgent need of a solution.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a webpage harmful information identification method based on machine learning, in which a corpus is obtained by a crawler, an identification model is trained, and the model is then used to determine whether the webpage content to be screened contains harmful content.
The object of the present invention is achieved through the following technical solution: a webpage harmful information identification method based on machine learning, comprising the following steps:
S1: use a web crawler to crawl websites of known classification to obtain the corpus used for machine learning training;
S2: preprocess the crawled corpus to generate a training set and a test set;
S3: perform model training and model verification with a machine learning algorithm;
S4: input the webpage to be screened, classify its text with the model, and return the category result and accuracy.
Step S2 includes the following sub-steps:
S201: remove HTML and extract the text information;
S202: reject low-quality text using a preset keyword set;
S203: generate the training set and test set.
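Sub-steps S201 and S202 could be sketched roughly as follows, using only Python's standard library. The names `extract_text` and `is_low_quality`, the minimum-length cut-off, and the rule that a usable text must contain at least one preset keyword are all assumptions made for illustration; the patent does not say how low-quality text is detected.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """S201: strip the HTML markup and return the remaining text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def is_low_quality(text, keywords, min_chars=20):
    """S202 (assumed rule): too short, or contains none of the preset keywords."""
    if len(text) < min_chars:
        return True
    return not any(k in text for k in keywords)
```

A page would then be kept for the corpus only when `is_low_quality(extract_text(html), keywords)` is false.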
Step S2 further includes the following sub-steps:
S101: apply stop-word removal to remove words that are useless for training and occur frequently;
S102: use preset category keywords to filter weakly related texts out of the corpus.
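As an illustration of S101, the sketch below combines a fixed stop-word list with words that occur in most documents. The function names, the `max_doc_ratio` cut-off, and the assumption that texts are already split into tokens are hypothetical; Chinese text would first need a word segmenter, which is outside this sketch.

```python
from collections import Counter


def build_stop_words(tokenized_docs, base_stop_words, max_doc_ratio=0.8):
    """Base list plus every token found in more than max_doc_ratio of the documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    n_docs = len(tokenized_docs)
    frequent = {t for t, df in doc_freq.items() if n_docs and df / n_docs > max_doc_ratio}
    return set(base_stop_words) | frequent


def remove_stop_words(tokens, stop_words):
    """Drop stop words while keeping the original token order."""
    return [t for t in tokens if t not in stop_words]
```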
In step S3, corpus features are stored in a sparse matrix for model training and model verification.
The beneficial effects of the present invention are:
1) The machine learning algorithm on which the invention is based provides strong computational support. Crawling known harmful information yields training material directly, which ensures that the material is authentic and valid. Training the model provides a technical means of screening for harmful information: through feature matching, harmful text that was not in the training set can also be found, and a machine-learning-based classification algorithm classifies the text to determine whether it is harmful.
2) Based on machine learning, model training, and text classification techniques, the invention classifies and identifies crawled webpages and, from the category assigned to each webpage, screens whether the webpage contains harmful information and further judges whether the website contains harmful information.
3) Using a machine-learning-based text classification algorithm, the webpage content to be identified can be classified quickly so that harmful content is identified, with high performance and high scalability.
Description of the drawings
Fig. 1 is a flow chart of the steps of the present invention;
Fig. 2 is a flow chart of webpage screening in the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described clearly and completely below with reference to the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to Figs. 1-2, the present invention provides a technical solution: a webpage harmful information identification method based on machine learning, comprising the following steps:
S1: use a web crawler to crawl websites of known classification to obtain the corpus used for machine learning training, that is, collect webpage content known to contain harmful information, such as pornography, gambling, violence, and terrorism. The web crawler crawls the content of known harmful-information websites. Using a machine-learning-based text classification algorithm, the webpage content to be identified can then be classified quickly so that harmful content is identified, with high performance and high scalability.
Exploiting the characteristics of machine learning, feature values of the webpage content to be identified are computed instead of comparing the text verbatim, which improves accuracy.
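A minimal sketch of the corpus collection in S1 is given below. It assumes a hypothetical `seeds` mapping from a category label to URLs of sites already known to belong to that category; `fetch_url` and `crawl_labeled_sites` are illustrative names, not part of the patent. A production crawler would also follow links, honor robots.txt, and rate-limit requests, none of which is shown here.

```python
from urllib.request import urlopen


def fetch_url(url, timeout=10):
    """Download one page and decode it leniently (requires network access)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl_labeled_sites(seeds, fetch=fetch_url):
    """Return (label, url, html) records for every page that downloads successfully.

    seeds: dict mapping a category label to a list of URLs (assumed input shape).
    fetch: injectable download function, so the crawler can be tested offline.
    """
    corpus = []
    for label, urls in seeds.items():
        for url in urls:
            try:
                html = fetch(url)
            except (OSError, ValueError, LookupError):
                continue  # unreachable page, malformed URL, or unknown charset
            corpus.append((label, url, html))
    return corpus
```

Because each record keeps its label, the crawled pages can serve directly as labeled training material for the later steps.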
S2: preprocess the crawled corpus to generate a training set and a test set.
Step S2 includes the following sub-steps:
S201: remove HTML and extract the text information, which simplifies the content and saves storage space;
S202: reject low-quality text using a preset keyword set, which makes the data more accurate;
S203: generate the training set and test set.
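For S203, one plausible way to generate the two sets is a stratified random split, sketched below. The 80/20 ratio, the fixed seed, and the function name `split_train_test` are assumptions; the patent does not specify how the split is made.

```python
import random
from collections import defaultdict


def split_train_test(samples, test_ratio=0.2, seed=0):
    """Split (text, label) pairs per label so both sets cover every category.

    A label with a single sample contributes it to the training set only.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_ratio)) if len(group) > 1 else 0
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Splitting per label avoids a test set that happens to contain no examples of a rare harmful category.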
Step S2 further includes the following sub-steps:
S101: apply stop-word removal to remove words that are useless for training and occur frequently. In information retrieval, certain characters or words are automatically filtered out before or after natural language data (or text) is processed, which saves storage space and improves search efficiency.
S102: use preset category keywords to filter weakly related texts out of the corpus, which improves later accuracy.
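The weak-relevance filter of S102 might look like the sketch below. The scoring rule (a count of keyword occurrences), the `min_hits` threshold, and the choice to leave categories without a keyword list unfiltered are assumptions made for illustration only.

```python
def keyword_hits(text, keywords):
    """Total number of occurrences of the given keywords in the text."""
    return sum(text.count(k) for k in keywords)


def filter_weakly_related(samples, category_keywords, min_hits=2):
    """Keep a (text, label) pair only if the text mentions its category's keywords often enough.

    category_keywords: dict mapping a label to its preset keyword list.
    Labels without a keyword list are passed through unchanged.
    """
    kept = []
    for text, label in samples:
        keywords = category_keywords.get(label)
        if keywords is None or keyword_hits(text, keywords) >= min_hits:
            kept.append((text, label))
    return kept
```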
S3: perform model training and model verification with a machine learning algorithm. A classification model is trained on the training set using a machine learning algorithm, and the test set is used to verify the accuracy of the model, yielding a model with acceptable accuracy together with the features of each category. When a new identification category is added, only the identification model needs to be retrained, which is quick and convenient. With a machine learning algorithm, recognition results can also be used for incremental training of the model, which improves its subsequent recognition accuracy. Harmful content outside the training material can be discovered as well, which in turn makes it possible to crawl a wider range of harmful information and thereby further optimize webpages.
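The patent does not name a specific algorithm for S3. As one possible stand-in, the sketch below uses multinomial Naive Bayes with add-one smoothing over pre-tokenized texts; the class name, method names, and token-list input format are assumptions. Calling `fit` again adds to the existing counts, which is one simple way to approximate the incremental training and the addition of new categories described above.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing (assumed algorithm choice)."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per label
        self.word_counts = defaultdict(Counter)   # token counts per label
        self.vocabulary = set()

    def fit(self, samples):
        """samples: iterable of (tokens, label); repeated calls accumulate counts."""
        for tokens, label in samples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, tokens):
        """Unnormalized log posterior of each label for the given tokens."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


def accuracy(model, test_samples):
    """Model verification: fraction of test samples whose label is predicted correctly."""
    correct = sum(1 for tokens, label in test_samples if model.predict(tokens) == label)
    return correct / len(test_samples)
```

If `accuracy` on the test set falls below whatever level is considered acceptable, the corpus or preprocessing would be revisited before the model is used for screening.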
In step S3, corpus features are stored in a sparse matrix for model training and model verification, which speeds up computation and saves storage space.
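To illustrate why sparse storage helps, the sketch below builds a bag-of-words count matrix in compressed sparse row (CSR) form, storing only the non-zero counts. The choice of CSR and of raw term counts as features is an assumption; the patent only states that a sparse matrix is used. The three arrays have the same layout that `scipy.sparse.csr_matrix((data, indices, indptr))` accepts, although SciPy itself is not used here.

```python
def build_csr(tokenized_docs):
    """Return (data, indices, indptr, vocab) for a document-term count matrix.

    data:    non-zero counts, row by row
    indices: column (vocabulary) index of each count
    indptr:  row i occupies data[indptr[i]:indptr[i + 1]]
    vocab:   dict mapping token to column index
    """
    vocab = {}
    data, indices, indptr = [], [], [0]
    for tokens in tokenized_docs:
        row = {}
        for t in tokens:
            col = vocab.setdefault(t, len(vocab))  # assign the next free column to new tokens
            row[col] = row.get(col, 0) + 1
        for col in sorted(row):
            indices.append(col)
            data.append(row[col])
        indptr.append(len(indices))
    return data, indices, indptr, vocab


def row_as_dict(data, indices, indptr, i):
    """Recover the non-zero entries of row i as {column: count}."""
    start, end = indptr[i], indptr[i + 1]
    return dict(zip(indices[start:end], data[start:end]))
```

Since a single webpage uses only a tiny fraction of the vocabulary, the stored entries grow with the number of distinct words per page rather than with the vocabulary size.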
S4: input the webpage to be screened, classify its text with the model, return the category result and accuracy, and record harmful webpages and harmful websites. The trained model screens the input webpage content and yields a classification result and an accuracy value. If the preset threshold is met, the result and its source are saved, and whether the website is a harmful website is judged according to preset rules.
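The screening logic of S4 could be sketched as below. The per-page "accuracy" is treated here as a confidence value in [0, 1]; that interpretation, the 0.9 threshold, and the site-level rule (a site is harmful when a minimum share of its screened pages is flagged) are assumptions, because the patent leaves both the threshold and the preset rules unspecified. The `classify` callable stands for any trained model.

```python
import math
from collections import defaultdict
from urllib.parse import urlparse


def to_probabilities(log_scores):
    """Softmax over per-category log scores, giving confidences that sum to 1."""
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(exp.values())
    return {label: v / total for label, v in exp.items()}


def screen_pages(pages, classify, harmful_labels, threshold=0.9):
    """pages: iterable of (url, text); classify(text) returns (label, confidence).

    Returns one record per page judged harmful with confidence >= threshold.
    """
    flagged = []
    for url, text in pages:
        label, confidence = classify(text)
        if label in harmful_labels and confidence >= threshold:
            flagged.append({"url": url, "label": label, "confidence": confidence})
    return flagged


def harmful_sites(flagged, pages_per_site, min_ratio=0.3):
    """Assumed site rule: harmful when at least min_ratio of screened pages were flagged."""
    hits = defaultdict(int)
    for record in flagged:
        hits[urlparse(record["url"]).netloc] += 1
    return {site for site, n in hits.items() if n / pages_per_site[site] >= min_ratio}
```

Keeping the URL in each record preserves the source of every result, so the site-level judgment can be recomputed if the rule changes.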
The machine learning algorithm on which the invention is based provides strong computational support. Crawling known harmful information yields training material directly, which ensures that the material is authentic and valid. Training the model provides a technical means of screening for harmful information: through feature matching, harmful text that was not in the training set can also be found, and a machine-learning-based classification algorithm classifies the text to determine whether it is harmful.
The above is only a preferred embodiment of the present invention. It should be understood that the present invention is not limited to the form disclosed herein, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be modified within the scope contemplated herein through the above teachings or the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the scope of protection of the appended claims.

Claims (4)

1. A webpage harmful information identification method based on machine learning, characterized by comprising the following steps:
S1: use a web crawler to crawl websites of known classification to obtain the corpus used for machine learning training;
S2: preprocess the crawled corpus to generate a training set and a test set;
S3: perform model training and model verification with a machine learning algorithm;
S4: input the webpage to be screened, classify its text with the model, and return the category result and accuracy.
2. The webpage harmful information identification method based on machine learning according to claim 1, characterized in that step S2 includes the following sub-steps:
S201: remove HTML and extract the text information;
S202: reject low-quality text using a preset keyword set;
S203: generate the training set and test set.
3. The webpage harmful information identification method based on machine learning according to claim 1 or 2, characterized in that step S2 further includes the following sub-steps:
S101: apply stop-word removal to remove words that are useless for training and occur frequently;
S102: use preset category keywords to filter weakly related texts out of the corpus.
4. The webpage harmful information identification method based on machine learning according to claim 1, characterized in that in step S3, corpus features are stored in a sparse matrix for model training and model verification.
CN201811302974.XA 2018-11-02 2018-11-02 Webpage harmful information identification method based on machine learning Pending CN109710825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811302974.XA CN109710825A (en) 2018-11-02 2018-11-02 Webpage harmful information identification method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811302974.XA CN109710825A (en) 2018-11-02 2018-11-02 Webpage harmful information identification method based on machine learning

Publications (1)

Publication Number Publication Date
CN109710825A true CN109710825A (en) 2019-05-03

Family

ID=66254281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811302974.XA Pending CN109710825A (en) 2018-11-02 2018-11-02 Webpage harmful information identification method based on machine learning

Country Status (1)

Country Link
CN (1) CN109710825A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414597A (en) * 2019-07-26 2019-11-05 博雅创智(天津)科技有限公司 The recognition methods of unartificial network request packet lines based on deep learning
CN111008347A (en) * 2019-11-25 2020-04-14 杭州安恒信息技术股份有限公司 Website identification method, device and system and computer readable storage medium
CN111209394A (en) * 2019-12-25 2020-05-29 国网北京市电力公司 Text classification processing method and device
CN112632355A (en) * 2020-11-26 2021-04-09 武汉虹旭信息技术有限责任公司 Fragment content processing method and device for harmful information
CN112650849A (en) * 2019-09-25 2021-04-13 北京国双科技有限公司 File processing method and device, storage medium and equipment
CN112837677A (en) * 2020-10-13 2021-05-25 讯飞智元信息科技有限公司 Harmful audio detection method and device
CN113657453A (en) * 2021-07-22 2021-11-16 珠海高凌信息科技股份有限公司 Harmful website detection method based on generation of countermeasure network and deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070294252A1 (en) * 2006-06-19 2007-12-20 Microsoft Corporation Identifying a web page as belonging to a blog
CN106886577A (en) * 2017-01-24 2017-06-23 淮阴工学院 A kind of various dimensions web page browsing behavior evaluation method
CN107315797A (en) * 2017-06-19 2017-11-03 江西洪都航空工业集团有限责任公司 A kind of Internet news is obtained and text emotion forecasting system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414597A (en) * 2019-07-26 2019-11-05 博雅创智(天津)科技有限公司 The recognition methods of unartificial network request packet lines based on deep learning
CN112650849A (en) * 2019-09-25 2021-04-13 北京国双科技有限公司 File processing method and device, storage medium and equipment
CN111008347A (en) * 2019-11-25 2020-04-14 杭州安恒信息技术股份有限公司 Website identification method, device and system and computer readable storage medium
CN111209394A (en) * 2019-12-25 2020-05-29 国网北京市电力公司 Text classification processing method and device
WO2021128721A1 (en) * 2019-12-25 2021-07-01 国网北京市电力公司 Method and device for text classification
CN112837677A (en) * 2020-10-13 2021-05-25 讯飞智元信息科技有限公司 Harmful audio detection method and device
CN112632355A (en) * 2020-11-26 2021-04-09 武汉虹旭信息技术有限责任公司 Fragment content processing method and device for harmful information
CN113657453A (en) * 2021-07-22 2021-11-16 珠海高凌信息科技股份有限公司 Harmful website detection method based on generation of countermeasure network and deep learning

Similar Documents

Publication Publication Date Title
CN109710825A (en) Webpage harmful information identification method based on machine learning
TWI735543B (en) Method and device for webpage text classification, method and device for webpage text recognition
Vadivukarassi et al. Sentimental analysis of tweets using Naive Bayes algorithm
CN111310476B (en) Public opinion monitoring method and system using aspect-based emotion analysis method
CN107885793A (en) A kind of hot microblog topic analyzing and predicting method and system
CN111324801B (en) Hot event discovery method in judicial field based on hot words
CN103577755A (en) Malicious script static detection method based on SVM (support vector machine)
CN109558587B (en) Method for classifying public opinion tendency recognition aiming at category distribution imbalance
CN105956740B (en) Semantic risk calculation method based on text logical features
CN108363784A (en) A kind of public sentiment trend estimate method based on text machine learning
CN103177036A (en) Method and system for label automatic extraction
CN109918621A (en) Newsletter archive infringement detection method and device based on digital finger-print and semantic feature
CN103246644A (en) Method and device for processing Internet public opinion information
CN112541476A (en) Malicious webpage identification method based on semantic feature extraction
CN109918648B (en) Rumor depth detection method based on dynamic sliding window feature score
Archchitha et al. Opinion spam detection in online reviews using neural networks
CN112559747A (en) Event classification processing method and device, electronic equipment and storage medium
CN104794209B (en) Chinese microblogging mood sorting technique based on Markov logical network and system
CN115329085A (en) Social robot classification method and system
CN107291686B (en) Method and system for identifying emotion identification
CN112434163A (en) Risk identification method, model construction method, risk identification device, electronic equipment and medium
CN116089732B (en) User preference identification method and system based on advertisement click data
CN110704611B (en) Illegal text recognition method and device based on feature de-interleaving
CN104281710A (en) Network data excavation method
CN110309387A (en) A kind of big data syndication reading recommended method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220518

Address after: 518000 22nd floor, building C, Shenzhen International Innovation Center (Futian science and Technology Plaza), No. 1006, Shennan Avenue, Xintian community, Huafu street, Futian District, Shenzhen, Guangdong Province

Applicant after: Shenzhen wanglian Anrui Network Technology Co.,Ltd.

Address before: Floor 4-8, unit 5, building 1, 333 Yunhua Road, high tech Zone, Chengdu, Sichuan 610041

Applicant before: CHENGDU 30KAITIAN COMMUNICATION INDUSTRY Co.,Ltd.

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190503