CN107957872A - Method for acquiring the complete source code of a website, and method and system for detecting illegal websites - Google Patents

Method for acquiring the complete source code of a website, and method and system for detecting illegal websites

Info

Publication number
CN107957872A
Authority
CN
China
Prior art keywords
source code
website
feature
complete source
text feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710940131.1A
Other languages
Chinese (zh)
Inventor
周发
袁晓彤
耿光刚
延志伟
李晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Internet Network Information Center
Original Assignee
China Internet Network Information Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Internet Network Information Center
Priority to CN201710940131.1A
Publication of CN107957872A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/53 Decompilation; Disassembly
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Abstract

The invention discloses a method for acquiring the complete source code of a website, and a method and system for detecting illegal websites. The system comprises a complete source code extraction module, a feature extraction module, and an illegal website identification model. The complete source code extraction module extracts the complete source code of a website. The feature extraction module extracts the text features of each complete source code to obtain its text feature set, extracts the non-text statistical features of the complete source code, and merges the features in the text feature sets and calculates the average feature value of each feature to obtain a text feature file. The illegal website identification model judges, from the complete-source-code features and their feature values and from the non-text statistical features of a website to be identified, whether that website is an illegal website. The model is generated from the text feature sets and non-text statistical features of the complete source code of each website in a sample website set, using a machine learning algorithm. The invention improves the accuracy of website identification.

Description

Method for acquiring the complete source code of a website, and method and system for detecting illegal websites
Technical field
The present invention relates to a method for acquiring the complete source code of a website and to a method and system for detecting illegal websites, and belongs to the field of network technology.
Background
With the development of the Internet, it has entered every field. At the same time, however, the Internet is also being used by traditional illegal industries, for example to sell guns and drugs or to run gambling and pornography operations. Such objectionable websites may also embed Trojans, viruses, and the like. These abuses seriously threaten the healthy development of the Internet and harm the physical and mental health and the property of Internet users. To detect objectionable websites, researchers have proposed many detection methods.
Blacklists and whitelists are one means of identifying illegal websites. The major browser vendors regularly update their blacklists in order to identify illegal websites and warn users. Although blacklists are effective, their shortcomings are obvious: each browser vendor must update its blacklist frequently and promptly, and an illegal website that has not yet been added to the blacklist cannot be identified.
Heuristic algorithms based on text content are another means of identifying illegal websites. Such algorithms rely on preset objectionable keywords and sentences: if a website contains these keywords or sentences, it is treated as an illegal website. These algorithms are too simple and easily misclassify. A normal website, such as a news site, is judged illegal if it contains some of the keywords or sentences. Conversely, as with blacklists, if the coverage of keywords or sentences is insufficient, an illegal website goes unrecognized and is treated as a normal website.
As machine learning has become widely used, it has also been applied to identifying illegal websites. In the paper "A text mining approach to Internet abuse detection" by Chen-Huei Chou et al., algorithms such as naive Bayes, neural networks, support vector machines, and decision trees were shown experimentally to work well for the binary classification of illegal websites. However, the paper derives its features only from the text information in the source code, so identification can still be inaccurate.
To evade detection of their sites, the builders of illegal websites also use many anti-detection techniques, which further increases the difficulty of detection. In addition, the complete source code of current websites is difficult to obtain with conventional methods; if the html code that is actually and completely displayed in the browser cannot be obtained, accurate detection of the website is difficult.
Summary of the invention
In view of the technical problems in the prior art, the object of the present invention is to provide a method for acquiring the complete source code of a website, and a method and system for detecting illegal websites.
Among the illegal websites collected, the inventors found that many use JavaScript code in their own pages to dynamically load and display illegal content, or do not place the JavaScript code in the site's own code at all but fetch it asynchronously from other addresses; the JavaScript code executes only when a browser parses the page. The inventors also found that some websites put no illegal content in their own page source but nest the content of an illegal page inside an <iframe> tag; the <iframe> content is loaded into the displayed page only when a browser parses the original page source. With these methods, a tester using tools such as wget cannot obtain the html code of the objectionable website as it is actually and completely displayed in the browser, and without that code accurate detection is difficult. The present invention also considers some non-text statistical features of the html, such as the number of <iframe> tags in the html structure. In actual use, the random forest algorithm was found to perform very well.
The technical solution of the present invention is as follows.
A method for acquiring the complete source code of a website, comprising the steps of:
1) for each target website, using PhantomJS to dynamically load the JavaScript code of the target website, and obtaining the html code after the JavaScript has been executed;
2) obtaining, from the html code, the URL in each tag that initiates a request, obtaining the html code of that URL, and adding it at the corresponding position in the complete source code of the target website;
3) recursively repeating the processing of step 2) to obtain the final complete source code of the target website.
Further, the tag that initiates a request is the <iframe> tag.
Further, in step 2), a timeout mechanism is provided: if no response from the current URL is received within a set time, the access request to that URL is stopped.
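The recursion in steps 2) and 3), together with the timeout, can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: it substitutes a plain HTTP fetch for the PhantomJS-rendered html, and the names `fetch_html`, `inline_iframes`, the `max_depth` limit, and the set of already visited URLs are assumptions added here so that the sketch terminates on cyclic nesting.

```python
import re
import urllib.request
from urllib.parse import urljoin

# Matches an opening <iframe ...> tag and captures the value of its src attribute.
IFRAME_RE = re.compile(
    r'<iframe\b[^>]*?\ssrc\s*=\s*["\']([^"\']+)["\'][^>]*>', re.IGNORECASE
)


def fetch_html(url, timeout=5.0):
    """Plain HTTP fetch (stand-in for a rendered page).

    Returns an empty string if the URL times out or cannot be fetched,
    so an unresponsive site does not block the crawl.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def inline_iframes(html, base_url, fetch=fetch_html, max_depth=5, _seen=None):
    """Recursively insert the html of each iframe URL right after its <iframe> tag."""
    seen = {base_url} if _seen is None else _seen
    if max_depth <= 0:
        return html

    def expand(match):
        url = urljoin(base_url, match.group(1))
        if url in seen:  # hypothetical guard against cyclic nesting
            return match.group(0)
        seen.add(url)
        inner = inline_iframes(fetch(url), url, fetch, max_depth - 1, seen)
        return match.group(0) + inner

    return IFRAME_RE.sub(expand, html)
```

Passing a different `fetch` callable (for example one that returns html rendered by a headless browser) keeps the recursion unchanged, which is why the fetch step is a parameter in this sketch.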
A method for detecting illegal websites, comprising the steps of:
obtaining the complete source code of a website to be identified; according to the text feature file in the illegal website identification model, extracting the corresponding features from the complete source code of the website to be identified as the features of that complete source code, and setting the feature value of each such feature to the feature value of the corresponding feature in the text feature file; extracting the non-text statistical features of the complete source code of the website to be identified;
inputting the complete-source-code features and their feature values, together with the non-text statistical features, of the website to be identified into the illegal website identification model, and judging whether the website to be identified is an illegal website;
wherein the illegal website identification model is generated as follows: obtaining the complete source code of each website in a sample website set to obtain a complete source code set; extracting the text features of each complete source code in the set to obtain the text feature set of that complete source code; extracting the non-text statistical features of each complete source code; merging the features in the text feature sets and calculating the average feature value of each feature to obtain the text feature file;
generating the illegal website identification model based on the text feature set and the non-text statistical features of the complete source code of each website in the sample website set and on a machine learning algorithm.
Further, the text feature file is obtained as follows: segmenting the Chinese text in the complete source code into words and calculating the TF-IDF value of each word; then, based on the information gain values of the words, selecting a plurality of words as the features of the complete source code, and taking the selected features and their TF-IDF values as the text feature set of the complete source code; merging the features in the text feature sets, calculating the average TF-IDF value of each feature from its TF-IDF values in the different text feature sets, and generating the text feature file from the merged features and their average TF-IDF values.
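A minimal sketch of the TF-IDF and averaging steps is given below. It assumes the pages have already been segmented into word lists, uses one common TF-IDF variant (relative term frequency times the logarithm of N over document frequency), and averages each feature over the pages in which it occurs; the source does not fix any of these details, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_per_document(docs):
    """Return one {word: tf-idf} dict per document (each document is a list of words)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    result = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        result.append(
            {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return result


def build_feature_file(feature_sets, selected):
    """Merge per-page feature sets and average the TF-IDF value of each selected word."""
    sums = defaultdict(float)
    hits = defaultdict(int)
    for features in feature_sets:
        for word, value in features.items():
            if word in selected:
                sums[word] += value
                hits[word] += 1
    return {word: sums[word] / hits[word] for word in sums}
```

The returned dictionary plays the role of the text feature file: one entry per selected word with its average TF-IDF value.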
Further, the statistical nature includes the complete source code<iframe>Number of labels,<title>Label is put down Equal length, the quantity of URL,<div>The quantity of label,<ul>Quantity, the quantity of & symbols.
Further, the machine learning algorithm is random forests algorithm.
A kind of illegal website detecting system, it is characterised in that including complete source code extraction module, characteristic extracting module and non- Net of justice station identification model;Wherein,
The complete source code extraction module, for extracting the complete source code of website;The website include website to be identified and Each website in sample site measure set;The sample site measure set includes multiple illegal websites and multiple legitimate sites;
The characteristic extracting module, for extracting the text feature of complete source code, obtains the text feature of the complete source code Set;And extract the non-textual statistical nature of the complete source code;Feature in each text feature set is merged simultaneously The mean eigenvalue of each feature is calculated, obtains text feature file;
The illegal website identification model, for the complete source code feature and its characteristic value according to website to be identified, non-text This statistical nature, judges whether the website to be identified is illegal website;
Wherein, based on the text feature set of the corresponding complete source code in each website, non-textual system in sample site measure set Feature and machine learning algorithm are counted, generates the illegal website identification model.
To counter these anti-detection methods of illegal websites, the present invention imitates the browser's parsing process to obtain the html code actually displayed in the browser. PhantomJS is first used to dynamically load the JavaScript code and obtain the html code after the JavaScript has been executed. The URLs in the <iframe> tags are then obtained from that html source. To handle multiple levels of nesting, the invention recursively repeats the above process on these URLs and finally obtains the true, complete html code. A timeout mechanism is also added to the code: requests to websites that do not respond are abandoned after the timeout, which prevents blocking.
Once true, complete html code can be obtained, the invention uses this program to collect the html code of illegal websites and of normal pages to build the data set. TF-IDF is first used to extract the text features of each website's source code; because the resulting dimensionality is too high, feature selection is then performed. The average TF-IDF value of each selected feature is calculated, and these values are saved with their corresponding features as the feature file. In the feature extraction stage, besides the features obtained with TF-IDF, the invention also extracts non-text statistical features from the html code, such as the number of images, the number of <iframe> tags, the number of links, and the number of <div> tags. Using these features effectively improves the accuracy and recall of the algorithm.
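The feature selection step is described elsewhere in this document as ranking words by information gain. A minimal sketch of that ranking for a binary label (illegal or normal) follows; it treats each word as a present/absent feature, which is one common formulation and an assumption here, and the function names are hypothetical.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with pos positive and neg negative samples."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, labels, word):
    """Information gain of the presence of `word` with respect to labels (1 illegal, 0 normal)."""
    with_pos = with_neg = without_pos = without_neg = 0
    for tokens, label in zip(docs, labels):
        if word in tokens:
            if label == 1:
                with_pos += 1
            else:
                with_neg += 1
        elif label == 1:
            without_pos += 1
        else:
            without_neg += 1
    n = len(docs)
    n_with = with_pos + with_neg
    n_without = without_pos + without_neg
    return (
        entropy(with_pos + without_pos, with_neg + without_neg)
        - (n_with / n) * entropy(with_pos, with_neg)
        - (n_without / n) * entropy(without_pos, without_neg)
    )


def select_top_features(docs, labels, k=600):
    """Return the k words with the highest information gain (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda w: (-information_gain(docs, labels, w), w))
    return ranked[:k]
```

Each document is expected as a set (or list) of words; the default `k=600` mirrors the number of features the description reports selecting.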
In the usage stage, the invention is developed into a browser plug-in. When a user visits a page, the plug-in obtains the page's html code and extracts features using the saved feature file: if the page contains a word present in the feature file, the value of that feature is set to the word's TF-IDF value in the feature file. The model is then used to obtain the result for the page.
Compared with the prior art, the positive effects of the present invention are:
(1) Because the browser's parsing process is imitated, the true, complete html code of a website is obtained, so that the anti-detection measures of objectionable websites fail; this provides an accurate and complete feature set for the subsequent machine learning on page text features and page structure features, and improves the accuracy of the algorithm.
(2) In the feature extraction stage, in addition to the text features of the page, the non-text statistical features of the html code are also considered; comparing the final results shows that adding these features improves the recognition of illegal websites by 4%.
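The lookup described above can be sketched as follows. The feature file is represented as a dictionary from word to average TF-IDF value; the ordering of features, the value 0.0 for words absent from the page, the `STAT_ORDER` names, and the sample values in the usage note are assumptions for illustration, not details given in the source.

```python
# Hypothetical, fixed order for the non-text statistical features.
STAT_ORDER = (
    "iframe_count",
    "avg_title_length",
    "url_count",
    "div_count",
    "ul_count",
    "ampersand_count",
)


def build_feature_vector(page_words, feature_file, stats):
    """Build the model input for one page.

    page_words:   words found in the page's text
    feature_file: {word: average TF-IDF value} saved at training time
    stats:        {statistic name: value} for the page
    """
    present = set(page_words)
    # A word from the feature file that occurs in the page takes the stored value;
    # a word that does not occur is assumed to contribute 0.0.
    text_part = [feature_file[w] if w in present else 0.0 for w in sorted(feature_file)]
    stat_part = [float(stats.get(name, 0)) for name in STAT_ORDER]
    return text_part + stat_part
```

For example, with a feature file `{"casino": 0.12, "bet": 0.08}` and a page containing only "casino", the text part of the vector is `[0.0, 0.12]` (features sorted alphabetically), followed by the six statistics.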
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed description
To make the above features and advantages of the present invention clearer, embodiments are described in detail below with reference to the accompanying drawing.
The invention captures the true, complete html code, extracts features from it, and then trains a random forest with the extracted features as the training set to obtain the illegal website identification model. The model is used to identify whether a website is an illegal website. All processing is implemented in Python. The method flow is shown in Fig. 1 and comprises the following steps:
1) The invention uses PhantomJS, a scriptable headless browser for automated interaction with web pages. PhantomJS has no UI; it provides a JavaScript API that supports behaviors such as automatic navigation and screenshots, and offers a browser environment similar to those of Safari and Chrome. The invention uses PhantomJS to dynamically load JavaScript scripts and thus obtain the html code after the JavaScript code has been executed.
2) After obtaining the html code with the JavaScript loaded, the invention looks in the html code for tags on which the browser issues a further request during parsing, chiefly the <iframe> tag; many illegal websites use <iframe> tags to hide the html code shown in the browser. To handle multi-level nesting, the invention recursively obtains the html code of the URL in each iframe tag and adds it at the corresponding position. This process yields the complete, true html code.
3) The invention constructs a data set of 1800 URLs. The illegal websites cover guns, pornography, and gambling, and were all found by the inventors in practice. The normal samples cover websites on a variety of topics. Some of these URLs were obtained from a search engine with the search expression "site:.cn", on the view that URLs obtained this way can be assumed to be normal websites; the rest are likewise websites encountered in practice. The method described above is then used to obtain the html code for each domain name in the data set.
4) From the obtained html code, the invention first extracts all Chinese text and segments the Chinese text of each page with jieba; the segmentation result is the input to TF-IDF, giving the TF-IDF value of each word. Because there are many pages and each contains much text, the dimensionality of the TF-IDF features is high, so feature selection is necessary. Based on information gain, the invention selects the 600 features with the highest information gain values to construct the text feature set based on the html text. It also calculates the average TF-IDF value of each feature over the training set and saves the features and these values as the feature file. Alongside the html text feature set, the invention extracts non-text statistical features, including the number of <iframe> tags, the average length of <title> tags, the number of URLs, the number of <div> tags, the number of <ul> tags, and the number of & symbols.
5) After the feature set is obtained, the data set from step 4) is split into two parts: 1500 samples as the training set and 300 as the validation set. The supervised random forest algorithm is trained on the training set to obtain the illegal website identification model, which is then evaluated on the validation set; the model's average accuracy, recall, and F1 score all reach 96%.
6) After training, the invention saves the illegal website identification model and develops a browser plug-in in Python. When a user browses a page, the plug-in obtains the html code and extracts the page's features using the saved feature file: when the page contains a word from the feature file, the feature value (TF-IDF value) of the matched word is set to the corresponding feature value (TF-IDF value) in the feature file. The non-text features of the page are extracted as well. With the features obtained, the plug-in calls the saved model with the page's feature vector as input and judges whether the page is illegal.
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the invention; the scope of protection shall be as defined in the claims.
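The split and evaluation in step 5) can be illustrated with the short sketch below. It only shows how a 1500/300 split and the reported kinds of metrics might be computed; the random forest training itself is not reproduced, the shuffling with a fixed seed is an assumption, and the metrics are computed for the positive (illegal) class only, whereas the source reports averaged figures without saying how they were averaged.

```python
import random


def split_dataset(samples, n_train, seed=0):
    """Shuffle the samples and split them into a training part and a validation part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

With 1800 samples, `split_dataset(samples, 1500)` yields the 1500 training samples and 300 validation samples described in step 5).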

Claims (10)

1. A method for acquiring the complete source code of a website, comprising the steps of:
1) for each target website, using PhantomJS to dynamically load the JavaScript code of the target website, and obtaining the html code after the JavaScript has been executed;
2) obtaining, from the html code, the URL in each tag that initiates a request, obtaining the html code of that URL, and adding it at the corresponding position in the complete source code of the target website;
3) recursively repeating the processing of step 2) to obtain the final complete source code of the target website.
2. The method according to claim 1, characterized in that the tag that initiates a request is the <iframe> tag.
3. The method according to claim 1 or 2, characterized in that, in step 2), a timeout mechanism is provided: if no response from the current URL is received within a set time, the access request to that URL is stopped.
4. A method for detecting illegal websites, comprising the steps of:
obtaining the complete source code of a website to be identified; according to the text feature file in the illegal website identification model, extracting the corresponding features from the complete source code of the website to be identified as the features of that complete source code, and setting the feature value of each such feature to the feature value of the corresponding feature in the text feature file; extracting the non-text statistical features of the complete source code of the website to be identified;
inputting the complete-source-code features and their feature values, together with the non-text statistical features, of the website to be identified into the illegal website identification model, and judging whether the website to be identified is an illegal website;
wherein the illegal website identification model is generated as follows: obtaining the complete source code of each website in a sample website set to obtain a complete source code set; extracting the text features of each complete source code in the set to obtain the text feature set of that complete source code; extracting the non-text statistical features of each complete source code; merging the features in the text feature sets and calculating the average feature value of each feature to obtain the text feature file;
generating the illegal website identification model based on the text feature set and the non-text statistical features of the complete source code of each website in the sample website set and on a machine learning algorithm.
5. The method according to claim 4, characterized in that the text feature file is obtained as follows: segmenting the Chinese text in the complete source code into words and calculating the TF-IDF value of each word; then, based on the information gain values of the words, selecting a plurality of words as the features of the complete source code, and taking the selected features and their TF-IDF values as the text feature set of the complete source code; merging the features in the text feature sets, calculating the average TF-IDF value of each feature from its TF-IDF values in the different text feature sets, and generating the text feature file from the merged features and their average TF-IDF values.
6. The method according to claim 4, characterized in that the statistical features include, for the complete source code, the number of <iframe> tags, the average length of <title> tags, the number of URLs, the number of <div> tags, the number of <ul> tags, and the number of & symbols.
7. The method according to claim 4, characterized in that the machine learning algorithm is the random forest algorithm.
8. An illegal website detection system, characterized by comprising a complete source code extraction module, a feature extraction module, and an illegal website identification model; wherein
the complete source code extraction module is used to extract the complete source code of a website; the websites include the website to be identified and each website in a sample website set; the sample website set includes a plurality of illegal websites and a plurality of legitimate websites;
the feature extraction module is used to extract the text features of a complete source code to obtain the text feature set of that complete source code, to extract the non-text statistical features of the complete source code, and to merge the features in the text feature sets and calculate the average feature value of each feature to obtain a text feature file;
the illegal website identification model is used to judge, from the complete-source-code features and their feature values and from the non-text statistical features of the website to be identified, whether the website to be identified is an illegal website;
wherein the illegal website identification model is generated based on the text feature set and the non-text statistical features of the complete source code of each website in the sample website set and on a machine learning algorithm.
9. The system according to claim 8, characterized in that the feature extraction module segments the Chinese text in the complete source code into words and calculates the TF-IDF value of each word; then, based on the information gain values of the words, selects a plurality of words as the features of the complete source code, and takes the selected features and their TF-IDF values as the text feature set of the complete source code; merges the features in the text feature sets, calculates the average TF-IDF value of each feature from its TF-IDF values in the different text feature sets, and generates the text feature file from the merged features and their average TF-IDF values.
10. The system according to claim 8, characterized in that the statistical features include, for the complete source code, the number of <iframe> tags, the average length of <title> tags, the number of URLs, the number of <div> tags, the number of <ul> tags, and the number of & symbols.
CN201710940131.1A 2017-10-11 2017-10-11 A kind of full web site source code acquisition methods and illegal website detection method, system Pending CN107957872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710940131.1A CN107957872A (en) 2017-10-11 2017-10-11 A kind of full web site source code acquisition methods and illegal website detection method, system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710940131.1A CN107957872A (en) 2017-10-11 2017-10-11 A kind of full web site source code acquisition methods and illegal website detection method, system

Publications (1)

Publication Number Publication Date
CN107957872A true CN107957872A (en) 2018-04-24

Family

ID=61953975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710940131.1A Pending CN107957872A (en) 2017-10-11 2017-10-11 A kind of full web site source code acquisition methods and illegal website detection method, system

Country Status (1)

Country Link
CN (1) CN107957872A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875060A (en) * 2018-06-29 2018-11-23 成都市映潮科技股份有限公司 A kind of website identification method and identifying system
CN109522454A (en) * 2018-11-20 2019-03-26 四川长虹电器股份有限公司 The method for automatically generating web sample data
CN110138794A (en) * 2019-05-22 2019-08-16 杭州安恒信息技术股份有限公司 A kind of counterfeit website identification method, device, equipment and readable storage medium storing program for executing
CN110275958A (en) * 2019-06-26 2019-09-24 北京市博汇科技股份有限公司 Site information recognition methods, device and electronic equipment
US20190332968A1 (en) * 2018-04-29 2019-10-31 Microsoft Technology Licensing, Llc. Code completion for languages with hierarchical structures
CN111311411A (en) * 2020-02-14 2020-06-19 北京三快在线科技有限公司 Illegal behavior identification method and device
CN111339453A (en) * 2018-12-19 2020-06-26 顺丰科技有限公司 Navigation page distinguishing method and device
WO2020151173A1 (en) * 2019-01-25 2020-07-30 深信服科技股份有限公司 Webpage tampering detection method and related apparatus
CN111506791A (en) * 2020-04-10 2020-08-07 安徽博约信息科技股份有限公司 Method for monitoring medical content of affiliated network station
CN112347244A (en) * 2019-08-08 2021-02-09 四川大学 Method for detecting website involved in yellow and gambling based on mixed feature analysis
CN112347402A (en) * 2020-10-21 2021-02-09 上海淇玥信息技术有限公司 Illegal website/APP automatic identification method, system and electronic device
CN114553486A (en) * 2022-01-20 2022-05-27 北京百度网讯科技有限公司 Illegal data processing method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1372334A2 (en) * 1995-05-08 2003-12-17 Digimarc Corporation Method of embedding a machine readable steganographic code
CN101436210A (en) * 2008-12-16 2009-05-20 北京百问百答网络技术有限公司 Method and system for recognizing counterfeit web page
CN101520796A (en) * 2009-02-16 2009-09-02 深圳市腾讯计算机系统有限公司 Method and system for extracting uniform resource locators from web page content
CN102332028A (en) * 2011-10-15 2012-01-25 西安交通大学 Webpage-oriented unhealthy Web content identifying method
CN103544210A (en) * 2013-09-02 2014-01-29 烟台中科网络技术研究所 System and method for identifying webpage types
CN104714980A (en) * 2013-12-17 2015-06-17 阿里巴巴集团控股有限公司 Page nesting path determination method and device
CN105704099A (en) * 2014-11-26 2016-06-22 国家电网公司 Method for detecting illegal links hidden in website scripts
CN107181730A (en) * 2017-03-13 2017-09-19 烟台中科网络技术研究所 A kind of counterfeit website monitoring recognition methods and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
小木人印象: "Example of testing retrieval of iframe content after loading, or calling its internal methods (compatible with mainstream browsers)", 《WWW.XWOOD.NET/_SITE_DOMAIN_/_ROOT/5870/5874/T_C262630.HTML》 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190332968A1 (en) * 2018-04-29 2019-10-31 Microsoft Technology Licensing, Llc. Code completion for languages with hierarchical structures
US11645576B2 (en) * 2018-04-29 2023-05-09 Microsoft Technology Licensing, Llc. Code completion for languages with hierarchical structures
CN108875060B (en) * 2018-06-29 2021-02-26 成都市映潮科技股份有限公司 Website identification method and identification system
CN108875060A (en) * 2018-06-29 2018-11-23 成都市映潮科技股份有限公司 A kind of website identification method and identifying system
CN109522454A (en) * 2018-11-20 2019-03-26 四川长虹电器股份有限公司 The method for automatically generating web sample data
CN109522454B (en) * 2018-11-20 2022-06-03 四川长虹电器股份有限公司 Method for automatically generating web sample data
CN111339453A (en) * 2018-12-19 2020-06-26 顺丰科技有限公司 Navigation page distinguishing method and device
WO2020151173A1 (en) * 2019-01-25 2020-07-30 深信服科技股份有限公司 Webpage tampering detection method and related apparatus
CN111488623A (en) * 2019-01-25 2020-08-04 深信服科技股份有限公司 Webpage tampering detection method and related device
CN110138794A (en) * 2019-05-22 2019-08-16 杭州安恒信息技术股份有限公司 A kind of counterfeit website identification method, device, equipment and readable storage medium storing program for executing
CN110275958A (en) * 2019-06-26 2019-09-24 北京市博汇科技股份有限公司 Site information recognition methods, device and electronic equipment
CN110275958B (en) * 2019-06-26 2021-07-27 北京市博汇科技股份有限公司 Website information identification method and device and electronic equipment
CN112347244A (en) * 2019-08-08 2021-02-09 四川大学 Method for detecting website involved in yellow and gambling based on mixed feature analysis
CN111311411B (en) * 2020-02-14 2022-03-08 北京三快在线科技有限公司 Illegal behavior identification method and device
CN111311411A (en) * 2020-02-14 2020-06-19 北京三快在线科技有限公司 Illegal behavior identification method and device
CN111506791A (en) * 2020-04-10 2020-08-07 安徽博约信息科技股份有限公司 Method for monitoring medical content of affiliated network station
CN112347402A (en) * 2020-10-21 2021-02-09 上海淇玥信息技术有限公司 Illegal website/APP automatic identification method, system and electronic device
CN114553486A (en) * 2022-01-20 2022-05-27 北京百度网讯科技有限公司 Illegal data processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107957872A (en) A kind of full web site source code acquisition methods and illegal website detection method, system
US8838992B1 (en) Identification of normal scripts in computer systems
Blum et al. Lexical feature based phishing URL detection using online learning
US8943588B1 (en) Detecting unauthorized websites
Dunlop et al. Goldphish: Using images for content-based phishing analysis
US9521161B2 (en) Method and apparatus for detecting computer fraud
US20150244728A1 (en) Method and device for detecting malicious url
US10311120B2 (en) Method and apparatus for identifying webpage type
CN106230831B (en) A kind of method and system identifying browser uniqueness and feature of risk
CN107862050A (en) A kind of web site contents safety detecting system and method
CN108038173B (en) Webpage classification method and system and webpage classification equipment
CN108881138B (en) Webpage request identification method and device
US20220030029A1 (en) Phishing Protection Methods and Systems
CN110572359A (en) Phishing webpage detection method based on machine learning
CN106682489A (en) Password security detection method, password security reminding method and corresponding devices
CN109858248A (en) Malice Word document detection method and device
CN107786537A (en) A kind of lonely page implantation attack detection method based on internet intersection search
CN109922065A (en) Malicious websites method for quickly identifying
CN106060038B (en) Detection method for phishing site based on client-side program behavioural analysis
Kumar et al. edarkfind: Unsupervised multi-view learning for sybil account detection
JP7182764B2 (en) Fraudulent web page detection device, control method and control program for fraudulent web page detection device
CN106357682A (en) Phishing website detecting method
Sonowal et al. Masphid: a model to assist screen reader users for detecting phishing sites using aural and visual similarity measures
CN114048480A (en) Vulnerability detection method, device, equipment and storage medium
Thao et al. Hunting brand domain forgery: a scalable classification for homograph attack

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20180424)