CN107957872A - Method for acquiring complete website source code, and method and system for detecting illegal websites - Google Patents
- Publication number: CN107957872A
- Application number: CN201710940131.1A
- Authority: CN (China)
- Prior art keywords: source code, website, feature, complete source, text feature
- Prior art date
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/53—Decompilation; Disassembly
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
The invention discloses a method for acquiring the complete source code of a website, and a method and system for detecting illegal websites. The system comprises a complete source code extraction module, a feature extraction module, and an illegal website identification model. The complete source code extraction module extracts the complete source code of a website. The feature extraction module extracts the text features of each complete source code to obtain that source code's text feature set, extracts the non-text statistical features of the complete source code, and merges the features of all text feature sets, calculating the mean feature value of each feature, to obtain a text feature file. The illegal website identification model judges whether a website to be identified is an illegal website from the features and feature values of its complete source code together with its non-text statistical features. The model is generated from the text feature sets and non-text statistical features of the complete source code of each website in a sample website set, using a machine learning algorithm. The invention improves the accuracy of website identification.
Description
Technical field
The present invention relates to a method for acquiring the complete source code of a website and to a method and system for detecting illegal websites, and belongs to the field of network technology.
Background
With the development of the Internet, it has entered every field. Along with this, however, the Internet is also used by some traditional illegal industries, for example to sell guns and drugs or to run gambling and pornography operations. These objectionable websites may also embed Trojans, viruses, and the like. Such abuse seriously threatens the healthy development of the Internet and harms users' physical and mental health and the safety of their property. To detect objectionable websites, researchers have proposed many detection methods.
Blacklists and whitelists are one means of identifying illegal websites. Major browser vendors regularly update their blacklists in order to identify illegal websites and warn users. Although blacklists are effective, their shortcoming is obvious: each browser vendor must update its blacklist frequently and promptly, and an illegal website that has not yet been added to the blacklist cannot be identified.
Heuristic algorithms based on text content are another means of identifying illegal websites. Such algorithms rely on preset objectionable keywords and sentences: if a website contains these keywords or sentences, it is treated as an illegal website. These algorithms are too simple and easily misclassify. A normal website, such as a news site, is treated as illegal if it contains some of the keywords or sentences. Conversely, as with blacklists, if the coverage of the keywords or sentences is insufficient, an illegal website goes unrecognized and is treated as a normal website.
As machine learning has become widely used, it has also been applied to identifying illegal websites. In the paper "A text mining approach to Internet abuse detection", Chen-Huei Chou et al. showed experimentally that algorithms such as naive Bayes, neural networks, support vector machines, and decision trees perform well on the binary classification of illegal websites. However, that paper derives its features only from the text information in the source code, so the problem of inaccurate identification remains.
For their part, the builders of illegal websites use many anti-detection techniques to evade detection of their sites, which further increases the difficulty of detection. At the same time, the complete source code of current websites is difficult to obtain with conventional methods. If the html code that is actually and completely displayed in the browser cannot be obtained, it is difficult to detect a website accurately.
Summary of the invention
In view of the technical problems in the prior art, the object of the present invention is to provide a method for acquiring the complete source code of a website and a method and system for detecting illegal websites.
Among the illegal websites collected for the present invention, many were found to use JavaScript code in the page itself to dynamically load and display illegal content, or to keep the JavaScript code out of the site's own code and fetch it asynchronously from other addresses; the JavaScript code is executed only when a browser parses the page. It was also found that some websites do not put illegal content in their own page source, but nest the illegal page inside an <iframe> tag; the <iframe> content is likewise loaded into the displayed page only when a browser parses the original page source. By using these methods, illegal websites prevent a tester from obtaining, with tools such as wget, the html code that is actually and completely displayed in the browser. Without that html code, accurate detection is difficult. The present invention also considers some non-text statistical features of the html, such as the number of <iframe> tags in the html structure, and in practical use the random forest algorithm was found to perform very well.
The technical solution of the present invention is as follows.
A method for acquiring the complete source code of a website, comprising the steps of:
1) for each target website, using PhantomJS to dynamically load the JavaScript code of the target website, and obtaining the html code after the JavaScript has been executed;
2) obtaining, from the html code, the URL in each tag that initiates a request, obtaining the html code of that URL, and adding it at the corresponding position in the complete source code of the target website;
3) recursively applying the processing of step 2) to obtain the final complete source code of the target website.
Further, the tag that initiates a request is the <iframe> tag.
Further, in step 2), a timeout mechanism is set: if no response from the current URL is received within a set time, the access request to that URL is stopped.
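Steps 2) and 3), together with the timeout, might be sketched in Python as follows. This is only an illustration under stated assumptions, not the patent's implementation: it assumes the JavaScript-rendered html from step 1) is already available as a string, it matches <iframe> tags with a simple regular expression, and it fetches each frame with the standard library's urllib rather than a headless browser. The names inline_iframes, fetch_html, max_depth, and seen are hypothetical.

```python
import http.client
import re
import urllib.request
from urllib.parse import urljoin

# Matches <iframe ... src="...">...</iframe>. A regex is a simplification;
# a production crawler would use a real HTML parser.
IFRAME_RE = re.compile(
    r'(<iframe\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\'][^>]*>)(.*?)(</iframe>)',
    re.IGNORECASE | re.DOTALL,
)


def fetch_html(url, timeout=5.0):
    """Fetch a URL; return '' on any error, including no response within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and socket timeouts are OSError subclasses.
        return ""


def inline_iframes(html, base_url, fetch=fetch_html, max_depth=5, seen=None):
    """Recursively place each iframe's html between its opening and closing tag."""
    seen = set() if seen is None else seen

    def replace(match):
        open_tag, src, _body, close_tag = match.groups()
        url = urljoin(base_url, src)
        # Stop on deep nesting or on a URL already expanded (guards against cycles).
        if max_depth <= 0 or url in seen:
            return match.group(0)
        seen.add(url)
        inner = inline_iframes(fetch(url), url, fetch, max_depth - 1, seen)
        return open_tag + inner + close_tag

    return IFRAME_RE.sub(replace, html)
```

Passing the fetch function as a parameter keeps the recursion independent of how pages are retrieved, so the same routine could be driven by a headless browser instead of urllib.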
A method for detecting illegal websites, comprising the steps of:
Obtaining the complete source code of a website to be identified; according to the text feature file in the illegal website identification model, extracting the corresponding features from the complete source code of the website to be identified as the features of that complete source code, and setting the feature value of each such feature to the feature value of the corresponding feature in the text feature file; and extracting the non-text statistical features of the complete source code of the website to be identified;
Inputting the complete source code features and their feature values, together with the non-text statistical features, of the website to be identified into the illegal website identification model, and judging whether the website to be identified is an illegal website;
Wherein the illegal website identification model is generated as follows: obtaining the complete source code of each website in a sample website set to obtain a complete source code set; extracting the text features of each complete source code in the set to obtain the text feature set of that complete source code; extracting the non-text statistical features of each complete source code; merging the features in all text feature sets and calculating the mean feature value of each feature to obtain the text feature file;
Generating the illegal website identification model from the text feature sets and non-text statistical features of the complete source code of each website in the sample website set, using a machine learning algorithm.
Further, the text feature file is obtained as follows: the Chinese text in the complete source code is segmented into words, and the TF-IDF value of each word is calculated; then, based on the information gain of the words, a number of words are selected as the features of the complete source code, and the selected features together with their TF-IDF values form the text feature set of that complete source code; the features in all text feature sets are merged, the average TF-IDF value of each feature is calculated from the TF-IDF values of that feature in the different text feature sets, and the text feature file is generated from the merged features and their average TF-IDF values.
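The TF-IDF and averaging steps above can be illustrated with a small pure-Python sketch. Several points are assumptions, since the text does not fix them: word segmentation is taken as already done (each document is a list of words), the plain tf × log(N/df) variant of TF-IDF is used, and a feature's average is taken over the feature sets that contain it. The function names tf_idf and build_feature_file are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_idf(docs):
    """docs: list of word lists. Returns one {word: tf-idf} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        result.append({
            word: (cnt / total) * math.log(n_docs / doc_freq[word])
            for word, cnt in counts.items()
        })
    return result


def build_feature_file(feature_sets):
    """Merge per-site feature sets; average each feature over the sets containing it."""
    sums = defaultdict(float)
    counts = Counter()
    for feature_set in feature_sets:
        for word, value in feature_set.items():
            sums[word] += value
            counts[word] += 1
    return {word: sums[word] / counts[word] for word in sums}
```

In this variant a word that occurs in every document gets a TF-IDF of zero; smoothed variants of the inverse document frequency avoid that and would be an equally reasonable choice.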
Further, the statistical features include, for the complete source code, the number of <iframe> tags, the average length of <title> tags, the number of URLs, the number of <div> tags, the number of <ul> tags, and the number of & symbols.
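As an illustration, the six statistical features listed above could be counted with Python's standard html.parser module. This is a sketch under assumptions: the text does not define how a URL is counted, so here every non-empty href or src attribute counts as one URL, and the title length is measured in characters after stripping whitespace. The names statistical_features and _StatsParser are hypothetical.

```python
from html.parser import HTMLParser


class _StatsParser(HTMLParser):
    """Collects tag counts, URL-bearing attributes, and <title> texts."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tag_counts = {"iframe": 0, "div": 0, "ul": 0}
        self.url_count = 0
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.tag_counts:
            self.tag_counts[tag] += 1
        if tag == "title":
            self._in_title = True
            self.titles.append("")
        # Assumption: each non-empty href/src attribute counts as one URL.
        self.url_count += sum(
            1 for name, value in attrs if name in ("href", "src") and value
        )

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and self.titles:
            self.titles[-1] += data


def statistical_features(html):
    """Return the non-text statistical features of an html string as a dict."""
    parser = _StatsParser()
    parser.feed(html)
    parser.close()
    titles = [t.strip() for t in parser.titles]
    avg_title = sum(len(t) for t in titles) / len(titles) if titles else 0.0
    return {
        "iframe_count": parser.tag_counts["iframe"],
        "avg_title_length": avg_title,
        "url_count": parser.url_count,
        "div_count": parser.tag_counts["div"],
        "ul_count": parser.tag_counts["ul"],
        "ampersand_count": html.count("&"),
    }
```

The counts are taken on the complete source code, that is, after iframe content has been inlined, so hidden frames contribute to the tag and URL totals.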
Further, the machine learning algorithm is random forests algorithm.
A system for detecting illegal websites, characterized by comprising a complete source code extraction module, a feature extraction module, and an illegal website identification model; wherein
The complete source code extraction module is used to extract the complete source code of a website; the websites include the website to be identified and each website in a sample website set; the sample website set includes multiple illegal websites and multiple legitimate websites;
The feature extraction module is used to extract the text features of a complete source code to obtain the text feature set of that complete source code, to extract the non-text statistical features of the complete source code, and to merge the features in all text feature sets and calculate the mean feature value of each feature to obtain the text feature file;
The illegal website identification model is used to judge, from the complete source code features and their feature values and the non-text statistical features of the website to be identified, whether that website is an illegal website;
Wherein the illegal website identification model is generated from the text feature sets and non-text statistical features of the complete source code of each website in the sample website set, using a machine learning algorithm.
To counter these anti-detection methods of illegal websites, the present invention imitates the parsing process of a browser in order to obtain the html code that is actually displayed in the browser. First, PhantomJS is used to dynamically load the JavaScript code and obtain the html code after the JavaScript has been executed. Then the URLs in <iframe> tags are extracted from the html source obtained. To handle multiple levels of nesting, the invention recursively repeats the above process on these URLs and finally obtains the true, complete html code. A timeout mechanism is also added to the code: requests to websites that do not respond are abandoned after the timeout, which prevents the program from blocking.
Once the true, complete html code can be obtained, the present invention uses this program to fetch the html code of illegal websites and of normal web pages, completing the construction of the data set. The invention first uses TF-IDF to extract the text features of each website's source code; because the resulting dimensionality is too high, feature selection is then performed. The invention also calculates the average TF-IDF value of each selected feature and saves these values with their corresponding features as a feature file. In the feature extraction stage, in addition to the features obtained with TF-IDF, the invention also extracts non-text statistical features of the html code, such as the number of images, the number of <iframe> tags, the number of links, and the number of <div> tags. Using these features effectively improves the accuracy and recall of the algorithm.
In the use stage, the present invention is developed into a browser plug-in. When the user visits a web page, the plug-in obtains the page's html code and extracts features using the feature file saved earlier: if the page contains a word present in the feature file, the value of that feature is set to the TF-IDF value stored for the word in the feature file. The model is then used to obtain the result for the page.
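The lookup described above amounts to building a fixed-length vector from the saved feature file. A minimal sketch follows, with assumptions labeled: words absent from the page get the value 0.0 (the text does not say what value they take), features are ordered by sorted key so that the order is the same at training time and at use time, and the statistical features are appended after the text features. The name build_feature_vector is hypothetical.

```python
def build_feature_vector(page_words, feature_file, stat_features):
    """Return the model input for one page.

    page_words:    words segmented from the page's text
    feature_file:  {word: average TF-IDF} saved at training time
    stat_features: {name: number} non-text statistical features of the page
    """
    present = set(page_words)
    # A word found in the page takes the stored average TF-IDF value;
    # a missing word takes 0.0 (assumption).
    text_part = [
        feature_file[word] if word in present else 0.0
        for word in sorted(feature_file)
    ]
    stat_part = [float(stat_features[name]) for name in sorted(stat_features)]
    return text_part + stat_part
```

Because the stored average is used instead of a TF-IDF value recomputed for the page, the plug-in only needs a membership test per word, which keeps the per-page cost low.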
Compared with the prior art, the positive effects of the present invention are:
(1) Because the browser's parsing process is imitated, the true, complete html code of a website is obtained, so that the anti-detection measures of objectionable websites fail. This provides an accurate and complete feature set for the subsequent machine learning on page text features and page structure features, and improves the accuracy of the algorithm.
(2) In the feature extraction stage, in addition to the text features of the page, the non-text statistical features of the html code are also considered. Comparison of the final results shows that adding these features improves the recognition of illegal websites by 4%.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed description of the embodiments
To make the above features and advantages of the present invention clearer, embodiments are described in detail below with reference to the accompanying drawing.
The present invention captures the true, complete html code, extracts features from the html code, and then uses the extracted features as a training set to train a random forest, obtaining the illegal website identification model. The model is then used to identify whether a website is illegal. The entire process is implemented in Python. The method flow is shown in Fig. 1 and comprises the following steps:
1) The present invention uses PhantomJS, a scriptable headless browser for automated interaction with web pages. PhantomJS has no UI; it provides a JavaScript API that supports automated navigation, screenshots, and similar operations, and offers a browser environment similar to that of Safari and Chrome. The invention uses PhantomJS to dynamically load JavaScript scripts and thereby obtain the html code after the JavaScript code has been executed.
2) After obtaining the html code with the JavaScript loaded, the invention looks in the html code for the tags that cause a browser to issue a new request during parsing, here mainly the <iframe> tag. Many illegal websites use <iframe> tags to hide the html code that is displayed in the browser. To handle multiple levels of nesting, the invention recursively fetches the html code of the URL in each iframe tag and adds it at the corresponding position. This process yields the true, complete html code.
3) The present invention built a data set of 1800 URLs. The illegal websites cover guns, pornography, and gambling, and were all found in the course of practical work. The normal samples cover websites on a variety of topics. Part of the normal URLs were obtained from a search engine with the search expression "site:.cn", on the view that URLs obtained this way can be relied on to be normal websites; the rest were likewise collected in the course of practical work. The method described above for obtaining html code is then used to obtain the html code for each domain name in the data set.
4) From the html code obtained, the invention first extracts all Chinese text and segments the Chinese text of each page's html code with jieba. The segmentation result is the input to TF-IDF, which yields the TF-IDF value of each segmented word. Because there are many pages and each contains much text, the feature dimensionality obtained with TF-IDF is high, so feature selection is essential. Based on information gain, the invention selects the 600 features with the highest information gain values, in descending order, and constructs the text feature set based on the html text. The invention also calculates the average TF-IDF value of each feature over the training set and saves the features and these values as the feature file. While obtaining the html text feature set, the invention extracts the non-text statistical features, including the number of <iframe> tags, the average length of <title> tags, the number of URLs, the number of <div> tags, the number of <ul> tags, and the number of & symbols.
5) After obtaining the feature set, the invention divides the data set obtained in step 4) into two parts: 1500 samples as the training set and 300 as the validation set. The invention trains a supervised random forest algorithm on the training set to obtain an illegal website identification model, and then evaluates the model on the validation set. The average accuracy, recall, and f1 value of the model all reach 96%.
6) After training, the invention saves the illegal website identification model and develops a browser plug-in using Python. When the user browses a page, the plug-in obtains the html code and extracts the page's features using the saved feature file: when the page contains a word in the feature file, the feature value (TF-IDF value) of the matched word is set to the value (TF-IDF value) stored for that feature in the feature file. The non-text features of the page are extracted as well. With the features obtained, the plug-in calls the saved model with the page's feature vector as input and judges whether the page is illegal.
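The information-gain selection used in step 4) can be illustrated with a short pure-Python sketch. It assumes each page has already been segmented into words and carries a binary label (1 for illegal, 0 for normal), and it scores a word by how much knowing whether the word is present reduces the entropy of the labels; the text does not spell out the exact formula, so this is the textbook presence/absence form. The names entropy, information_gain, and select_top_k are hypothetical, and k=600 mirrors the number given in step 4).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, word):
    """Entropy reduction from splitting the pages on presence of `word`."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part) for part in (with_word, without) if part
    )
    return entropy(labels) - conditional


def select_top_k(docs, labels, k=600):
    """Return the k words with the highest information gain, best first."""
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    scores = {w: information_gain(doc_sets, labels, w) for w in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda w: (-scores[w], w))[:k]
```

A word that appears only in illegal pages, or only in normal ones, splits the labels cleanly and scores highest, while a word spread evenly over both classes scores near zero and is dropped.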
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the invention; the scope of protection of the invention shall be as defined in the claims.
Claims (10)
1. A method for acquiring the complete source code of a website, comprising the steps of:
1) for each target website, using PhantomJS to dynamically load the JavaScript code of the target website, and obtaining the html code after the JavaScript has been executed;
2) obtaining, from the html code, the URL in each tag that initiates a request, obtaining the html code of that URL, and adding it at the corresponding position in the complete source code of the target website;
3) recursively applying the processing of step 2) to obtain the final complete source code of the target website.
2. The method as claimed in claim 1, characterized in that the tag that initiates a request is the <iframe> tag.
3. The method as claimed in claim 1 or 2, characterized in that, in step 2), a timeout mechanism is set: if no response from the current URL is received within a set time, the access request to that URL is stopped.
4. A method for detecting illegal websites, comprising the steps of:
Obtaining the complete source code of a website to be identified; according to the text feature file in the illegal website identification model, extracting the corresponding features from the complete source code of the website to be identified as the features of that complete source code, and setting the feature value of each such feature to the feature value of the corresponding feature in the text feature file; and extracting the non-text statistical features of the complete source code of the website to be identified;
Inputting the complete source code features and their feature values, together with the non-text statistical features, of the website to be identified into the illegal website identification model, and judging whether the website to be identified is an illegal website;
Wherein the illegal website identification model is generated as follows: obtaining the complete source code of each website in a sample website set to obtain a complete source code set; extracting the text features of each complete source code in the set to obtain the text feature set of that complete source code; extracting the non-text statistical features of each complete source code; merging the features in all text feature sets and calculating the mean feature value of each feature to obtain the text feature file;
Generating the illegal website identification model from the text feature sets and non-text statistical features of the complete source code of each website in the sample website set, using a machine learning algorithm.
5. The method as claimed in claim 4, characterized in that the text feature file is obtained as follows: the Chinese text in the complete source code is segmented into words, and the TF-IDF value of each word is calculated; then, based on the information gain of the words, a number of words are selected as the features of the complete source code, and the selected features together with their TF-IDF values form the text feature set of that complete source code; the features in all text feature sets are merged, the average TF-IDF value of each feature is calculated from the TF-IDF values of that feature in the different text feature sets, and the text feature file is generated from the merged features and their average TF-IDF values.
6. The method as claimed in claim 4, characterized in that the statistical features include, for the complete source code, the number of <iframe> tags, the average length of <title> tags, the number of URLs, the number of <div> tags, the number of <ul> tags, and the number of & symbols.
7. The method as claimed in claim 4, characterized in that the machine learning algorithm is the random forest algorithm.
8. A system for detecting illegal websites, characterized by comprising a complete source code extraction module, a feature extraction module, and an illegal website identification model; wherein
The complete source code extraction module is used to extract the complete source code of a website; the websites include the website to be identified and each website in a sample website set; the sample website set includes multiple illegal websites and multiple legitimate websites;
The feature extraction module is used to extract the text features of a complete source code to obtain the text feature set of that complete source code, to extract the non-text statistical features of the complete source code, and to merge the features in all text feature sets and calculate the mean feature value of each feature to obtain the text feature file;
The illegal website identification model is used to judge, from the complete source code features and their feature values and the non-text statistical features of the website to be identified, whether that website is an illegal website;
Wherein the illegal website identification model is generated from the text feature sets and non-text statistical features of the complete source code of each website in the sample website set, using a machine learning algorithm.
9. The system as claimed in claim 8, characterized in that the feature extraction module segments the Chinese text in the complete source code into words and calculates the TF-IDF value of each word; then, based on the information gain of the words, selects a number of words as the features of the complete source code, the selected features together with their TF-IDF values forming the text feature set of that complete source code; merges the features in all text feature sets and calculates the average TF-IDF value of each feature from the TF-IDF values of that feature in the different text feature sets; and generates the text feature file from the merged features and their average TF-IDF values.
10. The system as claimed in claim 8, characterized in that the statistical features include, for the complete source code, the number of <iframe> tags, the average length of <title> tags, the number of URLs, the number of <div> tags, the number of <ul> tags, and the number of & symbols.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710940131.1A CN107957872A (en) | 2017-10-11 | 2017-10-11 | A kind of full web site source code acquisition methods and illegal website detection method, system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710940131.1A CN107957872A (en) | 2017-10-11 | 2017-10-11 | A kind of full web site source code acquisition methods and illegal website detection method, system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107957872A true CN107957872A (en) | 2018-04-24 |
Family
ID=61953975
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710940131.1A Pending CN107957872A (en) | 2017-10-11 | 2017-10-11 | A kind of full web site source code acquisition methods and illegal website detection method, system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107957872A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108875060A (en) * | 2018-06-29 | 2018-11-23 | 成都市映潮科技股份有限公司 | A kind of website identification method and identifying system |
CN109522454A (en) * | 2018-11-20 | 2019-03-26 | 四川长虹电器股份有限公司 | The method for automatically generating web sample data |
CN110138794A (en) * | 2019-05-22 | 2019-08-16 | 杭州安恒信息技术股份有限公司 | A kind of counterfeit website identification method, device, equipment and readable storage medium storing program for executing |
CN110275958A (en) * | 2019-06-26 | 2019-09-24 | 北京市博汇科技股份有限公司 | Site information recognition methods, device and electronic equipment |
US20190332968A1 (en) * | 2018-04-29 | 2019-10-31 | Microsoft Technology Licensing, Llc. | Code completion for languages with hierarchical structures |
CN111311411A (en) * | 2020-02-14 | 2020-06-19 | 北京三快在线科技有限公司 | Illegal behavior identification method and device |
CN111339453A (en) * | 2018-12-19 | 2020-06-26 | 顺丰科技有限公司 | Navigation page distinguishing method and device |
WO2020151173A1 (en) * | 2019-01-25 | 2020-07-30 | 深信服科技股份有限公司 | Webpage tampering detection method and related apparatus |
CN111506791A (en) * | 2020-04-10 | 2020-08-07 | 安徽博约信息科技股份有限公司 | Method for monitoring medical content of affiliated network station |
CN112347244A (en) * | 2019-08-08 | 2021-02-09 | 四川大学 | Method for detecting website involved in yellow and gambling based on mixed feature analysis |
CN112347402A (en) * | 2020-10-21 | 2021-02-09 | 上海淇玥信息技术有限公司 | Illegal website/APP automatic identification method, system and electronic device |
CN114553486A (en) * | 2022-01-20 | 2022-05-27 | 北京百度网讯科技有限公司 | Illegal data processing method and device, electronic equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1372334A2 (en) * | 1995-05-08 | 2003-12-17 | Digimarc Corporation | Method of embedding a machine readable steganographic code |
CN101436210A (en) * | 2008-12-16 | 2009-05-20 | 北京百问百答网络技术有限公司 | Method and system for recognizing counterfeit web page |
CN101520796A (en) * | 2009-02-16 | 2009-09-02 | 深圳市腾讯计算机系统有限公司 | Method and system for extracting uniform resource locators from web page content |
CN102332028A (en) * | 2011-10-15 | 2012-01-25 | 西安交通大学 | Webpage-oriented unhealthy Web content identifying method |
CN103544210A (en) * | 2013-09-02 | 2014-01-29 | 烟台中科网络技术研究所 | System and method for identifying webpage types |
CN104714980A (en) * | 2013-12-17 | 2015-06-17 | 阿里巴巴集团控股有限公司 | Page nesting path determination method and device |
CN105704099A (en) * | 2014-11-26 | 2016-06-22 | 国家电网公司 | Method for detecting illegal links hidden in website scripts |
CN107181730A (en) * | 2017-03-13 | 2017-09-19 | 烟台中科网络技术研究所 | A kind of counterfeit website monitoring recognition methods and system |
-
2017
- 2017-10-11 CN CN201710940131.1A patent/CN107957872A/en active Pending
Non-Patent Citations (1)
Title |
---|
小木人印象: "Example of getting an iframe's content after it loads or calling its internal methods (compatible with mainstream browsers)", 《WWW.XWOOD.NET/_SITE_DOMAIN_/_ROOT/5870/5874/T_C262630.HTML》 * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190332968A1 (en) * | 2018-04-29 | 2019-10-31 | Microsoft Technology Licensing, Llc. | Code completion for languages with hierarchical structures |
US11645576B2 (en) * | 2018-04-29 | 2023-05-09 | Microsoft Technology Licensing, Llc. | Code completion for languages with hierarchical structures |
CN108875060B (en) * | 2018-06-29 | 2021-02-26 | 成都市映潮科技股份有限公司 | Website identification method and identification system |
CN108875060A (en) * | 2018-06-29 | 2018-11-23 | 成都市映潮科技股份有限公司 | Website identification method and identification system |
CN109522454A (en) * | 2018-11-20 | 2019-03-26 | 四川长虹电器股份有限公司 | Method for automatically generating web sample data |
CN109522454B (en) * | 2018-11-20 | 2022-06-03 | 四川长虹电器股份有限公司 | Method for automatically generating web sample data |
CN111339453A (en) * | 2018-12-19 | 2020-06-26 | 顺丰科技有限公司 | Navigation page distinguishing method and device |
WO2020151173A1 (en) * | 2019-01-25 | 2020-07-30 | 深信服科技股份有限公司 | Webpage tampering detection method and related apparatus |
CN111488623A (en) * | 2019-01-25 | 2020-08-04 | 深信服科技股份有限公司 | Webpage tampering detection method and related device |
CN110138794A (en) * | 2019-05-22 | 2019-08-16 | 杭州安恒信息技术股份有限公司 | Counterfeit website identification method, device, equipment and readable storage medium |
CN110275958A (en) * | 2019-06-26 | 2019-09-24 | 北京市博汇科技股份有限公司 | Website information identification method and device and electronic equipment |
CN110275958B (en) * | 2019-06-26 | 2021-07-27 | 北京市博汇科技股份有限公司 | Website information identification method and device and electronic equipment |
CN112347244A (en) * | 2019-08-08 | 2021-02-09 | 四川大学 | Method for detecting pornography and gambling websites based on mixed feature analysis |
CN111311411B (en) * | 2020-02-14 | 2022-03-08 | 北京三快在线科技有限公司 | Illegal behavior identification method and device |
CN111311411A (en) * | 2020-02-14 | 2020-06-19 | 北京三快在线科技有限公司 | Illegal behavior identification method and device |
CN111506791A (en) * | 2020-04-10 | 2020-08-07 | 安徽博约信息科技股份有限公司 | Method for monitoring medical content of affiliated network station |
CN112347402A (en) * | 2020-10-21 | 2021-02-09 | 上海淇玥信息技术有限公司 | Illegal website/APP automatic identification method, system and electronic device |
CN114553486A (en) * | 2022-01-20 | 2022-05-27 | 北京百度网讯科技有限公司 | Illegal data processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN107957872A (en) | A kind of full web site source code acquisition methods and illegal website detection method, system | |
US8838992B1 (en) | Identification of normal scripts in computer systems | |
Blum et al. | Lexical feature based phishing URL detection using online learning | |
US8943588B1 (en) | Detecting unauthorized websites | |
Dunlop et al. | GoldPhish: Using images for content-based phishing analysis | |
US9521161B2 (en) | Method and apparatus for detecting computer fraud | |
US20150244728A1 (en) | Method and device for detecting malicious url | |
US10311120B2 (en) | Method and apparatus for identifying webpage type | |
CN106230831B (en) | Method and system for identifying browser uniqueness and risk features | |
CN107862050A (en) | Website content security detection system and method | |
CN108038173B (en) | Webpage classification method and system and webpage classification equipment | |
CN108881138B (en) | Webpage request identification method and device | |
US20220030029A1 (en) | Phishing Protection Methods and Systems | |
CN110572359A (en) | Phishing webpage detection method based on machine learning | |
CN106682489A (en) | Password security detection method, password security reminding method and corresponding devices | |
CN109858248A (en) | Malicious Word document detection method and device | |
CN107786537A (en) | Isolated page implantation attack detection method based on Internet cross search | |
CN109922065A (en) | Method for quickly identifying malicious websites | |
CN106060038B (en) | Detection method for phishing site based on client-side program behavioural analysis | |
Kumar et al. | eDarkFind: Unsupervised multi-view learning for sybil account detection | |
JP7182764B2 (en) | Fraudulent web page detection device, control method and control program for fraudulent web page detection device | |
CN106357682A (en) | Phishing website detecting method | |
Sonowal et al. | MASPHID: a model to assist screen reader users for detecting phishing sites using aural and visual similarity measures | |
CN114048480A (en) | Vulnerability detection method, device, equipment and storage medium | |
Thao et al. | Hunting brand domain forgery: a scalable classification for homograph attack |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180424 |