CN106411868A - Method for automatically identifying web crawler - Google Patents
- Publication number: CN106411868A
- Application number: CN201610831757.4A
- Authority: CN (China)
- Prior art keywords: code, cookie, page, web crawler, home page
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1466—Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks
Abstract
The invention discloses a method for automatically identifying a web crawler. The method comprises the following steps: step 1, in response to a request for the home page, the server returns a page containing only JS code, wherein the code is located in an onload function and is executed after the page has loaded completely; step 2, the JS code of step 1 uses a certain algorithm to set a cookie field and then uses window.location to jump to the home page; the server verifies that the cookie is legal and returns another piece of JS code, which uses another algorithm to set a further cookie field; step 3, when all cookie fields are legal, the normal home page URL is returned; and step 4, if the client performs no redirection, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler. The method can block the access of most static crawlers: a crawler that cannot execute the JS code of the home page can only fetch the JS-only page returned by the server, and thus cannot obtain the real home page.
Description
Technical field
The present invention relates to the field of web crawlers, and in particular to a method for automatically identifying a web crawler.
Background
Websites currently identify web crawlers in many different ways. The most effective and most widely used approach is to present an interactive component, such as a verification code (CAPTCHA), to determine whether the client is a real user or a web crawler. However, this approach degrades the user's browsing experience to some extent.
When crawling the pages of a website, a crawler fetches the home page. Moreover, because a crawler generally does not fetch pages with the same URL more than once, this behavior can be used to identify whether a request comes from a crawler program. In the prior art, hidden links are placed in a page as honeypots to identify crawlers, or characteristic information of the crawler (HTTP headers, etc.) is used as the basis for identification. However, hidden links can be recognized by crawlers, and analyzing header information consumes additional resources.
Related terms:
onload: the browser executes the function in onload after the page has finished loading.
Crawler: an application program used to fetch web page information.
Redirection: relocating a network request to another location by various means (for example, web page redirection or domain name redirection).
Web page de-duplication: while fetching web page information, a crawler computes the similarity of two pages to judge whether they are similar or identical, thereby avoiding repeated crawling.
URL: Uniform Resource Locator, commonly known as a web address.
Cookie: data that a website stores on the user's side in order to identify the user.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for automatically identifying a web crawler that intercepts requests from web crawlers by means of repeated redirection and cookie setting, without affecting the user's browsing experience.
To solve the above technical problem, the present invention adopts the following technical solution:
A method for automatically identifying a web crawler, comprising the following steps:
Step 1: In response to a request for the home page, the server returns a page containing only JS code. This code is located in the onload function and is executed after the page has loaded completely.
Step 2: The JS code of step 1 uses a first symmetric encryption algorithm to set a cookie field via the Set-Cookie header, then jumps to the home page using window.location. The server verifies that the cookie is legal and returns another piece of JS code, which uses a second symmetric encryption algorithm to set a further cookie field.
Step 3: When all cookie fields are legal, the normal home page URL is returned.
Step 4: If the client performs no redirection, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler.
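The four steps above can be sketched from the server's point of view as follows. This is a minimal illustration in Python under stated assumptions, not the patent's implementation: the cookie names `c1` and `c2`, the `SECRET` key, the helper names, and the returned action labels are all hypothetical, and an HMAC over the client IP and User-Agent stands in for the symmetric encryption algorithms named in the text. How the returned JS obtains the value it writes into the cookie is left out of the sketch.

```python
import hashlib
import hmac

SECRET = b"demo-secret"   # hypothetical server-side key
STAGES = ("c1", "c2")     # hypothetical cookie names, one per challenge stage


def expected_value(stage: str, client_ip: str, user_agent: str) -> str:
    """Stand-in for the symmetric cipher: an HMAC over request attributes."""
    msg = f"{stage}|{client_ip}|{user_agent}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]


def handle_home_request(cookies: dict, client_ip: str, user_agent: str) -> dict:
    """Decide what a request for the home-page URL should receive."""
    if "badcookie" in cookies:
        # Client was already marked as a crawler (step 4).
        return {"action": "block", "set_cookie": {}}
    for index, stage in enumerate(STAGES):
        value = cookies.get(stage)
        if value is None:
            # Next unsolved stage: serve a page that contains only JS (steps 1 and 2).
            return {"action": f"challenge_{index + 1}", "set_cookie": {}}
        expected = expected_value(stage, client_ip, user_agent)
        if not hmac.compare_digest(value.encode(), expected.encode()):
            # Incorrect cookie value: set badcookie and mark as crawler (step 4).
            return {"action": "block", "set_cookie": {"badcookie": "1"}}
    # All cookie fields are legal: return the normal home page (step 3).
    return {"action": "real_home_page", "set_cookie": {}}
```

A client that never executes the JS keeps receiving `challenge_1`, which mirrors the case in which no redirection takes place.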
According to the above scheme, steps 1, 2 and 3 are repeated several times, but the number of repetitions does not exceed the redirection limit set by the browser.
According to the above scheme, the first symmetric encryption algorithm is one of DES, TripleDES, RC2, RC4, RC5 and Blowfish; the second symmetric encryption algorithm is also one of DES, TripleDES, RC2, RC4, RC5 and Blowfish, and differs from the first symmetric encryption algorithm.
Compared with the prior art, the present invention has the following beneficial effects: 1) It can block the access of most static crawlers; a crawler that cannot execute the JS code of the home page can only fetch the JS-only page returned by the server and cannot obtain the real home page. 2) As long as a crawler has a de-duplication function, a jump to the same page prevents it from continuing to crawl. 3) The pages to which the method applies include, but are not limited to, the home page; it can be used on any page of a website to effectively prevent crawlers from collecting information.
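Effects 1) and 2) can be illustrated with a toy simulation. The sketch below is an assumption-laden example, not part of the patent: `fetch`, `static_crawl`, the page contents, and the cookie name `c1` are hypothetical. It models a static crawler that does not run JavaScript, extracts link targets with regular expressions, and never fetches the same URL twice.

```python
import re

# Hypothetical challenge page: its onload handler sets a cookie, then reloads "/".
CHALLENGE_PAGE = (
    "<html><body onload=\"document.cookie='c1=ok'; "
    "window.location='/'\"></body></html>"
)
REAL_PAGE = "<html><body>real home page <a href='/news'>news</a></body></html>"


def fetch(url: str, cookies: dict) -> str:
    """Toy server: the home page is only served once the cookie is present."""
    if url == "/" and cookies.get("c1") != "ok":
        return CHALLENGE_PAGE
    return REAL_PAGE


def static_crawl(start_url: str) -> dict:
    """Breadth-first crawl with URL de-duplication and no JS execution."""
    visited = set()
    queue = [start_url]
    pages = {}
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue  # de-duplication: never fetch the same URL twice
        visited.add(url)
        body = fetch(url, cookies={})  # no JS engine, so no cookie is ever set
        pages[url] = body
        # Follow hyperlinks and any literal redirect target found in the markup.
        queue.extend(re.findall(r"href='([^']+)'", body))
        queue.extend(re.findall(r"window\.location='([^']+)'", body))
    return pages
```

Even though this crawler notices the redirect target, that target is the URL it has already visited, so the crawl ends with only the challenge page collected.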
Brief description of the drawings
Fig. 1 is a schematic flow chart of the method for automatically identifying a web crawler according to the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawing and specific embodiments. JavaScript embedded in the web page redirects one or more times to the same page while a status code is returned, so that a crawler cannot crawl the page normally because of its de-duplication. The cookie or badcookie set by the JavaScript code executed in onload is used to identify whether a request comes from a crawler.
In response to a request for the home page, the server returns a page containing only JS code (code written in the JavaScript scripting language). This code is located in the onload function and is executed after the page has loaded completely. The JS code uses a certain algorithm (with information such as the IP address and headers as algorithm parameters) to set a cookie field, then jumps to the home page (that is, this same page) using window.location. The server verifies that the cookie is legal and returns another piece of JS, which sets a further cookie field using a different algorithm. Depending on the needs of the website, the preceding steps may be repeated several times, but the number of repetitions must not exceed the redirection limit set by the browser. Only when all cookie fields are legal is the normal home page URL returned. If the client performs no redirection, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler. In addition, crawlers can be identified from the number of requests in the server's request log; for example, a client whose first GET request already contains all the correct cookies must be a crawler.
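The request-count check described at the end of the paragraph above might look like the following sketch. The class `RequestLog`, the function `looks_replayed`, the cookie names, and the use of a per-client identifier are all hypothetical; the sketch also assumes that the cookie values themselves are validated elsewhere.

```python
from collections import defaultdict

REQUIRED_COOKIES = ("c1", "c2")  # hypothetical names, one per challenge stage


class RequestLog:
    """Counts requests per client, standing in for the server's request log."""

    def __init__(self) -> None:
        self._counts = defaultdict(int)

    def record(self, client_id: str) -> int:
        """Register one request and return the running total for this client."""
        self._counts[client_id] += 1
        return self._counts[client_id]


def looks_replayed(log: RequestLog, client_id: str, cookies: dict,
                   min_requests: int = len(REQUIRED_COOKIES) + 1) -> bool:
    """Flag a client that presents every cookie before it could have earned them.

    With two challenge stages, a genuine browser needs at least three requests:
    one answered by the first challenge, one by the second, one by the real page.
    """
    seen = log.record(client_id)
    has_all = all(name in cookies for name in REQUIRED_COOKIES)
    return has_all and seen < min_requests
```

A client whose very first GET request carries both cookies is flagged, while a browser that passes through the challenges in order is not.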
The algorithms involved in the present invention are symmetric encryption algorithms, mainly DES, TripleDES, RC2, RC4, RC5 and Blowfish. To prevent a user from visiting the page in a browser beforehand to obtain the correct cookies, a page with the above function can be added under every directory level of the website, strengthening the anti-crawler effect.
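As an illustration of deriving a cookie value with one of the listed ciphers, the sketch below implements RC4 in pure Python and applies it to the client IP and User-Agent. RC4 is chosen only because it appears in the list and is short enough to write without a third-party library. The helper names `make_cookie_value` and `is_cookie_valid`, the key, and the `ip|user-agent` plaintext layout are assumptions. In the described flow the value would be computed by JS in the browser; Python is used here only to show the computation and the matching server-side check.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data with RC4 (the same operation does both)."""
    # Key-scheduling algorithm (KSA): permute the state using the key.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    # Pseudo-random generation algorithm (PRGA): XOR the keystream with the input.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)


def make_cookie_value(key: bytes, client_ip: str, user_agent: str) -> str:
    """Encrypt request attributes and hex-encode them for use as a cookie value."""
    plaintext = f"{client_ip}|{user_agent}".encode()
    return rc4(key, plaintext).hex()


def is_cookie_valid(key: bytes, cookie_value: str,
                    client_ip: str, user_agent: str) -> bool:
    """Decrypt the cookie and compare it with the attributes of this request."""
    try:
        decrypted = rc4(key, bytes.fromhex(cookie_value))
    except ValueError:  # the cookie is not valid hexadecimal
        return False
    return decrypted == f"{client_ip}|{user_agent}".encode()
```

Because the plaintext includes the client's IP address and User-Agent, a cookie copied to a different client no longer validates in this sketch.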
Claims (3)
1. A method for automatically identifying a web crawler, characterized by comprising the following steps:
Step 1: In response to a request for the home page, the server returns a page containing only JS code; this code is located in the onload function and is executed after the page has loaded completely;
Step 2: The JS code of step 1 uses a first symmetric encryption algorithm to set a cookie field via the Set-Cookie header, then jumps to the home page using window.location; the server verifies that the cookie is legal and returns another piece of JS code, which uses a second symmetric encryption algorithm to set a further cookie field;
Step 3: When all cookie fields are legal, the normal home page URL is returned;
Step 4: If the client performs no redirection, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler.
2. The method for automatically identifying a web crawler according to claim 1, characterized in that steps 1, 2 and 3 are repeated several times, but the number of repetitions does not exceed the redirection limit set by the browser.
3. The method for automatically identifying a web crawler according to claim 1 or 2, characterized in that the first symmetric encryption algorithm is one of DES, TripleDES, RC2, RC4, RC5 and Blowfish, and the second symmetric encryption algorithm is one of DES, TripleDES, RC2, RC4, RC5 and Blowfish and differs from the first symmetric encryption algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610831757.4A CN106411868A (en) | 2016-09-19 | 2016-09-19 | Method for automatically identifying web crawler |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610831757.4A CN106411868A (en) | 2016-09-19 | 2016-09-19 | Method for automatically identifying web crawler |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106411868A true CN106411868A (en) | 2017-02-15 |
Family ID: 57996638
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610831757.4A Pending CN106411868A (en) | 2016-09-19 | 2016-09-19 | Method for automatically identifying web crawler |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106411868A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107147640A (en) * | 2017-05-09 | 2017-09-08 | 网宿科技股份有限公司 | Recognize the method and system of web crawlers |
CN111181933A (en) * | 2019-12-19 | 2020-05-19 | 贝壳技术有限公司 | Web crawler detection method and device, storage medium and electronic equipment |
CN111355728A (en) * | 2020-02-27 | 2020-06-30 | 紫光云技术有限公司 | Malicious crawler protection method |
CN112398963A (en) * | 2020-10-13 | 2021-02-23 | 易讯科技股份有限公司 | Method for realizing intelligent recognition and flexible translation of IPv4 external link |
CN112437036A (en) * | 2020-01-21 | 2021-03-02 | 上海哔哩哔哩科技有限公司 | Data analysis method and equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070005606A1 (en) * | 2005-06-29 | 2007-01-04 | Shivakumar Ganesan | Approach for requesting web pages from a web server using web-page specific cookie data |
CN101089856A (en) * | 2007-07-20 | 2007-12-19 | 李沫南 | Method for abstracting network data and web reptile system |
US20080178162A1 (en) * | 2007-01-18 | 2008-07-24 | Aol Llc | Server evaluation of client-side script |
US7546370B1 (en) * | 2004-08-18 | 2009-06-09 | Google Inc. | Search engine with multiple crawlers sharing cookies |
CN102833212A (en) * | 2011-06-14 | 2012-12-19 | 阿里巴巴集团控股有限公司 | Webpage visitor identity identification method and system |
CN103888490A (en) * | 2012-12-20 | 2014-06-25 | 上海天泰网络技术有限公司 | Automatic WEB client man-machine identification method |
CN105743901A (en) * | 2016-03-07 | 2016-07-06 | 携程计算机技术(上海)有限公司 | Server, anti-crawler system and anti-crawler verification method |
CN105897694A (en) * | 2016-03-25 | 2016-08-24 | 网宿科技股份有限公司 | Session identification method and system of client |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107147640A (en) * | 2017-05-09 | 2017-09-08 | 网宿科技股份有限公司 | Recognize the method and system of web crawlers |
CN107147640B (en) * | 2017-05-09 | 2019-12-31 | 网宿科技股份有限公司 | Method and system for identifying web crawler |
CN111181933A (en) * | 2019-12-19 | 2020-05-19 | 贝壳技术有限公司 | Web crawler detection method and device, storage medium and electronic equipment |
CN112437036A (en) * | 2020-01-21 | 2021-03-02 | 上海哔哩哔哩科技有限公司 | Data analysis method and equipment |
CN112437036B (en) * | 2020-01-21 | 2023-01-24 | 上海哔哩哔哩科技有限公司 | Data analysis method and equipment |
CN111355728A (en) * | 2020-02-27 | 2020-06-30 | 紫光云技术有限公司 | Malicious crawler protection method |
CN111355728B (en) * | 2020-02-27 | 2023-01-03 | 紫光云技术有限公司 | Malicious crawler protection method |
CN112398963A (en) * | 2020-10-13 | 2021-02-23 | 易讯科技股份有限公司 | Method for realizing intelligent recognition and flexible translation of IPv4 external link |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106411868A (en) | Method for automatically identifying web crawler | |
US10567407B2 (en) | Method and system for detecting malicious web addresses | |
US11727114B2 (en) | Systems and methods for remote detection of software through browser webinjects | |
US9509714B2 (en) | Web page and web browser protection against malicious injections | |
CN104881603B (en) | Webpage redirects leak detection method and device | |
US20200036799A1 (en) | System and method for main page identification in web decoding | |
US9544316B2 (en) | Method, device and system for detecting security of download link | |
CN102833212B (en) | Webpage visitor identity identification method and system | |
CN103279710B (en) | Method and system for detecting malicious codes of Internet information system | |
US20090024748A1 (en) | Website monitoring and cookie setting | |
CN102436564A (en) | Method and device for identifying falsified webpage | |
CN102739653B (en) | Detection method and device aiming at webpage address | |
CN105760379B (en) | Method and device for detecting webshell page based on intra-domain page association relation | |
Yusof et al. | Preventing persistent Cross-Site Scripting (XSS) attack by applying pattern filtering approach | |
US20190222607A1 (en) | System and method to detect and block bot traffic | |
CN102968584B (en) | A kind of method and apparatus of log-on webpage | |
CN110442286B (en) | Page display method and device and electronic equipment | |
Kaur et al. | Browser fingerprinting as user tracking technology | |
CN104679747A (en) | Detection device and method for website redirection | |
CN112637361A (en) | Page proxy method, device, electronic equipment and storage medium | |
CN111143722A (en) | Method, device, equipment and medium for detecting webpage hidden link | |
US20150046787A1 (en) | Url tagging based on user behavior | |
US9396170B2 (en) | Hyperlink data presentation | |
CN103929498A (en) | Method and device for processing client requests | |
CN112287349A (en) | Security vulnerability detection method and server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | C10 | Entry into substantive examination | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170215 |