CN106411868A

CN106411868A - Method for automatically identifying web crawler

Info

Publication number: CN106411868A
Application number: CN201610831757.4A
Authority: CN
Inventors: 周雨晨
Original assignee: Chengdu Zhidaochuangyu Information Technology Co Ltd
Current assignee: Chengdu Zhidaochuangyu Information Technology Co Ltd
Priority date: 2016-09-19
Filing date: 2016-09-19
Publication date: 2017-02-15

Abstract

The invention discloses a method for automatically identifying web crawlers. The method comprises the following steps: step 1, in response to a request for the home page, the server returns a page containing only JS code, the code being located in an onload function and executed after the page has loaded completely; step 2, the JS code of step 1 uses a certain algorithm to set a cookie field and then uses window.location to jump to the home page; the server checks that the cookie is valid and returns another piece of JS code, which uses another algorithm to set a cookie field; step 3, when all cookie fields are valid, the normal home page URL is returned; and step 4, when the client performs no redirection, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler. The method can block access by most static crawlers: a crawler that cannot execute the JS code of the home page can only fetch the JS-only page returned by the server and therefore cannot obtain the real home page.

Description

A method for automatically identifying web crawlers

Technical field

The present invention relates to the field of web crawlers, and in particular to a method for automatically identifying web crawlers.

Background art

Websites currently use a variety of methods to identify web crawlers. The most effective and widely used is to provide an interactive component, such as a verification code (CAPTCHA), to distinguish whether the client is a real user or a web crawler. However, this approach degrades the user's browsing experience to some extent.

When crawling a website's pages, a crawler will crawl the home page. Moreover, because crawlers generally do not repeatedly crawl pages with the same URL, this behavior can be used to identify whether a request comes from a crawler program. In the prior art, hidden links are placed in pages as honeypots to identify crawlers, or characteristic information of the crawler (HTTP headers, etc.) is used as the basis for identification. However, hidden links can be detected, and evaluating header information consumes additional resources.

Related terms:

onload: the browser executes the function in onload after the page has finished loading.

Crawler: an application program used to fetch web page information.

Redirection: relocating a network request to another location by various means (for example, web page redirection or domain name redirection).

Web page deduplication: while fetching web page information, a crawler computes the similarity of two pages to judge whether they are similar or identical, thereby avoiding repeated crawling.

URL: Uniform Resource Locator, commonly known as a web address.

Cookie: data that a website stores on the user's side in order to identify the user.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for automatically identifying web crawlers that intercepts requests from web crawlers through repeated redirection and cookie setting, without affecting the user's browsing experience.

To solve the above technical problem, the technical solution adopted by the present invention is:

A method for automatically identifying web crawlers, comprising the following steps:

Step 1: In response to a request for the home page, the server returns a page containing only JS code. This code is located in the onload function and is executed after the page has loaded completely.

Step 2: The JS code of step 1 uses a first symmetric encryption algorithm to set a cookie field via the Set-Cookie header, then jumps to the home page using window.location. The server checks that the cookie is valid and returns another piece of JS code, which uses a second symmetric encryption algorithm to set a cookie field.

Step 3: When all cookie fields are valid, the normal home page URL is returned.

Step 4: If the client performs no redirection, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler.

According to the above scheme, steps 1, 2 and 3 are repeated several times, but the number of repetitions does not exceed the redirection limit set by the browser.

According to the above scheme, the first symmetric encryption algorithm is one of DES, TripleDES, RC2, RC4, RC5 and Blowfish, and the second symmetric encryption algorithm is one of DES, TripleDES, RC2, RC4, RC5 and Blowfish and differs from the first symmetric encryption algorithm.
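The server-side decision in steps 2 to 4 can be sketched as follows. This is only an illustration under stated assumptions: the cookie names `ck1` and `ck2`, the key `SECRET`, and the functions `expected_value` and `next_action` are hypothetical, and keyed digests from the Python standard library (HMAC-SHA256 and HMAC-SHA512) stand in for the two different symmetric encryption algorithms named above, which the standard library does not provide. The case in which the client never performs the redirection is not modeled.

```python
import hashlib
import hmac

SECRET = b"demo-secret"            # hypothetical server-side key
# One algorithm per challenge stage; the second differs from the first.
# These keyed digests are stand-ins for the symmetric ciphers in the text.
STAGE_DIGESTS = [hashlib.sha256, hashlib.sha512]
COOKIE_NAMES = ["ck1", "ck2"]      # hypothetical cookie field names


def expected_value(stage: int, client_ip: str, user_agent: str) -> str:
    """Cookie value the client must present for the given stage."""
    message = f"{stage}|{client_ip}|{user_agent}".encode()
    return hmac.new(SECRET, message, STAGE_DIGESTS[stage]).hexdigest()


def next_action(cookies: dict, client_ip: str, user_agent: str) -> str:
    """Return 'challenge-N', 'serve-home', or 'mark-crawler'."""
    for stage, name in enumerate(COOKIE_NAMES):
        if name not in cookies:
            # This stage has not been passed yet: send its JS-only page.
            return f"challenge-{stage + 1}"
        expected = expected_value(stage, client_ip, user_agent)
        if not hmac.compare_digest(cookies[name].encode(), expected.encode()):
            # Step 4: incorrect cookie value, set badcookie and mark as crawler.
            return "mark-crawler"
    # Step 3: every cookie field is valid, so the normal home page is served.
    return "serve-home"
```

The number of entries in `COOKIE_NAMES` corresponds to the number of repetitions of steps 1 to 3, so in a real deployment it would have to stay below the browser's redirection limit.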

Compared with the prior art, the beneficial effects of the present invention are: 1) Access by most static crawlers can be blocked: if a crawler cannot execute the JS code of the home page, it can only fetch the JS-only page returned by the server and cannot obtain the real home page. 2) If a crawler has a deduplication function, the jump to the same page prevents it from continuing to crawl. 3) The pages to which the method applies include, but are not limited to, the home page; the method can be used on any page of a website, effectively preventing crawlers from collecting information.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method for automatically identifying web crawlers according to the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawing and specific embodiments. JavaScript embedded in the web page redirects to the same page one or more times while a status code is returned, so that a crawler cannot crawl the page normally because of deduplication. Whether a request comes from a crawler is identified from the cookie set by the JavaScript code executed in onload, or from the badcookie.
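The effect of deduplication on a static crawler can be illustrated with a small simulation. Everything here is a hypothetical sketch: `fake_server` stands in for a site protected by a single challenge stage, and `static_crawl` models a crawler that does not execute JS and never revisits a URL. Even when the crawler extracts the redirect target from the script text, the target is the URL it has already visited, so it is dropped.

```python
import re


def fake_server(url: str, cookies: dict) -> str:
    """Hypothetical server: the real page is served only with the challenge cookie."""
    if cookies.get("ck1") == "ok":
        return "<html><body>real home page <a href='/news'>news</a></body></html>"
    return ("<html><body onload=\"document.cookie='ck1=ok';"
            "window.location='/'\"></body></html>")


def static_crawl(start: str) -> dict:
    """Crawl without executing JS, skipping URLs that were already visited."""
    visited, queue, pages = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue  # dedup: the redirect back to the same URL is dropped here
        visited.add(url)
        body = fake_server(url, cookies={})  # no JS runs, so no cookie is ever set
        pages[url] = body
        # Follow ordinary links and any literal redirect target found in the JS.
        queue += re.findall(r"href='([^']+)'", body)
        queue += re.findall(r"window\.location='([^']+)'", body)
    return pages
```

Calling `static_crawl("/")` in this simulation yields only the JS-only page for `/`; the real home page and its link to `/news` are never reached.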

In response to a request for the home page, the server returns a page containing only JS code (a code fragment written in the JavaScript scripting language). This code is located in the onload function and is executed after the page has loaded completely. The JS code uses a certain algorithm (with information such as the IP address and headers as algorithm parameters) to set a cookie field, and then jumps to the home page (that is, to this same page) using window.location. The server checks that the cookie is valid and returns another piece of JS, which sets a cookie field using another algorithm. Depending on the needs of the website, the preceding steps can be repeated several times, but not more often than the redirection limit set by the browser. Only when all cookie fields are valid is the normal home page URL returned. If the client performs no redirection, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler. A crawler can also be identified from the number of requests in the server's request log: for example, a client whose first GET request already contains all the correct cookies must be a crawler.
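A minimal sketch of the JS-only page and of the first-request check described above is given below. The function names `challenge_page` and `first_request_is_crawler`, their parameters, and the way the server embeds the cookie value directly in the script are assumptions made for illustration; the text does not specify how the JS code obtains the value. The script writes the cookie through `document.cookie`, which is how page script sets cookies in a browser.

```python
import json


def challenge_page(cookie_name: str, cookie_value: str, target: str = "/") -> str:
    """Build a page whose only content is JS that runs from window.onload."""
    # json.dumps is used here to produce quoted JS string literals.
    cookie = json.dumps(f"{cookie_name}={cookie_value}; path=/")
    script = (
        "window.onload = function () {"
        f" document.cookie = {cookie};"           # set the cookie field
        f" window.location = {json.dumps(target)}; "  # jump back to the page
        "};"
    )
    return (
        "<!DOCTYPE html><html><head>"
        f"<script>{script}</script>"
        "</head><body></body></html>"
    )


def first_request_is_crawler(request_count: int, cookies: dict, required: dict) -> bool:
    """True when a client's very first GET already carries every correct cookie."""
    has_all = all(cookies.get(name) == value for name, value in required.items())
    return request_count == 1 and has_all
```

Here `request_count` is assumed to come from the server's request log for the client in question, and `required` maps each cookie field to the value the server expects for that client.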

The algorithms involved in the present invention are symmetric encryption algorithms, mainly DES, TripleDES, RC2, RC4, RC5 and Blowfish. To prevent a user from obtaining the correct cookies in advance by visiting the page in a browser, a page with the above function can be added at every level of the website's page directory, strengthening the anti-crawler effect.

Claims

1. A method for automatically identifying web crawlers, characterized by comprising the following steps:

Step 1: In response to a request for the home page, the server returns a page containing only JS code, the code being located in the onload function and executed after the page has loaded completely;

Step 2: The JS code of step 1 uses a first symmetric encryption algorithm to set a cookie field via the Set-Cookie header, then jumps to the home page using window.location; the server checks that the cookie is valid and returns another piece of JS code, which uses a second symmetric encryption algorithm to set a cookie field;

Step 3: When all cookie fields are valid, the normal home page URL is returned;

Step 4: If the client performs no redirection, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler.

2. The method for automatically identifying web crawlers according to claim 1, characterized in that steps 1, 2 and 3 are repeated several times, but the number of repetitions does not exceed the redirection limit set by the browser.

3. The method for automatically identifying web crawlers according to claim 1 or 2, characterized in that the first symmetric encryption algorithm is one of DES, TripleDES, RC2, RC4, RC5 and Blowfish, and the second symmetric encryption algorithm is one of DES, TripleDES, RC2, RC4, RC5 and Blowfish and differs from the first symmetric encryption algorithm.