CN106411868A - Method for automatically identifying web crawler - Google Patents

Info

Publication number
CN106411868A
Authority
CN
China
Prior art keywords
code
cookie
page
web crawler
home page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610831757.4A
Other languages
Chinese (zh)
Inventor
周雨晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Zhidaochuangyu Information Technology Co Ltd
Original Assignee
Chengdu Zhidaochuangyu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-09-19
Filing date
2016-09-19
Publication date
2017-02-15
Application filed by Chengdu Zhidaochuangyu Information Technology Co Ltd filed Critical Chengdu Zhidaochuangyu Information Technology Co Ltd
Priority to CN201610831757.4A priority Critical patent/CN106411868A/en
Publication of CN106411868A publication Critical patent/CN106411868A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441Countermeasures against malicious traffic
    • H04L63/1466Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks

Abstract

The invention discloses a method for automatically identifying a web crawler. The method comprises the following steps: step 1, in response to a request for the home page, the server returns a page containing only JS code, wherein the code is located in an onload function and is executed after the page has loaded completely; step 2, the JS code of step 1 uses a certain algorithm to set a cookie field and then uses window.location to jump to the home page; the server checks that the cookie is valid and returns another piece of JS code, which uses another algorithm to set a cookie field; step 3, when all cookie fields are valid, the normal home page URL is returned; and step 4, when the client performs no redirection, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler. The method can block access by most static crawlers: a crawler that cannot execute the JS code of the home page can only fetch the JS-only page returned by the server and therefore cannot obtain the real home page.

Description

A method for automatically identifying web crawlers
Technical field
The present invention relates to the field of web crawlers, and in particular to a method for automatically identifying web crawlers.
Background
Websites currently use a variety of methods to identify web crawlers. The most effective and most widely used is to present an interactive component, such as a CAPTCHA, to determine whether the client is a real user or a web crawler. However, this approach degrades the user's browsing experience to some extent.
When crawling the pages of a website, a crawler will crawl the home page. Because crawlers generally do not repeatedly crawl pages with the same URL, this behavior can be used to identify whether a request comes from a crawler program. In the prior art, hidden links are placed in the page as honeypots to identify crawlers, or the characteristic information of the crawler (HTTP headers, etc.) is used as the basis for identification. However, hidden links can be detected, and analyzing header information consumes additional resources.
Related terms:
onload: the browser executes the function assigned to onload after the page has finished loading.
Crawler: an application program used to fetch web page information.
Redirect: redirecting a network request to another location by various means (for example, web page redirection or domain name redirection).
Web page deduplication: while fetching web page information, a crawler computes the similarity of two pages to judge whether they are similar or identical, thereby avoiding repeated crawling.
URL: Uniform Resource Locator, commonly called a web address.
Cookie: data that a website stores on the user's side in order to identify the user.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for automatically identifying web crawlers that intercepts requests from web crawlers by means of multiple redirects and cookie setting, without affecting the user's browsing experience.
To solve the above technical problem, the technical solution adopted by the present invention is:
A method for automatically identifying web crawlers, comprising the following steps:
Step 1: In response to a request for the home page, the server returns a page containing only JS code; this code is located in an onload function and is executed after the page has loaded completely;
Step 2: The JS code of step 1 uses a first symmetric encryption algorithm to set a cookie field via the Set-Cookie header, then jumps to the home page using window.location; the server checks that the cookie is valid and returns another piece of JS code, which uses a second symmetric encryption algorithm to set a cookie field;
Step 3: When all cookie fields are valid, the normal home page URL is returned;
Step 4: If the client performs no redirect, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler.
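The server-side decision in steps 1 to 4 can be sketched as a small function. This is a minimal illustration under stated assumptions, not the implementation described by the patent: the function name decide, the action strings, and the cookie names are all hypothetical, and the validators stand in for whatever algorithm the site uses. Detecting a client that never performs the redirect requires tracking across requests and is not shown here.

```python
from typing import Callable, Mapping, Sequence, Tuple

# Hypothetical name for the marker cookie; the source only calls it "badcookie".
BAD_COOKIE = "badcookie"


def decide(
    cookies: Mapping[str, str],
    stages: Sequence[Tuple[str, Callable[[str], bool]]],
) -> Tuple[str, int]:
    """Return (action, stage_index) for one request to the home page.

    stages is an ordered list of (cookie_name, validator) pairs, one per
    challenge round. Possible actions:
      "challenge" - serve the JS-only page for stage_index (steps 1 and 2)
      "real_page" - every cookie field is valid (step 3)
      "crawler"   - a cookie is wrong or badcookie is already set (step 4)
    """
    if BAD_COOKIE in cookies:
        return "crawler", -1
    for index, (name, is_valid) in enumerate(stages):
        value = cookies.get(name)
        if value is None:
            # The client has not completed this round yet: send its JS page.
            return "challenge", index
        if not is_valid(value):
            return "crawler", index
    return "real_page", len(stages)
```

With two rounds, a request without cookies yields ("challenge", 0), a request carrying a valid first cookie yields ("challenge", 1), and only a request carrying both valid cookies yields ("real_page", 2).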
According to the above scheme, steps 1, 2 and 3 are repeated several times, but without exceeding the redirection limit set by the browser.
According to the above scheme, the first symmetric encryption algorithm is one of DES, TripleDES, RC2, RC4, RC5 and Blowfish, and the second symmetric encryption algorithm is one of DES, TripleDES, RC2, RC4, RC5 and Blowfish and differs from the first symmetric encryption algorithm.
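As an illustration of deriving a cookie value with one of the listed algorithms, the sketch below implements RC4 in pure Python and applies it to request attributes. The choice of RC4, the key handling, the "ip|user-agent" input format, and the function names cookie_value and cookie_is_valid are assumptions made for this example; the text does not specify them. Only one round is shown, whereas the scheme requires a different algorithm for the second cookie.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data with RC4 (the same operation in both directions)."""
    if not key:
        raise ValueError("key must not be empty")
    # Key-scheduling algorithm (KSA): permute 0..255 under the key.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    # Pseudo-random generation algorithm (PRGA): XOR the keystream into the data.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)


def cookie_value(key: bytes, client_ip: str, user_agent: str) -> str:
    """Derive a hex cookie value from request attributes (illustrative format)."""
    return rc4(key, f"{client_ip}|{user_agent}".encode("utf-8")).hex()


def cookie_is_valid(key: bytes, value: str, client_ip: str, user_agent: str) -> bool:
    """Recompute the expected value on the server and compare it with the cookie."""
    return value == cookie_value(key, client_ip, user_agent)
```

Because the value depends on the client IP and headers, a cookie copied from one client does not validate for another. RC4 is used here only because the text lists it and it fits in a few lines; it is no longer regarded as a strong cipher.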
Compared with the prior art, the beneficial effects of the present invention are:
1) Access by most static crawlers can be blocked. A crawler that cannot execute the JS code of the home page can only fetch the JS-only page returned by the server and cannot obtain the real home page.
2) Any crawler with a deduplication function cannot continue crawling, because the redirect leads to the same page.
3) The method is applicable to pages including but not limited to the home page; it can be used on any page of a website to effectively prevent crawlers from collecting information.
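The deduplication effect can be illustrated with a toy model. The sketch assumes a crawler that follows the jump but refuses to fetch a URL it has already seen; the Fetcher type, crawl_with_dedup, and js_only_home_page are hypothetical names, and the "server" here is only a stand-in function.

```python
from typing import Callable, List, Optional, Set, Tuple

# A fetcher returns (body, redirect_target); redirect_target is None
# when the response does not lead anywhere else.
Fetcher = Callable[[str], Tuple[str, Optional[str]]]


def crawl_with_dedup(start_url: str, fetch: Fetcher, max_hops: int = 10) -> List[str]:
    """Follow redirects, but never fetch the same URL twice."""
    seen: Set[str] = set()
    bodies: List[str] = []
    url: Optional[str] = start_url
    hops = 0
    while url is not None and url not in seen and hops < max_hops:
        seen.add(url)
        body, url = fetch(url)
        bodies.append(body)
        hops += 1
    return bodies


def js_only_home_page(url: str) -> Tuple[str, Optional[str]]:
    """Toy server: always answers with the JS-only page, which points back at itself."""
    return "<script>/* challenge */</script>", url
```

Calling crawl_with_dedup("/", js_only_home_page) returns a list containing only the challenge page: the jump target is the URL already in the seen set, so the crawler stops before the server would have served the real page.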
Brief description of the drawings
Fig. 1 is a schematic flow chart of the method for automatically identifying web crawlers according to the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawing and specific embodiments. JavaScript embedded in the web page redirects once or several times to the same page while a status code is returned, so that a crawler cannot crawl the page normally because of its deduplication. Whether a request comes from a crawler is identified from the cookie or badcookie set by the JavaScript code executed in onload.
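A page of this kind might be generated as follows. This is a sketch under assumptions: the function name challenge_page and the cookie attributes are hypothetical, and for brevity the server embeds a precomputed cookie value, whereas the method describes the JS code itself applying an algorithm.

```python
import json


def _js_string(text: str) -> str:
    """Render text as a JavaScript string literal that is safe inside <script>."""
    # json.dumps handles quoting; escaping "</" keeps the value from closing the tag.
    return json.dumps(text).replace("</", "<\\/")


def challenge_page(cookie_name: str, cookie_value: str, target_url: str) -> str:
    """Build an HTML page whose only content is an onload script."""
    cookie_js = _js_string(f"{cookie_name}={cookie_value}; path=/")
    target_js = _js_string(target_url)
    return (
        "<!DOCTYPE html>\n"
        "<html><head><script>\n"
        "window.onload = function () {\n"
        f"  document.cookie = {cookie_js};\n"
        f"  window.location = {target_js};\n"
        "};\n"
        "</script></head><body></body></html>\n"
    )
```

For example, challenge_page("c1", "abc", "/") produces a page that sets the cookie c1 and then reloads the home page, at which point the browser sends the new cookie along with the request.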
In response to a request for the home page, the server returns a page containing only JS code (script code written in JavaScript). This code is located in an onload function and is executed after the page has loaded completely. The JS code uses a certain algorithm (with information such as the IP address and headers as algorithm parameters) to set a cookie field, then jumps to the home page (this same page) using window.location. The server checks that the cookie is valid and returns another piece of JS code, which sets a cookie field using a different algorithm. Depending on the needs of the website, the above steps can be repeated several times, but the number of repetitions must not exceed the redirection limit set by the browser. Only when all cookie fields are valid is the normal home page URL returned. If the client performs no redirect, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler. The number of requests in the server's request log can also be used to identify crawlers; for example, a client whose first GET request already contains all the correct cookies must be a crawler.
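The request-log check mentioned at the end of the paragraph above can be sketched as a small counter. The class name FirstRequestCheck, the use of an in-memory dictionary as the "log", and the choice of client identifier are assumptions for illustration; the cookies passed in are assumed to have been validated already.

```python
from typing import Dict, Mapping, Sequence


class FirstRequestCheck:
    """Flag clients whose very first GET request already carries every cookie."""

    def __init__(self, required_cookies: Sequence[str]) -> None:
        self.required = tuple(required_cookies)
        # Stand-in for the server request log: requests seen per client.
        self.request_counts: Dict[str, int] = {}

    def is_suspicious(self, client_id: str, cookies: Mapping[str, str]) -> bool:
        """Record one request and report whether it matches the crawler pattern."""
        count = self.request_counts.get(client_id, 0) + 1
        self.request_counts[client_id] = count
        has_all = all(name in cookies for name in self.required)
        return count == 1 and has_all
```

A genuine browser needs at least one earlier request per challenge round to obtain its cookies, so it only presents the full set on a later request and is not flagged.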
The algorithms involved in the present invention are symmetric encryption algorithms, chiefly DES, TripleDES, RC2, RC4, RC5 and Blowfish. To prevent a user from visiting the page in a browser beforehand to obtain the correct cookies, a page with the above function can be added at every level of the website's page directory, strengthening the anti-crawler effect.

Claims (3)

1. A method for automatically identifying web crawlers, characterized by comprising the following steps:
Step 1: In response to a request for the home page, the server returns a page containing only JS code; this code is located in an onload function and is executed after the page has loaded completely;
Step 2: The JS code of step 1 uses a first symmetric encryption algorithm to set a cookie field via the Set-Cookie header, then jumps to the home page using window.location; the server checks that the cookie is valid and returns another piece of JS code, which uses a second symmetric encryption algorithm to set a cookie field;
Step 3: When all cookie fields are valid, the normal home page URL is returned;
Step 4: If the client performs no redirect, or a cookie value is incorrect, a badcookie is set and the client is marked as a crawler.
2. The method for automatically identifying web crawlers according to claim 1, characterized in that steps 1, 2 and 3 are repeated several times, but without exceeding the redirection limit set by the browser.
3. The method for automatically identifying web crawlers according to claim 1 or 2, characterized in that the first symmetric encryption algorithm is one of DES, TripleDES, RC2, RC4, RC5 and Blowfish, and the second symmetric encryption algorithm is one of DES, TripleDES, RC2, RC4, RC5 and Blowfish and differs from the first symmetric encryption algorithm.
CN201610831757.4A 2016-09-19 2016-09-19 Method for automatically identifying web crawler Pending CN106411868A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610831757.4A CN106411868A (en) 2016-09-19 2016-09-19 Method for automatically identifying web crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610831757.4A CN106411868A (en) 2016-09-19 2016-09-19 Method for automatically identifying web crawler

Publications (1)

Publication Number Publication Date
CN106411868A true CN106411868A (en) 2017-02-15

Family

ID=57996638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610831757.4A Pending CN106411868A (en) 2016-09-19 2016-09-19 Method for automatically identifying web crawler

Country Status (1)

Country Link
CN (1) CN106411868A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107147640A (en) * 2017-05-09 2017-09-08 网宿科技股份有限公司 Recognize the method and system of web crawlers
CN111181933A (en) * 2019-12-19 2020-05-19 贝壳技术有限公司 Web crawler detection method and device, storage medium and electronic equipment
CN111355728A (en) * 2020-02-27 2020-06-30 紫光云技术有限公司 Malicious crawler protection method
CN112398963A (en) * 2020-10-13 2021-02-23 易讯科技股份有限公司 Method for realizing intelligent recognition and flexible translation of IPv4 external link
CN112437036A (en) * 2020-01-21 2021-03-02 上海哔哩哔哩科技有限公司 Data analysis method and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070005606A1 (en) * 2005-06-29 2007-01-04 Shivakumar Ganesan Approach for requesting web pages from a web server using web-page specific cookie data
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web reptile system
US20080178162A1 (en) * 2007-01-18 2008-07-24 Aol Llc Server evaluation of client-side script
US7546370B1 (en) * 2004-08-18 2009-06-09 Google Inc. Search engine with multiple crawlers sharing cookies
CN102833212A (en) * 2011-06-14 2012-12-19 阿里巴巴集团控股有限公司 Webpage visitor identity identification method and system
CN103888490A (en) * 2012-12-20 2014-06-25 上海天泰网络技术有限公司 Automatic WEB client man-machine identification method
CN105743901A (en) * 2016-03-07 2016-07-06 携程计算机技术(上海)有限公司 Server, anti-crawler system and anti-crawler verification method
CN105897694A (en) * 2016-03-25 2016-08-24 网宿科技股份有限公司 Session identification method and system of client

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7546370B1 (en) * 2004-08-18 2009-06-09 Google Inc. Search engine with multiple crawlers sharing cookies
US20070005606A1 (en) * 2005-06-29 2007-01-04 Shivakumar Ganesan Approach for requesting web pages from a web server using web-page specific cookie data
US20080178162A1 (en) * 2007-01-18 2008-07-24 Aol Llc Server evaluation of client-side script
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web reptile system
CN102833212A (en) * 2011-06-14 2012-12-19 阿里巴巴集团控股有限公司 Webpage visitor identity identification method and system
CN103888490A (en) * 2012-12-20 2014-06-25 上海天泰网络技术有限公司 Automatic WEB client man-machine identification method
CN105743901A (en) * 2016-03-07 2016-07-06 携程计算机技术(上海)有限公司 Server, anti-crawler system and anti-crawler verification method
CN105897694A (en) * 2016-03-25 2016-08-24 网宿科技股份有限公司 Session identification method and system of client

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107147640A (en) * 2017-05-09 2017-09-08 网宿科技股份有限公司 Recognize the method and system of web crawlers
CN107147640B (en) * 2017-05-09 2019-12-31 网宿科技股份有限公司 Method and system for identifying web crawler
CN111181933A (en) * 2019-12-19 2020-05-19 贝壳技术有限公司 Web crawler detection method and device, storage medium and electronic equipment
CN112437036A (en) * 2020-01-21 2021-03-02 上海哔哩哔哩科技有限公司 Data analysis method and equipment
CN112437036B (en) * 2020-01-21 2023-01-24 上海哔哩哔哩科技有限公司 Data analysis method and equipment
CN111355728A (en) * 2020-02-27 2020-06-30 紫光云技术有限公司 Malicious crawler protection method
CN111355728B (en) * 2020-02-27 2023-01-03 紫光云技术有限公司 Malicious crawler protection method
CN112398963A (en) * 2020-10-13 2021-02-23 易讯科技股份有限公司 Method for realizing intelligent recognition and flexible translation of IPv4 external link

Similar Documents

Publication Publication Date Title
CN106411868A (en) Method for automatically identifying web crawler
US10567407B2 (en) Method and system for detecting malicious web addresses
US11727114B2 (en) Systems and methods for remote detection of software through browser webinjects
US9509714B2 (en) Web page and web browser protection against malicious injections
CN104881603B (en) Webpage redirects leak detection method and device
US20200036799A1 (en) System and method for main page identification in web decoding
US9544316B2 (en) Method, device and system for detecting security of download link
CN102833212B (en) Webpage visitor identity identification method and system
CN103279710B (en) Method and system for detecting malicious codes of Internet information system
US20090024748A1 (en) Website monitoring and cookie setting
CN102436564A (en) Method and device for identifying falsified webpage
CN102739653B (en) Detection method and device aiming at webpage address
CN105760379B (en) Method and device for detecting webshell page based on intra-domain page association relation
Yusof et al. Preventing persistent Cross-Site Scripting (XSS) attack by applying pattern filtering approach
US20190222607A1 (en) System and method to detect and block bot traffic
CN102968584B (en) A kind of method and apparatus of log-on webpage
CN110442286B (en) Page display method and device and electronic equipment
Kaur et al. Browser fingerprinting as user tracking technology
CN104679747A (en) Detection device and method for website redirection
CN112637361A (en) Page proxy method, device, electronic equipment and storage medium
CN111143722A (en) Method, device, equipment and medium for detecting webpage hidden link
US20150046787A1 (en) Url tagging based on user behavior
US9396170B2 (en) Hyperlink data presentation
CN103929498A (en) Method and device for processing client requests
CN112287349A (en) Security vulnerability detection method and server

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170215

RJ01 Rejection of invention patent application after publication