CN110555146A - Method and system for generating network crawler camouflage data

Publication number
CN110555146A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810270472.7A
Other languages
Chinese (zh)
Inventor
余姚霖
齐保元
周美林
王斌
李鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-03-29
Filing date
2018-03-29
Publication date
2019-12-10
Application filed by Institute of Information Engineering of CAS

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/50 Address allocation
    • H04L61/5007 Internet protocol [IP] addresses
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/14 Session management
    • H04L67/146 Markers for unambiguous identification of a particular session, e.g. session cookie or URL-encoding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2101/00 Indexing scheme associated with group H04L61/00
    • H04L2101/60 Types of network addresses
    • H04L2101/69 Types of network addresses using geographic information, e.g. room number
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/50 Address allocation
    • H04L61/5053 Lease time; Renewal aspects

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method for generating web crawler camouflage data, comprising the following steps: selecting a useragent string from a browser user agent (useragent) string library, and acquiring a proxy IP from a proxy IP resource pool; accessing a target website using the useragent string and the proxy IP, acquiring cookie information according to a cookie acquisition policy, and storing the cookie information in a cookie resource pool; storing the cookie information, the useragent string, the proxy IP, a plurality of same-city proxy IPs, and the referrer information in association with one another, and packaging them into disguised data; and providing the disguised data to an external crawler program according to a scheduling policy. The invention also provides a system for generating web crawler camouflage data.

Description

Method and system for generating network crawler camouflage data
Technical Field
The invention relates to the field of network information capture, and in particular to a method and a system for generating web crawler camouflage data.
Background
The rapid development of the internet has led to massive amounts of user data accumulating online. Capturing this data and its internal correlations and analyzing it can produce valuable results, which is the premise and basis of public opinion analysis. However, search engines and portal sites, acting in their own commercial interest, do not welcome crawler access, and they develop policy programs that block crawlers while ensuring that access by normal human users is not hindered. This also causes legitimate data research, such as public opinion analysis and public data research, to fail.
When an internet enterprise blocks a web crawler program, the principle generally adopted is to block the crawler's access without affecting normal human users, while keeping the cost of the blocking measures within a reasonable range. Therefore, to capture network data and carry out legitimate data research such as public opinion analysis and public data investigation, one approach is to study how to make the network behavior and network traces of the crawler program the same as those of a human user; in short, how to make the crawler program collect data on the internet in a human-like way.
At present, the main ways of disguising a crawler program are: reducing the access frequency of each IP by switching proxy IPs; setting a time interval between access requests; or manually adding a cookie to the access request. Each of these simulates only one aspect of how a human user accesses the internet. If the blocking strategy of the website changes, the crawler code must be modified again, so these approaches lack lasting stability.
Disclosure of Invention
In view of the above defects, the invention provides a method and a system for generating web crawler camouflage data. They automatically generate web crawler camouflage data in batches and simulate, along multiple dimensions, how real human users access the internet, so that the access requests of the web crawler are not easily identified by a target website and legitimate data acquisition, such as public opinion analysis and public data investigation, can be carried out.
To solve the above technical problems, the invention adopts the following technical solution:
A method for generating web crawler camouflage data comprises the following steps:
Selecting a useragent string from a browser user agent (useragent) string library, and acquiring a proxy IP from a proxy IP resource pool;
Accessing a target website using the useragent string and the proxy IP, acquiring cookie information according to a cookie acquisition policy, and storing the cookie information in a cookie resource pool;
Storing the cookie information, the useragent string, the proxy IP, a plurality of same-city proxy IPs, and the referrer information in association with one another, and packaging them into disguised data;
And providing the disguised data to an external crawler program according to a scheduling policy.
Further, the useragent strings are classified and indexed according to data such as the operating system type, the rendering engine marker, and version information.
Further, the useragent string is filled into the user-agent field of the header of the HTTP access request.
Further, the proxy IP and the same-city proxy IPs are stored according to the geographic position of the IP, under a two-level city/county geographic index.
Further, the cookie acquisition policy comprises:
1) For a website that can be accessed only after logging in, access uses user identity authentication information such as a user name and a password; a website that can be accessed without logging in is accessed directly, and no user identity authentication information is included;
2) Cookie acquisition operation information, which comprises the operations performed to acquire the cookie, such as selecting from a drop-down box, clicking a button, and inputting text; the order of those operations; and the path position, on the HTML page, of the webpage element each operation involves. The path position can be expressed by an XPath selector, a CSS selector, or another webpage path expression;
3) A verification code (CAPTCHA) cracking strategy, which comprises an external code-solving platform, image OCR recognition, sliding verification code recognition, and the like.
Further, a browser automation program drives the browser program and loads the cookie acquisition policy, thereby acquiring the cookie for accessing the target website.
Further, the same-city proxy IPs are generated from the proxy IP.
Further, the referrer information is set together with the cookie acquisition policy and includes the home page URL of the target website.
Further, the scheduling policy comprises:
1) Cookies are called by an external program at a certain frequency within a certain period of time;
2) Cookies are called continuously by external programs at a certain frequency;
3) Cookies are called by external programs during fixed periods of time each day.
A system for generating web crawler camouflage data comprises:
A browser user agent (useragent) string library, used for storing useragent strings;
A proxy IP resource pool, used for providing and maintaining proxy IPs;
A cookie resource pool, used for storing and maintaining cookie information of the target website;
A cookie collector, used for automatically obtaining cookie information according to a cookie acquisition policy and storing it in the cookie resource pool; for cracking the verification code of the target website using a verification code cracking strategy; and for uniformly packaging the cookie information, the proxy IP, the useragent string, and the referrer information into disguised data for storage;
And an external acquisition interface, used for providing the disguised data to an external crawler program according to a scheduling policy.
By the design of the HTTP protocol, a network data request from a human user has certain characteristics. A cookie is user identity information that the web server stores on the client, and the web server can use the cookie information carried by an access request to check whether the request has legitimate authority. The referrer records the URL of the page from which the current page request was reached. The IP address shows which networked machine the request came from. The useragent records which browser software, and which version, sent the request.
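The request characteristics listed above map onto ordinary HTTP header fields. The following is a minimal sketch, not the patent's implementation, of assembling those fields from stored values. The function name `build_request_headers` and its parameters are hypothetical; the only protocol facts relied on are the standard header names `User-Agent`, `Cookie`, and `Referer`. The IP address is not a header; it is a property of the connection.

```python
from typing import Dict


def build_request_headers(
    user_agent: str, cookies: Dict[str, str], referrer: str
) -> Dict[str, str]:
    """Assemble the header fields that carry the characteristics described above.

    All names here are illustrative assumptions, not taken from the patent.
    """
    # A Cookie header carries name=value pairs separated by "; ".
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return {
        "User-Agent": user_agent,
        "Cookie": cookie_header,
        # The HTTP header is spelled "Referer" (single "r"), unlike the prose.
        "Referer": referrer,
    }


example_headers = build_request_headers(
    "ExampleBrowser/1.0", {"sid": "abc", "lang": "en"}, "https://example.com/"
)
```

The example values (`ExampleBrowser/1.0`, `example.com`) are placeholders chosen for illustration only.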
The invention integrates proxy IP resources and browser version information resources, stores website identity authentication data such as user names and passwords together with website login policy information, and constructs a set of human-like camouflage data for crawlers. The packaged camouflage data is provided to an external crawler system through an external interface. The crawler system can use the packaged data selectively to disguise a crawler acquisition request as an access request from a human user and so access the network normally. This keeps the coupling between the crawler program and the camouflage data generation system low, makes the method generally applicable, and reduces the likelihood of interception.
Drawings
FIG. 1 is a block diagram of a web crawler masquerading data generation system.
FIG. 2 is a flow chart of a method for generating Web crawler masquerading data.
Detailed Description
To make the above and other features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the figures.
The embodiment provides a method for generating web crawler camouflage data and a system for implementing the method. The system architecture is shown in FIG. 1, and the method is shown in FIG. 2.
(1) A plurality of browsers are installed on an existing server, and the system runs on top of these browsers.
(2) Collect as many browser user agent (useragent) strings as possible and store them in the system, building a useragent string library, which may be called the UA set. For the useragent strings in the set, build indexes classified by data such as the operating system type, the rendering engine marker, and version information, so that useragent strings meeting given requirements can be retrieved by these characteristics.
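One way to picture the indexed useragent library of step (2) is an in-memory inverted index keyed by attribute. This is a sketch under stated assumptions: the class names `UserAgentEntry` and `UserAgentLibrary`, the field names, and the exact-match lookup are all hypothetical, since the patent does not specify a storage layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class UserAgentEntry:
    ua_string: str
    os_type: str   # operating system type, e.g. "Windows"
    engine: str    # rendering engine marker, e.g. "Gecko"
    version: str   # browser version label


class UserAgentLibrary:
    """Hypothetical UA set with one inverted index per classified attribute."""

    def __init__(self) -> None:
        self._entries: List[UserAgentEntry] = []
        self._index: Dict[Tuple[str, str], Set[int]] = defaultdict(set)

    def add(self, entry: UserAgentEntry) -> None:
        position = len(self._entries)
        self._entries.append(entry)
        for attr in ("os_type", "engine", "version"):
            self._index[(attr, getattr(entry, attr))].add(position)

    def find(self, **criteria: str) -> List[UserAgentEntry]:
        """Return entries matching every given attribute, in insertion order."""
        positions: Optional[Set[int]] = None
        for attr, value in criteria.items():
            matches = self._index.get((attr, value), set())
            positions = set(matches) if positions is None else positions & matches
        if positions is None:  # no criteria given: return the whole set
            return list(self._entries)
        return [self._entries[i] for i in sorted(positions)]


library = UserAgentLibrary()
library.add(UserAgentEntry("ExampleUA-A", "Windows", "Gecko", "60"))
library.add(UserAgentEntry("ExampleUA-B", "Windows", "WebKit", "65"))
library.add(UserAgentEntry("ExampleUA-C", "Android", "WebKit", "65"))
windows_webkit = library.find(os_type="Windows", engine="WebKit")
```

The `ExampleUA-*` strings are placeholders, not real browser useragent strings.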
(3) Establish a proxy IP resource pool in the system. Each obtained proxy IP is stored according to its geographic position under a two-level geographic index, where the position is accurate to the city/county level, for example Huanggang, Hubei. Internally, the system can generate proxy IP addresses in the same city/county from a given IP.
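The two-level geographic index of step (3) can be sketched as a dictionary keyed by a pair of place names. This is an illustrative assumption rather than the patent's design: the class `ProxyPool`, the level names `region` and `city`, and the reading of "generate same-city proxy IPs" as a lookup of other pool members in the same bucket are all hypothetical. The addresses come from the reserved documentation ranges.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ProxyPool:
    """Hypothetical proxy pool indexed by a two-level geographic key."""

    def __init__(self) -> None:
        self._by_location: Dict[Tuple[str, str], List[str]] = defaultdict(list)
        self._location_of: Dict[str, Tuple[str, str]] = {}

    def add(self, ip: str, region: str, city: str) -> None:
        if ip in self._location_of:  # ignore duplicates
            return
        key = (region, city)
        self._by_location[key].append(ip)
        self._location_of[ip] = key

    def same_city(self, ip: str, limit: int = 3) -> List[str]:
        """Return up to `limit` other pool members filed under the same key."""
        key = self._location_of[ip]
        return [other for other in self._by_location[key] if other != ip][:limit]


pool = ProxyPool()
pool.add("192.0.2.10", "Hubei", "Huanggang")
pool.add("192.0.2.11", "Hubei", "Huanggang")
pool.add("198.51.100.7", "Hubei", "Wuhan")
neighbours = pool.same_city("192.0.2.10")
```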
(4) Cookie collection task encapsulation. The system acquires an IP from the proxy IP resource pool, acquires a useragent string from the UA set, and encapsulates a collection task according to the cookie acquisition policy. The encapsulated collection tasks are pushed into a task queue.
The cookie acquisition policy comprises:
i. User identity authentication information. For a website that requires login, website data can be accessed only after entering identity information such as a user name and password; for a website that does not require login, the identity authentication information may be omitted;
ii. Cookie acquisition operation information, which records the operations performed to acquire the cookie, such as selecting from a drop-down box, clicking a button, and inputting text; the order of those operations; and the path position, on the HTML page, of the webpage element each operation involves. The path position can be expressed by an XPath selector, a CSS selector, or another webpage path expression. The cookie acquisition operation information is an indispensable part of the cookie acquisition policy;
iii. A verification code (CAPTCHA) cracking strategy for webpages that require a verification code, covering simple picture verification codes, complex picture verification codes, and slider picture verification codes. The strategies comprise an external code-solving platform, picture OCR recognition, and sliding verification code recognition.
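The policy items and the task encapsulation of step (4) can be modelled as plain data records placed on a queue. The sketch below covers only the data shapes; it does not drive a browser or handle verification codes. All class, field, and function names (`OperationStep`, `CookiePolicy`, `CollectionTask`, `enqueue_task`) are hypothetical, and the strategy label is an opaque string.

```python
import queue
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class OperationStep:
    action: str                 # "select", "click" or "input"
    selector_type: str          # "xpath" or "css"
    selector: str               # path position of the page element
    value: Optional[str] = None  # text to input or option to select, if any


@dataclass
class CookiePolicy:
    steps: List[OperationStep]                       # ordered operations (item ii)
    credentials: Optional[Tuple[str, str]] = None    # item i; None if no login
    captcha_strategy: Optional[str] = None           # item iii; label only

    def __post_init__(self) -> None:
        # The text calls the operation information indispensable.
        if not self.steps:
            raise ValueError("a cookie policy needs at least one operation step")


@dataclass
class CollectionTask:
    target_url: str
    proxy_ip: str
    user_agent: str
    policy: CookiePolicy


def enqueue_task(
    task_queue: "queue.Queue[CollectionTask]",
    target_url: str,
    proxy_ip: str,
    user_agent: str,
    policy: CookiePolicy,
) -> CollectionTask:
    """Bundle the inputs into one task and push it onto the queue."""
    task = CollectionTask(target_url, proxy_ip, user_agent, policy)
    task_queue.put(task)
    return task


tasks: "queue.Queue[CollectionTask]" = queue.Queue()
policy = CookiePolicy(steps=[OperationStep("click", "css", "#enter")])
queued = enqueue_task(tasks, "https://example.com/", "192.0.2.10", "ExampleUA-A", policy)
```

Keeping the policy as data rather than code is one way to get the low coupling the description aims for: a changed site flow means a new policy record, not a modified collector.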
(5) Establish a cookie collection subsystem, which is the core part of the system. It takes tasks from the task queue and collects cookie information for the target website; this cookie information is an important component of the web crawler camouflage data.
(6) A plurality of browser programs, such as PhantomJS, Chrome, and Firefox, are installed on the server. Through the browser driver module, the cookie collector loads a cookie acquisition policy, drives a browser program, and acquires a cookie for accessing the target website according to that policy, thereby obtaining the cookie of one target website.
(7) While obtaining the target website's cookie, the system selects a useragent string from its internal browser useragent string library as the useragent of the current access request and fills it into the user-agent field of the header of the HTTP access request. The useragent string and the returned cookie are stored together in association.
(8) While obtaining the target website's cookie, the system acquires an IP from the proxy IP resource pool. After the cookie information has been obtained through this IP, the proxy IP is stored in association with the cookie. In addition, several same-city proxy IPs of this IP are associated and stored with it. This simulates the same human user's device accessing the internet from different network environments, for example a mobile device or laptop connecting to different WiFi networks, or a laptop connecting by network cable in different network environments.
(9) When cookie information is issued externally, the cookie, the useragent string, the proxy IP, and the same-city proxy IPs are issued together. This multi-dimensional crawler camouflage information is richer and more detailed.
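The associated storage in steps (7) to (9) amounts to keeping the related values in one record and serializing that record for issue. The following sketch assumes a JSON encoding and the names `DisguiseRecord`, `package`, and `unpack`, none of which appear in the patent; the referrer field follows the method summary, which lists referrer information among the packaged items.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class DisguiseRecord:
    cookies: Dict[str, str]
    user_agent: str
    proxy_ip: str
    same_city_proxies: List[str] = field(default_factory=list)
    referrer: str = ""


def package(record: DisguiseRecord) -> str:
    """Serialize one record so all associated values travel together."""
    return json.dumps(asdict(record), sort_keys=True)


def unpack(payload: str) -> DisguiseRecord:
    """Rebuild the record on the consumer side."""
    return DisguiseRecord(**json.loads(payload))


record = DisguiseRecord(
    cookies={"sid": "abc"},
    user_agent="ExampleUA-A",
    proxy_ip="192.0.2.10",
    same_city_proxies=["192.0.2.11"],
    referrer="https://example.com/",
)
payload = package(record)
```

Because the record is self-describing, an external crawler can pick only the fields it needs, which matches the selective use the description mentions.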
(10) Cookie scheduling: to prevent any single cookie from being used too frequently, when the system issues cookie disguised data externally, the cookie data can be scheduled in rotation according to different scheduling policies for use by external programs.
The scheduling policy comprises:
i. Cookies may be called by an external program at a certain frequency within a certain period of time (e.g., a particular two-hour window).
ii. Cookies may be called continuously by external programs at a certain frequency.
iii. Cookies may be called by external programs during fixed periods of time each day.
Different scheduling policies simulate the different internet usage habits of different human users, which reduces blocking of the crawler program by target websites.
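The three scheduling policies can be expressed as one check that combines a minimum interval with an optional one-off window and optional daily periods. This is a sketch under assumptions: `SchedulePolicy` and `may_issue` are hypothetical names, "frequency" is modelled as a minimum gap between issues, and daily periods that cross midnight are not handled.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import List, Optional, Tuple


@dataclass
class SchedulePolicy:
    min_interval: timedelta                                   # the "frequency"
    window: Optional[Tuple[datetime, datetime]] = None        # policy i
    daily_periods: Optional[List[Tuple[time, time]]] = None   # policy iii
    # Policy ii (continuous) is the case where both optional fields are None.


def may_issue(
    policy: SchedulePolicy, now: datetime, last_issued: Optional[datetime]
) -> bool:
    """Decide whether a cookie may be handed out at `now`."""
    if policy.window is not None:
        start, end = policy.window
        if not (start <= now < end):
            return False
    if policy.daily_periods is not None:
        # Assumes each period starts and ends on the same day.
        if not any(start <= now.time() < end for start, end in policy.daily_periods):
            return False
    if last_issued is not None and now - last_issued < policy.min_interval:
        return False
    return True


nine = datetime(2018, 3, 29, 9, 0)
continuous = SchedulePolicy(min_interval=timedelta(minutes=10))
mornings = SchedulePolicy(
    min_interval=timedelta(0), daily_periods=[(time(8, 0), time(10, 0))]
)
two_hours = SchedulePolicy(
    min_interval=timedelta(0),
    window=(datetime(2018, 3, 29, 8, 0), datetime(2018, 3, 29, 10, 0)),
)
```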
In addition, the validity of the acquired crawler camouflage data is verified periodically by making disguised requests; invalid data is eliminated, and new camouflage data is regenerated according to the policy.
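The verify, eliminate, and regenerate cycle described above can be sketched as a single pass over the stored records. The function `refresh_pool` and its callable parameters are hypothetical; how validity is actually checked and how a record is regenerated are left to the caller, since the patent does not detail them.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def refresh_pool(
    records: List[T],
    is_valid: Callable[[T], bool],
    regenerate: Callable[[T], T],
) -> List[T]:
    """Keep valid records and replace each invalid one with a regenerated record."""
    return [record if is_valid(record) else regenerate(record) for record in records]


# Toy stand-ins: a record is "invalid" when its label starts with "stale-".
refreshed = refresh_pool(
    ["a", "stale-b", "c"],
    is_valid=lambda label: not label.startswith("stale-"),
    regenerate=lambda label: label.replace("stale-", "new-", 1),
)
```

A caller would run this on a timer; passing the checks in as callables keeps the pass itself independent of any particular website.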
By adopting this method, multi-dimensional web crawler identity data is generated, comprising the cookie, the IP, the access frequency, client browser version information, the source page of the network request, and other information. This information is packaged and issued externally, so that an external crawler program's requests appear as access requests from a human user while it captures network data, thereby countering the blocking strategies that internet back-end services apply to crawler access requests.
The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. A person skilled in the art can modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention, and the scope of protection shall be determined by the claims.

Claims (10)

1. A method for generating web crawler camouflage data, comprising the following steps:
selecting a useragent string from a browser user agent (useragent) string library, and acquiring a proxy IP from a proxy IP resource pool;
accessing a target website using the useragent string and the proxy IP, acquiring cookie information according to a cookie acquisition policy, and storing the cookie information in a cookie resource pool;
storing the cookie information, the useragent string, the proxy IP, a plurality of same-city proxy IPs, and the referrer information in association with one another, and packaging them into disguised data;
and providing the disguised data to an external crawler program according to a scheduling policy.
2. The method of claim 1, wherein the useragent strings are classified and indexed according to operating system type, rendering engine marker, and version information data.
3. The method of claim 1, wherein the useragent string is filled into the user-agent field of the header of an HTTP access request.
4. The method of claim 1, wherein the proxy IP and the same-city proxy IPs are indexed by a two-level city/county geographic location according to the geographic location of the IP.
5. The method of claim 1, wherein the cookie acquisition policy comprises:
1) for a website that can be accessed only after logging in, accessing it using user identity authentication information comprising a user name and a password; a website that can be accessed without logging in is accessed directly;
2) cookie acquisition operation information, comprising the operations performed to acquire the cookie, namely selecting from a drop-down box, clicking a button, and inputting text; the order of those operations; and the path position, on the HTML page, of the webpage element each operation involves, wherein the path position can be expressed by an XPath selector, a CSS selector, or another webpage path expression;
3) a verification code (CAPTCHA) cracking strategy, comprising an external code-solving platform, image OCR recognition, and sliding verification code recognition.
6. The method of claim 1, wherein the cookie for accessing the target website is obtained by using a browser automation program to drive the browser program and load the cookie acquisition policy.
7. The method of claim 1, wherein the same-city proxy IPs are generated from the proxy IP.
8. The method of claim 1, wherein the referrer information is set together with the cookie acquisition policy and includes a home page URL of the target website.
9. The method of claim 1, wherein the scheduling policy comprises:
1) cookies are called by an external program at a certain frequency within a certain period of time;
2) cookies are called continuously by external programs at a certain frequency;
3) cookies are called by external programs during fixed periods of time each day.
10. A system for generating web crawler camouflage data, comprising:
a browser user agent (useragent) string library, used for storing useragent strings;
a proxy IP resource pool, used for providing and maintaining proxy IPs;
a cookie resource pool, used for storing and maintaining cookie information of the target website;
a cookie collector, used for automatically obtaining cookie information according to a cookie acquisition policy and storing it in the cookie resource pool; for cracking the verification code of the target website using a verification code cracking strategy; and for uniformly packaging the cookie information, the proxy IP, the useragent string, and the referrer information into disguised data for storage;
and an external acquisition interface, used for providing the disguised data to an external crawler program according to a scheduling policy.
CN201810270472.7A 2018-03-29 2018-03-29 method and system for generating network crawler camouflage data Pending CN110555146A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810270472.7A CN110555146A (en) 2018-03-29 2018-03-29 method and system for generating network crawler camouflage data

Publications (1)

Publication Number Publication Date
CN110555146A true CN110555146A (en) 2019-12-10

Family

ID=68733637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810270472.7A Pending CN110555146A (en) 2018-03-29 2018-03-29 method and system for generating network crawler camouflage data

Country Status (1)

Country Link
CN (1) CN110555146A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111711617A (en) * 2020-05-29 2020-09-25 北京金山云网络技术有限公司 Method and device for detecting web crawler, electronic equipment and storage medium
CN111865977A (en) * 2020-07-20 2020-10-30 北京丁牛科技有限公司 Information processing method and system
CN112528120A (en) * 2020-12-21 2021-03-19 北京中安智达科技有限公司 Method for web data crawler to use browser to divide body and proxy
CN113297449A (en) * 2021-05-21 2021-08-24 南京大学 Method and system for realizing streaming crawler
CN113505287A (en) * 2021-06-24 2021-10-15 微梦创科网络科技(中国)有限公司 Website link detection method and system
CN114168831A (en) * 2021-11-04 2022-03-11 无锡知产云信息技术有限公司 Network data acquisition method and device, terminal and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040049673A1 (en) * 2002-09-05 2004-03-11 Docomo Communications Laboratories Usa, Inc. Apparatus and method for a personal cookie repository service for cookie management among multiple devices
CN105320740A (en) * 2015-09-22 2016-02-10 清华大学 WeChat article and official account acquisition method and acquisition system
CN106897357A (en) * 2017-01-04 2017-06-27 北京京拍档科技股份有限公司 A kind of method for crawling the network information for band checking distributed intelligence
CN107395782A (en) * 2017-07-19 2017-11-24 北京理工大学 A kind of IP limitation controlled source information extraction methods based on agent pool

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘阳: "基于网络爬虫的CMS识别系统的研究与应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191210