CN110555146A

CN110555146A - Method and system for generating web crawler camouflage data

Info

Publication number: CN110555146A
Application number: CN201810270472.7A
Authority: CN
Inventors: 余姚霖; 齐保元; 周美林; 王斌; 李鹏
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2018-03-29
Filing date: 2018-03-29
Publication date: 2019-12-10

Abstract

The invention provides a method for generating web crawler camouflage data, comprising the following steps: selecting a useragent string from a browser user agent (useragent) string library, and acquiring a proxy IP from a proxy IP resource pool; accessing a target website using the useragent string and the proxy IP, acquiring cookie information according to a cookie acquisition strategy, and storing the cookie information in a cookie resource pool; storing the cookie information, the useragent string, the proxy IP, a plurality of same-city proxy IPs and the referrer information in association with one another, and packaging them into camouflage data; and providing the camouflage data to an external crawler program according to a scheduling strategy. The invention also provides a system for generating web crawler camouflage data.

Description

Method and system for generating web crawler camouflage data

Technical Field

The invention relates to the field of network information capture, in particular to a method and a system for generating network crawler camouflage data.

Background

The rapid development of the internet has caused massive amounts of user data to accumulate online. Capturing this data together with its internal correlations and analyzing it can produce valuable results, and this is the premise and basis of public opinion analysis. However, search engines and portal sites, acting in their own commercial interest, do not welcome crawler access. They deploy policies that block crawlers while ensuring that access by normal natural users is not hindered, which also causes legitimate data research, such as public opinion analysis and public data surveys, to fail.

When an internet enterprise blocks web crawlers, the principle generally adopted is to block crawler access without affecting normal natural users, while keeping the cost of the blocking measures within a reasonable range. Therefore, to capture network data and carry out legitimate data research such as public opinion analysis and public data surveys, one approach is to study how to make the network behavior and network traces of a crawler program the same as those of a natural person; in short, how to make a crawler program collect data on the internet in an anthropomorphic way.

At present, the main ways of disguising a crawler program are: reducing the access frequency of each IP by switching proxy IPs; setting a time interval between access requests; or manually adding a cookie to the access request. Each of these simulates only one aspect of how natural users access the internet. If the blocking strategy of a website changes, the crawler code must be modified again, so these approaches lack lasting stability.

Disclosure of Invention

In view of the above defects, the invention provides a method and a system for generating web crawler camouflage data. Camouflage data is generated automatically and in batches, and simulates real natural users accessing the internet across multiple dimensions, so that the access requests of a web crawler are not easily identified by a target website and legitimate data acquisition, such as public opinion analysis and public data surveys, can proceed.

To solve the above technical problems, the invention adopts the following technical solution:

A method for generating network crawler camouflage data comprises the following steps:

Selecting a useragent string from a browser user agent (useragent) string library, and acquiring a proxy IP from a proxy IP resource pool;

Accessing a target website using the useragent string and the proxy IP, acquiring cookie information according to a cookie acquisition strategy, and storing the cookie information in a cookie resource pool;

Storing the cookie information, the useragent string, the proxy IP, a plurality of same-city proxy IPs and the referrer information in association with one another, and packaging them into camouflage data;

And providing the camouflage data to an external crawler program according to a scheduling strategy.

Further, the useragent strings are classified and indexed according to data such as operating system type, rendering engine identifier and version information.

Further, the useragent string is filled into the user-agent field of the header of the HTTP access request.

Further, the proxy IP and the same-city proxy IPs are stored by IP geographic location under a two-level geographic index (city and county).

Further, the cookie acquisition strategy includes:

1) For a website that can be accessed only after logging in, access uses user identity authentication information such as a user name and password; a website that can be accessed without logging in is accessed directly, without user identity authentication information;

2) Cookie acquisition operation information, comprising the operations needed to acquire the cookie, such as selecting from a drop-down box, clicking a button and entering text; the order of those operations; and the path position on the HTML page of each web page element involved, where the path position can be expressed with an XPath selector, a CSS selector or another web page path expression;

3) A verification code (CAPTCHA) cracking strategy, including an external captcha-solving platform, image OCR recognition, sliding verification code recognition and the like.

Further, a browser automation program drives the browser program and loads the cookie acquisition strategy to acquire the cookie for accessing the target website.

Further, the same-city proxy IPs are generated from the proxy IP.

Further, the referrer information is set together with the cookie acquisition strategy and includes the home page URL of the target website.

Further, the scheduling strategy includes:

1) Cookies are called by external programs at a certain frequency within a certain period of time;

2) Cookies are continuously available to external programs at a certain frequency;

3) Cookies are called by external programs at fixed periods of time each day.

A system for generating web crawler camouflage data, comprising:

A browser user agent (useragent) string library, used for storing useragent strings;

A proxy IP resource pool, used for providing and maintaining proxy IPs;

A cookie resource pool, used for storing and maintaining cookie information of the target website;

A cookie collector, used for automatically acquiring cookie information according to a cookie acquisition strategy and storing it in the cookie resource pool; cracking the verification code of the target website using a verification code cracking strategy; and packaging the cookie information, the proxy IP, the useragent string and the referrer information together into camouflage data for storage;

And an external acquisition interface, used for providing the camouflage data to an external crawler program according to a scheduling strategy.

By the design of the HTTP protocol, a network data request from a natural person user has certain characteristics. The cookie is user identity information that the web server stores on the client, and the web server can check from the cookie information carried by an access request whether the request has legitimate authority. The referrer records the URL of the page from which the current page request was reached. The IP address shows which network machine the request came from. The useragent records which version of which browser software sent the request.

The invention integrates proxy IP resources and browser version information resources, stores website identity authentication data such as user names and passwords together with website login strategy information, and from these constructs a set of anthropomorphic crawler camouflage data. The packaged camouflage data is delivered to external crawler systems through an external interface. A crawler system can selectively use parts of the packaged data to disguise its collection requests as access requests from a natural person user and access the network normally. Because the crawler program and the camouflage data generation system are loosely coupled, the method is generally applicable and reduces the likelihood of interception.

Drawings

FIG. 1 is a block diagram of a web crawler masquerading data generation system.

FIG. 2 is a flow chart of a method for generating Web crawler masquerading data.

Detailed Description

In order to make the above and other features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the figures.

The embodiment provides a method for generating web crawler camouflage data and a system for implementing the method. The system architecture is shown in FIG. 1, and the method is shown in FIG. 2.

(1) A plurality of browsers are installed on the existing server for use, and the system runs based on the browsers.

(2) Collect as many browser user agent (useragent) strings as possible and store them in the system, building a useragent string library, which may be called the UA user agent set. The useragent strings in the set are classified and indexed by data such as operating system type, rendering engine identifier and version information, so that useragent strings meeting given requirements can be retrieved by their characteristics.
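The indexed library described in step (2) can be sketched as follows. This is a minimal illustration only: the names `UserAgentEntry`, `UserAgentLibrary`, `add` and `find` are hypothetical, and the patent does not specify the index structure, so simple in-memory dictionaries are assumed here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UserAgentEntry:
    """One useragent string with the attributes used for indexing."""
    ua_string: str
    os_type: str   # operating system type, e.g. "Windows"
    engine: str    # rendering engine identifier, e.g. "Gecko"
    version: str   # browser version information


class UserAgentLibrary:
    """Hypothetical in-memory UA set indexed by OS type and engine."""

    def __init__(self) -> None:
        self._entries: List[UserAgentEntry] = []
        self._by_os = defaultdict(list)
        self._by_engine = defaultdict(list)

    def add(self, entry: UserAgentEntry) -> None:
        self._entries.append(entry)
        self._by_os[entry.os_type].append(entry)
        self._by_engine[entry.engine].append(entry)

    def find(self, os_type: Optional[str] = None,
             engine: Optional[str] = None,
             version_prefix: Optional[str] = None) -> List[UserAgentEntry]:
        # Start from the narrowest available index, then filter the rest.
        if os_type is not None:
            candidates = self._by_os.get(os_type, [])
        elif engine is not None:
            candidates = self._by_engine.get(engine, [])
        else:
            candidates = self._entries
        return [
            e for e in candidates
            if (engine is None or e.engine == engine)
            and (version_prefix is None or e.version.startswith(version_prefix))
        ]
```

A caller would pick one entry from the list returned by `find`, for example at random, to use as the useragent of a collection task.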

(3) Establish a proxy IP resource pool in the system. Each proxy IP obtained is stored by its geographic location under a two-level geographic index, where the location is accurate to the city/county level, for example Huanggang, Hubei. From a given IP, the system can internally generate proxy IP addresses in the same city/county.
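A minimal sketch of the two-level geographic index in step (3) is given below. The class `ProxyPool` and its methods are hypothetical names. The patent does not say how same-city proxy IPs are generated, so this sketch assumes they are simply the other pool entries stored under the same two index levels. The addresses in the usage note come from the reserved documentation ranges.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ProxyPool:
    """Hypothetical proxy pool with a two-level geographic index."""

    def __init__(self) -> None:
        # level1 -> level2 -> list of proxy IPs, e.g. "Hubei" -> "Huanggang"
        self._index: Dict[str, Dict[str, List[str]]] = defaultdict(
            lambda: defaultdict(list)
        )
        # Reverse lookup from IP to its (level1, level2) location.
        self._location: Dict[str, Tuple[str, str]] = {}

    def add(self, ip: str, level1: str, level2: str) -> None:
        self._index[level1][level2].append(ip)
        self._location[ip] = (level1, level2)

    def same_area(self, ip: str, limit: int = 3) -> List[str]:
        """Return up to `limit` other proxies stored under the same location."""
        level1, level2 = self._location[ip]
        return [p for p in self._index[level1][level2] if p != ip][:limit]
```

For example, after adding `192.0.2.10` and `192.0.2.11` under `("Hubei", "Huanggang")`, calling `same_area("192.0.2.10")` yields the second address as a same-city candidate.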

(4) Cookie collection task encapsulation: the system acquires an IP from the proxy IP resource pool, acquires a useragent string from the UA user agent set, and encapsulates a collection task according to the cookie acquisition strategy. The encapsulated collection task is pushed into a task queue.

The cookie acquisition strategy comprises:

i. User identity authentication information: for a website requiring login, website data can be accessed only after identity information such as a user name and password is entered; for a website not requiring login, the identity authentication information need not be included;

ii. Cookie acquisition operation information: records the operations needed to acquire the cookie, such as selecting from a drop-down box, clicking a button and entering text; the order of those operations; and the path position on the HTML page of each web page element involved, where the path position can be expressed with an XPath selector, a CSS selector or another web page path expression. The cookie acquisition operation information is an indispensable part of the cookie acquisition strategy;

iii. Verification code cracking strategy: for web pages that require a verification code, strategies targeting simple picture verification codes, complex picture verification codes and slider picture verification codes, including an external captcha-solving platform, picture OCR recognition and sliding verification code recognition.
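The task encapsulation of step (4) and the three parts of the cookie acquisition strategy can be sketched as plain data structures. All names here (`Operation`, `CookiePolicy`, `CollectionTask`, `encapsulate_task`) and the field layout are assumptions made for illustration; the patent describes the contents of the strategy but not a concrete format.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Dict, List, Optional


@dataclass
class Operation:
    """One recorded step: what to do and where on the HTML page."""
    action: str                  # "select", "click" or "input"
    selector: str                # XPath or CSS selector of the element
    value: Optional[str] = None  # option to select or text to enter


@dataclass
class CookiePolicy:
    target_url: str
    operations: List[Operation]                   # ordered, mandatory
    credentials: Optional[Dict[str, str]] = None  # only for login sites
    captcha_strategy: Optional[str] = None        # "platform", "ocr", "slider"


@dataclass
class CollectionTask:
    proxy_ip: str
    useragent: str
    policy: CookiePolicy


def encapsulate_task(proxy_ip: str, useragent: str,
                     policy: CookiePolicy, task_queue: Queue) -> CollectionTask:
    """Bundle a proxy IP, a useragent string and a policy, then enqueue it."""
    # The operation information is the indispensable part of the policy.
    if not policy.operations:
        raise ValueError("cookie acquisition policy has no operations")
    task = CollectionTask(proxy_ip=proxy_ip, useragent=useragent, policy=policy)
    task_queue.put(task)
    return task
```

Keeping `credentials` and `captcha_strategy` optional mirrors the text: only the operation information is required for every target website.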

(5) Establish a cookie collection subsystem, which is the core part of the system. It takes tasks from the task queue and collects cookie information from the target website; the cookie information is an important component of the web crawler camouflage data.

(6) A plurality of browser programs, such as PhantomJS, Chrome and Firefox, are installed on the server. Through the browser driving module, the cookie collector loads a cookie acquisition strategy, drives a browser program, and acquires a cookie for accessing the target website according to that strategy, thereby obtaining the cookie of one target website.
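The replay of recorded operations in step (6) might look like the sketch below. The `BrowserDriver` interface is hypothetical and stands in for whatever browser automation tool the browser driving module wraps; `RecordingDriver` is a stand-in that only logs calls, so the example runs without a real browser.

```python
from typing import Dict, List, Optional, Protocol, Tuple

# (action, selector, value) in the order the steps must be performed.
Step = Tuple[str, str, Optional[str]]


class BrowserDriver(Protocol):
    """Hypothetical minimal interface of the browser driving module."""
    def open(self, url: str) -> None: ...
    def select(self, selector: str, value: str) -> None: ...
    def click(self, selector: str) -> None: ...
    def input_text(self, selector: str, value: str) -> None: ...
    def cookies(self) -> Dict[str, str]: ...


def collect_cookie(driver: BrowserDriver, url: str,
                   steps: List[Step]) -> Dict[str, str]:
    """Open the target page, replay the steps in order, return the cookies."""
    driver.open(url)
    for action, selector, value in steps:
        if action == "select":
            driver.select(selector, value or "")
        elif action == "click":
            driver.click(selector)
        elif action == "input":
            driver.input_text(selector, value or "")
        else:
            raise ValueError(f"unknown action: {action}")
    return driver.cookies()


class RecordingDriver:
    """Stand-in driver that logs calls instead of driving a browser."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, ...]] = []

    def open(self, url: str) -> None:
        self.log.append(("open", url))

    def select(self, selector: str, value: str) -> None:
        self.log.append(("select", selector, value))

    def click(self, selector: str) -> None:
        self.log.append(("click", selector))

    def input_text(self, selector: str, value: str) -> None:
        self.log.append(("input", selector, value))

    def cookies(self) -> Dict[str, str]:
        return {"session": "demo"}
```

Separating the replay loop from the driver keeps the acquisition strategy independent of which installed browser is used.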

(7) While obtaining the target website cookie, the system selects a useragent string from the internal browser useragent string library as the useragent of the current access request, fills it into the user-agent field of the header of the HTTP access request, and stores it in association with the returned cookie.
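Filling the header fields described in step (7) can be illustrated with the Python standard library. The helper name `build_request` and its parameters are assumptions; the sketch only constructs the request object and sends nothing. Note that the HTTP header carrying the referrer is spelled `Referer`.

```python
from typing import Dict, Optional
from urllib.request import Request


def build_request(url: str, useragent: str,
                  referrer: Optional[str] = None,
                  cookies: Optional[Dict[str, str]] = None) -> Request:
    """Build an HTTP request carrying the chosen camouflage fields."""
    headers = {"User-Agent": useragent}
    if referrer is not None:
        headers["Referer"] = referrer
    if cookies:
        # Cookie pairs are joined as "name=value; name=value".
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return Request(url, headers=headers)
```

`urllib.request.Request` normalizes header names with `str.capitalize()`, so the useragent set here is read back with `req.get_header("User-agent")`.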

(8) While obtaining the target website cookie, an IP is acquired from the proxy IP resource pool; after cookie information is obtained through this IP, the proxy IP is stored in association with the obtained cookie. In addition, several same-city proxy IPs of this IP are associated and stored with it. This simulates the internet access devices of one natural person user connecting under different network environments, for example a mobile device or laptop joining different WiFi networks, or a laptop plugged into network cables in different network environments.

(9) When cookie information is issued externally, the cookie, the useragent, the proxy IP and the same-city proxy IPs are issued together. This multi-dimensional crawler camouflage information is richer and more detailed.
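The associated storage and joint release in steps (7) to (9) amount to one record per cookie. The sketch below assumes a JSON serialization and hypothetical names (`CamouflageRecord`, `package`, `unpack`); the patent does not prescribe a wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class CamouflageRecord:
    """All fields that are stored in association and issued together."""
    cookie: Dict[str, str]
    useragent: str
    proxy_ip: str
    same_city_proxy_ips: List[str]
    referrer: str


def package(record: CamouflageRecord) -> str:
    """Serialize one record for delivery to an external crawler program."""
    return json.dumps(asdict(record), sort_keys=True)


def unpack(payload: str) -> CamouflageRecord:
    """Inverse of package(), as an external crawler might use it."""
    return CamouflageRecord(**json.loads(payload))
```

An external crawler can then use whichever fields it needs, which matches the loose coupling described earlier in the document.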

(10) Cookie scheduling: to prevent any single cookie from being used for collection too frequently, when the system issues cookie camouflage data externally, the cookie data can be scheduled in rotation under different scheduling strategies for use by external programs.

The scheduling strategy comprises:

i. Cookies may be called by an external program at a certain frequency within a certain period of time (e.g., a certain two hours).

ii. Cookies remain continuously available for external program calls at a certain frequency.

iii. Cookies may be called by external programs at fixed periods of time each day.

Different scheduling strategies simulate the different internet usage habits of natural users, reducing blocking of crawler programs by target websites.
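The three scheduling strategies can be expressed as one check on whether a cookie may be issued at a given moment. This is a sketch under assumptions: `SchedulePolicy` and `may_issue` are hypothetical names, and "a certain frequency" is modeled as a minimum interval between two releases of the same cookie.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import List, Optional, Tuple


@dataclass
class SchedulePolicy:
    min_interval: timedelta  # models "a certain frequency"
    # Strategy i: one absolute time window, e.g. a particular two hours.
    window: Optional[Tuple[datetime, datetime]] = None
    # Strategy iii: fixed periods that repeat every day.
    daily_periods: Optional[List[Tuple[time, time]]] = None
    # Strategy ii is the case where both window and daily_periods are None.


def may_issue(policy: SchedulePolicy, now: datetime,
              last_issued: Optional[datetime] = None) -> bool:
    """Return True if the cookie may be handed out at `now`."""
    if last_issued is not None and now - last_issued < policy.min_interval:
        return False
    if policy.window is not None:
        start, end = policy.window
        if not (start <= now <= end):
            return False
    if policy.daily_periods is not None:
        t = now.time()
        if not any(start <= t <= end for start, end in policy.daily_periods):
            return False
    return True
```

Rotating over several cookies and issuing the first one for which `may_issue` returns `True` gives each cookie its own usage rhythm.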

In addition, the validity of the acquired crawler camouflage data is verified regularly; invalid data is eliminated and new camouflage data is regenerated according to the strategy.
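The periodic validity check can be sketched as a pass over the stored records. How validity is tested is not specified in the text, so the check and the regeneration step are passed in as callables; `refresh_pool`, `is_valid` and `regenerate` are hypothetical names.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def refresh_pool(records: List[T],
                 is_valid: Callable[[T], bool],
                 regenerate: Callable[[T], T]) -> List[T]:
    """Keep valid records and replace each invalid one with a fresh record."""
    return [r if is_valid(r) else regenerate(r) for r in records]
```

In a running system this function would be invoked on a timer, with `regenerate` re-queuing a cookie collection task for the affected target website.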

The above method generates multi-dimensional web crawler identity data, including the cookie, IP, access frequency, client browser version information, the source page of the network request and other information. This information is packaged and issued externally, so that an external crawler program presents its requests as those of a natural user while capturing network data, thereby countering the blocking strategies that website back-end services apply to crawler access requests.

The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. A person skilled in the art can modify the technical solution or substitute equivalents without departing from the spirit and scope of the present invention, and the scope of protection should be determined by the claims.

Claims

1. A method for generating network crawler camouflage data comprises the following steps:

2. The method of claim 1, wherein the useragent strings are classified and indexed according to operating system type, rendering engine identifier and version information.

3. The method of claim 1, wherein the useragent string is filled into the user-agent field of the header of the HTTP access request.

4. The method of claim 1, wherein the proxy IP and the same-city proxy IPs are stored by IP geographic location under a two-level city/county geographic index.

5. The method of claim 1, wherein the cookie acquisition strategy comprises:

1) For a website that can be accessed only after logging in, access uses user identity authentication information comprising a user name and a password; a website that can be accessed without logging in is accessed directly;

2) Cookie acquisition operation information, comprising the operations of selecting from a drop-down box, clicking a button and entering text to acquire the cookie, the order of those operations, and the path position on the HTML page of each web page element involved, where the path position can be expressed with an XPath selector, a CSS selector or another web page path expression;

3) A verification code cracking strategy, comprising an external captcha-solving platform, image OCR recognition and sliding verification code recognition.

6. The method of claim 1, wherein the cookie for accessing the target website is acquired by using a browser automation program to drive a browser program and load the cookie acquisition strategy.

7. The method of claim 1, wherein the same-city proxy IPs are generated from the proxy IP.

8. The method of claim 1, wherein the referrer information is set together with the cookie acquisition strategy and includes the home page URL of the target website.

9. The method of claim 1, wherein the scheduling policy comprises:

2) cookies are continuously called by external programs at a certain frequency;

3) cookies are called by external programs at fixed periods of time each day.

10. A system for generating web crawler camouflage data, comprising:

The proxy IP resource pool is used for providing and maintaining a proxy IP;