CN110781367B

CN110781367B - Man-in-the-middle-based Internet data acquisition method and system

Info

Publication number: CN110781367B
Application number: CN201910909270.7A
Authority: CN
Inventors: 程学旗; 史存会; 胡耀康; 朱运昌; 俞晓明; 刘悦
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2019-09-25
Filing date: 2019-09-25
Publication date: 2023-10-20
Anticipated expiration: 2039-09-25
Also published as: CN110781367A

Abstract

The invention provides a man-in-the-middle-based Internet data acquisition method and system. A man-in-the-middle proxy certificate is installed on a webpage information acquisition device to establish a man-in-the-middle for that device, so that when the device accesses webpage information on the Internet, the man-in-the-middle proxies all of the device's network traffic. The man-in-the-middle obtains an acquisition task containing a URL regular expression for the webpages to be acquired, captures the traffic among all network traffic that matches the URL regular expression as intermediate traffic, injects the acquisition task into the HTML page of the intermediate traffic to obtain a page to be parsed, and stores the page to be parsed in a first database. A parsing module distributes the page to be parsed to a parser instance according to the page's URL information in the first database, obtains from the parsing a webpage acquisition result containing structured data, and stores the result in a second database. The invention can support data acquisition for all applications that rely on an integrated browser kernel to provide information.

Description

Man-in-the-middle-based Internet data acquisition method and system

Technical Field

The invention relates to the field of web crawlers, and in particular to a data acquisition method and system based on man-in-the-middle attack, in which a man-in-the-middle proxy modifies traffic data in order to continuously inject different task code into different applications, so that requests for different pages are completed and the related data are acquired.

Background

A web crawler, sometimes called a web spider, can make effective use of existing resources to automatically capture large amounts of webpage information from the Internet. However, with the spread of the mobile Internet, more and more traffic is delivered directly through various terminal applications that offer no web access or only limited web access, which makes data acquisition considerably harder.

A web crawler's crawling process comprises five stages: obtaining the request URL, sending a web request to download the page, parsing structured data from the page, filtering duplicate data, and processing seed tasks. Each stage consumes resources differently, and a problem in any one of them can affect the efficiency and stability of the whole crawler system. In addition, as Internet technology develops, more and more information cannot be obtained through traditional web channels; a large amount of information is distributed through specific applications, such as mobile information applications, and much of the data is requested asynchronously and encrypted with HTTPS. A general data acquisition system compatible with a wide range of applications and data types is therefore lacking.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a man-in-the-middle-based Internet data acquisition method, comprising the following steps:

step 1, installing a man-in-the-middle proxy certificate on a webpage information acquisition device to establish a man-in-the-middle for the device, wherein the man-in-the-middle proxies all network traffic of the device when the device accesses webpage information on the Internet;

step 2, the man-in-the-middle obtains an acquisition task containing a URL regular expression for the webpages to be acquired, captures the traffic among all network traffic that matches the URL regular expression as intermediate traffic, injects the acquisition task into the HTML page of the intermediate traffic to obtain a page to be parsed, and stores the page to be parsed in a first database;

step 3, a parsing module distributes the page to be parsed to a parser instance for parsing according to the URL information of the page to be parsed in the first database, obtains from the parsing a webpage acquisition result containing structured data, and stores the webpage acquisition result in a second database.

In the man-in-the-middle-based Internet data acquisition method, step 2 comprises: the man-in-the-middle decrypts the encrypted content in the network traffic according to the HTTPS security certificate configured on the webpage information acquisition device.

In the man-in-the-middle-based Internet data acquisition method, the acquisition task in step 2 is generated as follows: the acquisition task is generated from pre-configured seed information, or a new acquisition task is generated from a webpage acquisition result that has already been obtained.

In the man-in-the-middle-based Internet data acquisition method, step 2 comprises: intercepting some HTTP/HTTPS requests according to the configured URL regular expression and returning empty content, so as to improve acquisition efficiency.

In the man-in-the-middle-based Internet data acquisition method, the acquisition tasks in step 2 comprise an HTML page acquisition task and a dynamic content acquisition task. The HTML page acquisition task contains jump code that jumps to the next URL to be acquired. The dynamic content acquisition task contains not only jump code but also JavaScript code for obtaining the corresponding interface parameters and JavaScript code for acquiring the page.

The invention also provides a man-in-the-middle-based Internet data acquisition system, comprising:

module 1, which establishes a man-in-the-middle for a webpage information acquisition device by installing a man-in-the-middle proxy certificate on the device, wherein the man-in-the-middle proxies all network traffic of the device when the device accesses webpage information on the Internet;

module 2, in which the man-in-the-middle obtains an acquisition task containing a URL regular expression for the webpages to be acquired, captures the traffic among all network traffic that matches the URL regular expression as intermediate traffic, injects the acquisition task into the HTML page of the intermediate traffic to obtain a page to be parsed, and stores the page to be parsed in a first database;

module 3, in which a parsing module distributes the page to be parsed to a parser instance for parsing according to the URL information of the page to be parsed in the first database, obtains from the parsing a webpage acquisition result containing structured data, and stores the webpage acquisition result in a second database.

In the man-in-the-middle-based Internet data acquisition system, module 2 comprises: the man-in-the-middle decrypts the encrypted content in the network traffic according to the HTTPS security certificate configured on the webpage information acquisition device.

In the man-in-the-middle-based Internet data acquisition system, the acquisition task in module 2 is generated as follows: the acquisition task is generated from pre-configured seed information, or a new acquisition task is generated from a webpage acquisition result that has already been obtained.

In the man-in-the-middle-based Internet data acquisition system, module 2 comprises: intercepting some HTTP/HTTPS requests according to the configured URL regular expression and returning empty content, so as to improve acquisition efficiency.

In the man-in-the-middle-based Internet data acquisition system, the acquisition tasks in module 2 comprise an HTML page acquisition task and a dynamic content acquisition task. The HTML page acquisition task contains jump code that jumps to the next URL to be acquired. The dynamic content acquisition task contains not only jump code but also JavaScript code for obtaining the corresponding interface parameters and JavaScript code for acquiring the page.

The advantages of the invention are as follows:

The invention provides a data acquisition method and system based on man-in-the-middle attack that can support data acquisition for all applications that rely on an integrated browser kernel to provide information, covers the various ways of requesting webpages, and allows flexible parsing and configuration of structured data. The acquisition process is modular, with each module performing a single function, which greatly improves data capture efficiency and greatly reduces the difficulty of acquiring data from a variety of applications.

The invention applies the man-in-the-middle attack technique in a data acquisition system, modularizes each processing stage, and gives each module a single function, which improves the working efficiency of the whole system and makes horizontal scaling easier. In addition, the invention introduces a Redis message queue to decouple the modules; because Redis offers high throughput, high availability, and easy scaling, using it as the storage medium greatly improves the efficiency and stability of the invention.

Drawings

FIG. 1 is a block diagram of a crawler system according to an embodiment of the present invention.

Detailed Description

In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.

The invention uses the AnyProxy proxy tool to proxy all HTTP/HTTPS traffic of the client application, and makes decryption of HTTPS-encrypted data possible by installing a security certificate on the corresponding acquisition device in advance.

The technical scheme of the invention is as follows:

A data acquisition method based on man-in-the-middle attack comprises the following steps:

1) The application and device to be collected from form the acquisition subject. They only need to have the man-in-the-middle proxy configured and the man-in-the-middle proxy certificate installed, after which any page of the application to be collected from is accessed for initialization.

2) The man-in-the-middle proxy module is mainly responsible for carrying out the man-in-the-middle attack, and mainly does the following:

a) Intercepts requests in order to filter out invalid traffic, including but not limited to CSS files, JavaScript files, image files, and other resource files.

b) Decrypts encrypted content using the pre-configured HTTPS certificate.

c) Captures the traffic to be acquired according to the URL regular expression and stores it in the Redis database cache.

d) Requests a new task from the task scheduling module and injects it into HTML pages that match the URL regular expression. The task contains the webpage to be crawled next; because the client is a closed application, it is difficult to deliver a new task to it directly, so the invention delivers the task by injecting it into a webpage that is returned to the client.

3) The parsing module distributes different pages to different parser instances according to the URL information of the Response data, obtains structured data from them and stores it in the MongoDB database, and stores data relevant to new tasks in the Redis database for the task scheduling module to use when generating new tasks. Here, Response data means the response that the application server returns to the client.

4) The task scheduling module generates task content from pre-configured basic information such as seed tasks, and can also generate new tasks from the data produced by the parsing module; it de-duplicates tasks by URL and recovers failed tasks.

5) The data storage module is mainly responsible for storing the relevant data and for decoupling the functions of the modules. The decoupling works as follows: the data storage module records the data that the other functional modules need to write and lets them read the data they need, so no functional module has to interact directly with another; they only need to share the data storage module.
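The decoupling described in item 5) can be illustrated with a minimal sketch. It assumes an in-memory `SharedStore` class standing in for the Redis cache; the class, the queue names `captured` and `new_task_data`, and the functions `proxy_capture` and `parser_step` are hypothetical names introduced only for illustration and do not appear in the original disclosure.

```python
from collections import defaultdict, deque


class SharedStore:
    """In-memory stand-in for the Redis cache: a set of named FIFO queues."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def push(self, queue_name, item):
        self._queues[queue_name].append(item)

    def pop(self, queue_name):
        queue = self._queues[queue_name]
        return queue.popleft() if queue else None


def proxy_capture(store, url, body):
    # The proxy side only writes captured responses; it never calls the parser.
    store.push("captured", {"url": url, "body": body})


def parser_step(store, results):
    # The parser side only reads from the store; it never calls the proxy.
    page = store.pop("captured")
    if page is None:
        return False
    results.append({"url": page["url"], "length": len(page["body"])})
    # Data relevant to new tasks goes back into the store for the scheduler.
    store.push("new_task_data", page["url"])
    return True
```

Because each function touches only the shared store, either side could be replaced or scaled out without changing the other, which is the property the data storage module is described as providing.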

Further, step 1) mainly consists of installing and configuring the network address, network port, and security certificate of the man-in-the-middle proxy in the acquisition environment; the whole acquisition process can then be started by accessing any initial page in the application to be collected from.

Further, the man-in-the-middle proxy module in step 2) has four main functions:

a) Because the security certificate issued by the man-in-the-middle proxy module is installed on the client application device, the module can, as a man-in-the-middle, intercept the client's HTTPS-encrypted traffic and view its plaintext content.

b) When a client Request reaches the man-in-the-middle proxy, the proxy module checks whether the URL of the Request meets the filtering condition. If it does, the proxy intercepts the Request and directly returns empty content; otherwise it forwards the Request to the target server. For example, according to the configured URL regular expression, some HTTP/HTTPS requests, including those for CSS files, JavaScript files, image files, and the like, are intercepted and answered with empty content, which improves acquisition efficiency. CSS files are used for rendering and occupy a large amount of computing resources, JavaScript files consume a large amount of resources when the browser executes their code, and resource files such as images and audio occupy a large amount of network bandwidth, so filtering and intercepting them greatly improves acquisition efficiency and stability.

c) When the target server returns the corresponding Response, the man-in-the-middle proxy checks whether its URL identifies content to be acquired, and if so, stores the entire Response in the Redis database.

d) When the target server returns the corresponding Response and the content is an HTML page, the man-in-the-middle proxy checks whether its URL matches a specific regular expression; if so, it requests the corresponding task from the task scheduling module and injects it into a <script> tag in the HTML page. It then passes the Response on to the client application.
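Functions b) to d) above can be sketched as two small handlers. This is a language-neutral illustration written in Python, not AnyProxy rule code: the patterns `BLOCK_PATTERN` and `CAPTURE_PATTERN`, the example host names, and the function names are all assumptions, a plain list stands in for the Redis cache, and a single pattern is used for both capture and injection for brevity.

```python
import re

# Hypothetical patterns; a real deployment would configure its own.
BLOCK_PATTERN = re.compile(r"\.(css|js|png|jpe?g|gif|mp3|mp4)(\?.*)?$", re.IGNORECASE)
CAPTURE_PATTERN = re.compile(r"^https://news\.example\.com/article/\d+$")


def handle_request(url):
    """Return an empty response for filtered URLs, or None to forward the request."""
    if BLOCK_PATTERN.search(url):
        return {"status": 200, "body": ""}
    return None


def handle_response(url, content_type, body, cache, next_task_js):
    """Cache matching responses and inject the next task into matching HTML pages."""
    if not CAPTURE_PATTERN.match(url):
        return body
    # Store the original response before any injection takes place.
    cache.append({"url": url, "body": body})
    if "text/html" in content_type and next_task_js:
        script = "<script>" + next_task_js + "</script>"
        if "</body>" in body:
            return body.replace("</body>", script + "</body>", 1)
        return body + script
    return body
```

Here `next_task_js` represents whatever task code the task scheduling module would hand to the proxy; the page stored in the cache is the unmodified one, and only the copy returned to the client carries the injected script.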

Further, the parsing module in step 3) takes the content to be parsed out of the Redis data cache, assigns it to different parser instances according to its URL, stores the structured data in MongoDB once parsing is complete, and stores information relevant to the next acquisition, such as URLs and cookies, in the Redis database cache.
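The URL-based distribution to parser instances might look like the following sketch. The URL patterns, the two parser functions, and the record fields are hypothetical; plain lists stand in for the Redis cache (`cache`, `follow_up`) and the MongoDB collection (`documents`).

```python
import re


def parse_article(page):
    # Extract the page title as an example of structured data.
    match = re.search(r"<title>(.*?)</title>", page["body"], re.DOTALL)
    title = match.group(1).strip() if match else None
    return {"type": "article", "url": page["url"], "title": title}


def parse_list(page):
    # Extract outgoing links, which can seed later acquisition tasks.
    links = re.findall(r'href="([^"]+)"', page["body"])
    return {"type": "list", "url": page["url"], "links": links}


# Each entry maps a URL pattern to the parser responsible for it.
PARSERS = [
    (re.compile(r"/article/\d+$"), parse_article),
    (re.compile(r"/list(\?.*)?$"), parse_list),
]


def dispatch(page):
    for pattern, parser in PARSERS:
        if pattern.search(page["url"]):
            return parser(page)
    return None


def run_parsers(cache, documents, follow_up):
    # Drain the cache, store structured records, and keep next-task data.
    while cache:
        page = cache.pop(0)
        record = dispatch(page)
        if record is None:
            continue
        documents.append(record)
        follow_up.extend(record.get("links", []))
```

Adding support for a new page type then only requires registering another pattern and parser pair, which matches the flexible, URL-driven configuration the description emphasizes.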

Further, the task scheduling module in step 4) mainly covers task generation, task scheduling, task deduplication, and task recovery.

Further, tasks are generated in two main ways: from pre-configured seed information, or from information that has already been acquired. Tasks also fall into two main types: simple HTML page acquisition tasks, and tasks for dynamic information such as JSON data, which require the relevant task parameters, cookies, and so on, and complete the acquisition by executing the relevant JavaScript code.
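The two task types can be sketched as small generators of the JavaScript that would be injected into a page. The exact code the invention injects is not given in the text, so the snippets below are assumptions: the function names are hypothetical, and the dynamic task simply issues a browser `fetch` call (so that the response passes through the proxy) before jumping to the next URL.

```python
import json
from urllib.parse import urlencode


def html_page_task(next_url):
    """Jump code only: navigate the embedded browser to the next URL."""
    # json.dumps yields a correctly quoted JavaScript string literal.
    return "window.location.href = " + json.dumps(next_url) + ";"


def dynamic_content_task(api_url, params, next_url):
    """Request a data interface with the given parameters, then jump onward."""
    request_url = api_url + "?" + urlencode(params)
    return (
        "fetch(" + json.dumps(request_url) + ", {credentials: 'include'})"
        ".finally(function () { " + html_page_task(next_url) + " });"
    )
```

For example, `html_page_task("https://a.example/p?x=1")` produces `window.location.href = "https://a.example/p?x=1";`. In this sketch the cookies mentioned above are sent by the browser itself through the `credentials: 'include'` option rather than being carried in the task.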

Further, task scheduling mainly assigns different acquisition tasks to different applications and controls the acquisition rate of the tasks to avoid being blocked.
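One simple way to realize per-application assignment with rate control is a minimum interval between tasks handed to the same application. The text does not specify the rate-control policy, so the fixed-interval rule, the `RateLimitedScheduler` class, and the application names used below are assumptions made for illustration.

```python
class RateLimitedScheduler:
    """Per-application task queues with a minimum interval between tasks."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self.queues = {}       # application name -> list of pending tasks
        self.last_issued = {}  # application name -> time of last issued task

    def add_task(self, app, task):
        self.queues.setdefault(app, []).append(task)

    def next_task(self, app, now):
        """Return the next task for this application, or None if it must wait."""
        queue = self.queues.get(app)
        if not queue:
            return None
        last = self.last_issued.get(app)
        if last is not None and now - last < self.min_interval:
            return None
        self.last_issued[app] = now
        return queue.pop(0)
```

The current time is passed in as `now` so that the policy can be exercised without real waiting; in practice a caller would supply `time.monotonic()`.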

Further, task deduplication is mainly performed by URL; a single unified deduplication queue in the Redis database suffices, and each time a task is generated, the queue is queried to check whether the URL has already been accessed.
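URL-based deduplication can be sketched with a set of seen URLs. The sketch uses an in-memory Python set as a stand-in for the unified deduplication structure in Redis; `DedupFilter` and `generate_tasks` are hypothetical names.

```python
class DedupFilter:
    """Remembers every URL that has already been turned into a task."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Record the URL and return True only the first time it is seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


def generate_tasks(candidate_urls, dedup):
    # Keep only URLs that have not been scheduled before, preserving order.
    return [url for url in candidate_urls if dedup.add_if_new(url)]
```

With a Redis set, the same check-and-record step could plausibly be done in one round trip, since the `SADD` command reports how many members were newly added.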

Further, the task recovery function identifies tasks whose acquisition failed and reschedules them at an appropriate time.
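The text does not say how a failed acquisition is detected, so the following sketch assumes one plausible criterion: a task counts as failed if no result has been recorded for it within a timeout, and it is retried a bounded number of times. The `TaskRecovery` class and its parameters are hypothetical.

```python
class TaskRecovery:
    """Tracks issued tasks and reports those that timed out without a result."""

    def __init__(self, timeout_seconds, max_retries):
        self.timeout = timeout_seconds
        self.max_retries = max_retries
        self.in_flight = {}  # url -> (time issued, attempt number)

    def mark_issued(self, url, now, attempts=1):
        self.in_flight[url] = (now, attempts)

    def mark_done(self, url):
        self.in_flight.pop(url, None)

    def collect_failed(self, now):
        """Return (tasks to retry with their next attempt number, dropped URLs)."""
        retry, dropped = [], []
        for url, (issued_at, attempts) in list(self.in_flight.items()):
            if now - issued_at < self.timeout:
                continue  # still within its time budget
            del self.in_flight[url]
            if attempts < self.max_retries:
                retry.append((url, attempts + 1))
            else:
                dropped.append(url)
        return retry, dropped
```

A scheduler would call `collect_failed` periodically, re-issue the returned tasks through `mark_issued` with the incremented attempt number, and log or discard the dropped ones.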

The beneficial effects of the invention are as follows:

The data acquisition method and system based on man-in-the-middle attack can support data acquisition tasks for applications built around a browser kernel, can be configured flexibly through URL regular expressions, and has a modular crawling process in which each module performs a single function; it greatly improves the efficiency of acquiring data from applications and has good applicability and generality.

In FIG. 1, the client application device is the device on which the application requiring information acquisition is installed, and the man-in-the-middle proxy must be configured on this device. Once configuration is complete, the man-in-the-middle proxy module proxies all traffic between the client device and the application server and carries out the 'man-in-the-middle attack' when appropriate; it also interacts with the task scheduling module and the Redis database. Finally, the parser parses the acquired data and stores the parsed data in the MongoDB database.

The data acquisition system based on man-in-the-middle attack comprises five modules. The processing is modular, each module has a single function, and the modules are decoupled through the Redis data cache, which greatly improves the efficiency of the crawler system, ensures stable data capture, and greatly reduces the system's operation and maintenance cost.

The following is a system example corresponding to the above method example, and this embodiment mode may be implemented in cooperation with the above embodiment mode. The related technical details mentioned in the above embodiments are still valid in this embodiment, and in order to reduce repetition, they are not repeated here. Accordingly, the related technical details mentioned in the present embodiment can also be applied to the above-described embodiments.

In the system for acquiring webpage information on the Internet, the acquisition task in module 2 is generated as follows: the acquisition task is generated from pre-configured seed information, or a new acquisition task is generated from a webpage acquisition result that has already been obtained.

In the system for acquiring webpage information on the Internet, module 2 comprises: intercepting some HTTP/HTTPS requests according to the configured URL regular expression and returning empty content, so as to improve acquisition efficiency.

In the system for acquiring webpage information on the Internet, the acquisition tasks in module 2 comprise an HTML page acquisition task and a dynamic content acquisition task. The HTML page acquisition task contains jump code that jumps to the next URL to be acquired. The dynamic content acquisition task contains not only jump code but also JavaScript code for obtaining the corresponding interface parameters and JavaScript code for acquiring the page.

Claims

1. A man-in-the-middle-based Internet data acquisition method, characterized by comprising the following steps:

step 2, the man-in-the-middle obtains an acquisition task containing a URL regular expression for the webpages to be acquired, captures the HTML pages among all network traffic that match the URL regular expression, stores each such HTML page in a first database as a page to be parsed, injects a webpage to be crawled into the page to be parsed, and returns it to the webpage information acquisition device;

step 3, a parsing module distributes the page to be parsed to a parser instance for parsing according to the URL information of the page to be parsed in the first database, obtains from the parsing result a webpage acquisition result containing structured data, and stores the webpage acquisition result in a second database;

the man-in-the-middle decrypts the encrypted content in the network traffic according to the HTTPS security certificate configured on the webpage information acquisition device;

the acquisition task in step 2 is generated as follows: the acquisition task is generated from pre-configured seed information, or a new acquisition task is generated from a webpage acquisition result that has already been obtained; step 2 comprises: intercepting some HTTP/HTTPS requests according to the configured URL regular expression and returning empty content, so as to improve acquisition efficiency; and the acquisition tasks in step 2 comprise an HTML page acquisition task and a dynamic content acquisition task, wherein the HTML page acquisition task contains jump code that jumps to the next URL to be acquired, and the dynamic content acquisition task contains not only jump code but also JavaScript code for obtaining the corresponding interface parameters and JavaScript code for acquiring the page.

2. A man-in-the-middle-based Internet data acquisition system, comprising:

module 2, in which the man-in-the-middle obtains an acquisition task containing a URL regular expression for the webpages to be acquired, captures the HTML pages among all network traffic that match the URL regular expression, stores each such HTML page in a first database as a page to be parsed, injects a webpage to be crawled into the page to be parsed, and returns it to the webpage information acquisition device;

module 3, in which a parsing module distributes the page to be parsed to a parser instance for parsing according to the URL information of the page to be parsed in the first database, obtains from the parsing a webpage acquisition result containing structured data, and stores the webpage acquisition result in a second database;

the acquisition task in module 2 is generated as follows: the acquisition task is generated from pre-configured seed information, or a new acquisition task is generated from a webpage acquisition result that has already been obtained; module 2 comprises: intercepting some HTTP/HTTPS requests according to the configured URL regular expression and returning empty content, so as to improve acquisition efficiency; and the acquisition tasks in module 2 comprise an HTML page acquisition task and a dynamic content acquisition task, wherein the HTML page acquisition task contains jump code that jumps to the next URL to be acquired, and the dynamic content acquisition task contains not only jump code but also JavaScript code for obtaining the corresponding interface parameters and JavaScript code for acquiring the page.