CN110781367A - Internet data acquisition method and system based on man-in-the-middle

Info

Publication number: CN110781367A
Application number: CN201910909270.7A
Authority: CN
Inventors: 程学旗; 史存会; 胡耀康; 朱运昌; 俞晓明; 刘悦
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2019-09-25
Filing date: 2019-09-25
Publication date: 2020-02-11
Anticipated expiration: 2039-09-25
Also published as: CN110781367B

Abstract

The invention provides a man-in-the-middle-based internet data acquisition method and system. The method comprises: establishing a man-in-the-middle for a web page information acquisition device by installing a man-in-the-middle proxy certificate on the device, wherein the man-in-the-middle proxies all network traffic of the device when the device accesses web page information on the internet; the man-in-the-middle obtains a collection task containing a URL regular expression of the web pages to be collected, captures the traffic that matches the URL regular expression among all network traffic as intermediate traffic, injects the collection task into an HTML page of the intermediate traffic to obtain a page to be parsed, and stores that page in a first database; and a parsing module dispatches the page to be parsed to a parser instance according to the URL information of the page in the first database, obtains a web page collection result containing structured data, and stores the result in a second database. The invention can support data acquisition for all applications that rely on an integrated browser kernel to provide information.

Description

Internet data acquisition method and system based on man-in-the-middle

Technical Field

The invention relates to the field of web crawlers, and in particular to a data acquisition method and system based on a man-in-the-middle attack, in which a man-in-the-middle proxy modifies traffic data to continuously inject different task code into different applications, so that requests for different pages are completed and the related data is acquired.

Background

A web crawler can use various existing resources to automatically capture a large amount of web page information on the internet, and is sometimes called a "web spider". However, with the spread of the mobile internet, more and more traffic is distributed directly through various terminal applications, which either provide no web access or expose only part of their data through it. This makes data acquisition much more difficult.

The crawling process of a web crawler comprises five stages: obtaining a request URL, sending a web request to download the page, parsing structured data from the page, filtering duplicate data, and processing seed tasks. Each stage consumes different resources, and a failure in any stage affects the efficiency and stability of the whole crawler system. In addition, as internet technology changes, more and more information cannot be obtained through traditional web channels. A large amount of information is distributed through specific applications, typically mobile information applications, and much of the data is delivered through asynchronous requests or encrypted with HTTPS. At present there is no general-purpose data acquisition system that is compatible with various applications and various types of data.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a man-in-the-middle-based internet data acquisition method, which comprises the following steps:

step 1, establishing a man-in-the-middle for the web page information acquisition device by installing a man-in-the-middle proxy certificate on the device, wherein the man-in-the-middle proxies all network traffic of the device when the device accesses web page information on the internet;

step 2, the man-in-the-middle obtains a collection task containing a URL regular expression of the web pages to be collected, captures the traffic that matches the URL regular expression among all network traffic as intermediate traffic, injects the collection task into an HTML page of the intermediate traffic to obtain a page to be parsed, and stores that page in a first database;

step 3, a parsing module dispatches the page to be parsed to a parser instance according to the URL information of the page in the first database, obtains a web page collection result containing structured data, and stores the result in a second database.

In the man-in-the-middle-based internet data acquisition method, step 2 comprises: the man-in-the-middle decrypts the encrypted content in the network traffic using the HTTPS security certificate configured on the web page information acquisition device.

In the man-in-the-middle-based internet data acquisition method, the collection task in step 2 is generated as follows: the collection task is generated from preconfigured seed information, or a new collection task is generated from web page collection results that have already been obtained.

In the man-in-the-middle-based internet data acquisition method, step 2 comprises: intercepting part of the HTTP/HTTPS requests according to the configured URL regular expression and returning empty content, so as to improve collection efficiency.

In the man-in-the-middle-based internet data acquisition method, the collection task in step 2 is either an HTML page collection task or a dynamic content collection task. The HTML page collection task contains jump code that jumps to the URL to be collected next. The dynamic content collection task contains, in addition to jump code, JavaScript code for obtaining the corresponding interface parameters and collecting the page.
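The two task types can be illustrated with a minimal Python sketch that renders a task into the JavaScript text to be injected. The patent does not specify the injected code, so everything here is an assumption made for illustration: the `CollectTask` fields, the function name, the use of `window.location.href` as the jump code, and the use of `fetch` for the dynamic request.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CollectTask:
    """Hypothetical task record; the field names are illustrative only."""
    next_url: str                      # URL to jump to for the next collection
    api_url: Optional[str] = None      # set only for dynamic content tasks
    params: dict = field(default_factory=dict)


def render_task_script(task: CollectTask) -> str:
    """Return JavaScript source for the task (without <script> tags)."""
    jump = f"window.location.href = {json.dumps(task.next_url)};"
    if task.api_url is None:
        # HTML page collection task: jump code only.
        return jump
    # Dynamic content collection task: call the interface, then jump.
    return (
        f"fetch({json.dumps(task.api_url)}, "
        f"{{method: 'POST', body: JSON.stringify({json.dumps(task.params)})}})"
        f".finally(function () {{ {jump} }});"
    )
```

In this sketch the response to the `fetch` call would itself pass through the man-in-the-middle proxy, which is where it could be captured; the page never needs to report data back on its own.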

The invention also provides a man-in-the-middle-based internet data acquisition system, which comprises:

module 1, which establishes a man-in-the-middle for the web page information acquisition device by installing a man-in-the-middle proxy certificate on the device, wherein the man-in-the-middle proxies all network traffic of the device when the device accesses web page information on the internet;

module 2, in which the man-in-the-middle obtains a collection task containing a URL regular expression of the web pages to be collected, captures the traffic that matches the URL regular expression among all network traffic as intermediate traffic, injects the collection task into an HTML page of the intermediate traffic to obtain a page to be parsed, and stores that page in a first database;

module 3, in which a parsing module dispatches the page to be parsed to a parser instance according to the URL information of the page in the first database, obtains a web page collection result containing structured data, and stores the result in a second database.

In the man-in-the-middle-based internet data acquisition system, module 2 comprises: the man-in-the-middle decrypts the encrypted content in the network traffic using the HTTPS security certificate configured on the web page information acquisition device.

In the man-in-the-middle-based internet data acquisition system, the collection task in module 2 is generated as follows: the collection task is generated from preconfigured seed information, or a new collection task is generated from web page collection results that have already been obtained.

In the man-in-the-middle-based internet data acquisition system, module 2 comprises: intercepting part of the HTTP/HTTPS requests according to the configured URL regular expression and returning empty content, so as to improve collection efficiency.

In the man-in-the-middle-based internet data acquisition system, the collection task in module 2 is either an HTML page collection task or a dynamic content collection task. The HTML page collection task contains jump code that jumps to the URL to be collected next. The dynamic content collection task contains, in addition to jump code, JavaScript code for obtaining the corresponding interface parameters and collecting the page.

As the above scheme shows, the invention has the following advantages:

The invention provides a data acquisition method and system based on a man-in-the-middle attack. It supports data acquisition for all applications that provide information through an integrated browser kernel, covers various types of web page request modes, and allows flexible configuration of structured data parsing. The acquisition process is modularized by function, which greatly improves data capture efficiency and greatly reduces the difficulty of acquiring data from various applications.

The invention applies man-in-the-middle attack technology to a data acquisition system and modularizes each processing stage so that each module has a single function, which improves the working efficiency of the whole system and makes horizontal scaling more convenient and simpler. In addition, a Redis message queue is introduced to decouple the modules. Because Redis offers high throughput, high availability and easy scalability, introducing Redis as the storage medium greatly improves the efficiency and stability of the invention.

Drawings

FIG. 1 is an architecture diagram of a crawler system according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

The invention uses the AnyProxy proxy tool to proxy all HTTP/HTTPS traffic of the client application, and ensures that HTTPS-encrypted data can be decrypted by installing a security certificate on the corresponding acquisition device in advance.

The technical scheme of the invention is as follows:

A data acquisition method based on a man-in-the-middle attack comprises the following steps:

1) The application to be collected and its device form the collection subject. They only need to be configured to use the man-in-the-middle proxy and to have the man-in-the-middle proxy certificate installed; initialization is then completed by having the application to be collected access any page.

2) The man-in-the-middle proxy module is mainly responsible for carrying out the man-in-the-middle attack, and its main functions are:

a) Intercepting requests to filter out useless traffic, including but not limited to requests for resource files such as CSS files, JavaScript files and image files.

b) Decrypting the encrypted content using the preconfigured HTTPS certificate.

c) Capturing the traffic to be collected according to the URL regular expression and storing it in the Redis database cache.

d) Requesting a new task from the task scheduling module and injecting the task into an HTML page that matches the URL regular expression. The task contains the next web pages to be crawled. Because the client is a closed environment, it is difficult for the client to fetch new tasks directly, so the invention delivers tasks to the client by injecting them into the web pages returned to the client.

3) The parsing module dispatches different pages to different parser instances according to the URL information of the Response data, obtains the structured data from the parsers and stores it in the MongoDB database, and stores data related to new tasks in the Redis database for the task scheduling module to use when generating new tasks. The Response data is the application server's response to the client, that is, the data returned by the server.

4) The task scheduling module generates task content from preconfigured basic information such as seed tasks, and can also generate new tasks from the data parsed by the parsing module. It also deduplicates tasks by URL and implements recovery of failed tasks.

5) The data storage module is mainly responsible for storing the relevant data and for decoupling the functions of the other modules. The decoupling works as follows: the data storage module records the data that the other functional modules need to write and lets them read the data they need, so no functional module has to interact directly with another; each only needs the shared data storage module.
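The decoupling described in step 5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `SharedStore` is an in-memory stand-in for the Redis cache, and the queue names `pages` and `results`, the function names, and the record fields are all hypothetical.

```python
from collections import defaultdict, deque


class SharedStore:
    """In-memory stand-in for the shared data store (Redis in the patent)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def push(self, queue_name, item):
        self._queues[queue_name].append(item)

    def pop(self, queue_name):
        """Return the oldest item in the queue, or None if it is empty."""
        queue = self._queues[queue_name]
        return queue.popleft() if queue else None


def proxy_capture(store, url, body):
    # The proxy module only writes captured pages; it never calls the parser.
    store.push("pages", {"url": url, "body": body})


def parser_step(store):
    # The parsing module only reads from the store; it never calls the proxy.
    page = store.pop("pages")
    if page is None:
        return False
    store.push("results", {"url": page["url"], "length": len(page["body"])})
    return True
```

Because the two functions share only the store, either side could be replaced or run as several instances without the other noticing, which is the property the text attributes to the data storage module.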

Further, step 1) mainly consists of installing and configuring the network address, the network port and the security certificate of the man-in-the-middle proxy in the acquisition environment. The whole acquisition process starts once the application to be collected accesses any initial page.

Further, the man-in-the-middle proxy module in step 2) has four main functions:

a) Because the security certificate issued by the man-in-the-middle proxy module is installed on the client application device, the man-in-the-middle can intercept the client's HTTPS-encrypted traffic and view its plaintext content.

b) When a Request from the client reaches the man-in-the-middle proxy, the proxy module checks whether the URL of the Request matches the filter conditions. If it does, the Request is intercepted and empty content is returned directly; otherwise the Request is forwarded to the target server. For example, according to the configured URL regular expression, requests for CSS files, JavaScript files, image files and other useless traffic that would reduce collection efficiency are intercepted and answered with empty content. CSS files are used for rendering and occupy a large amount of computing resources, JavaScript files consume a large amount of resources when the browser executes their code, and resource files such as images and audio occupy a large amount of network bandwidth, so filtering them greatly improves collection efficiency and stability.

c) When the target server returns the corresponding Response, the man-in-the-middle proxy checks whether the URL identifies content that needs to be collected, and if so stores the whole Response in the Redis database.

d) When the target server returns the corresponding Response and the content is an HTML page, the man-in-the-middle proxy checks whether its URL matches a specific regular expression. If so, the proxy requests the corresponding task from the task scheduling module and injects the task into a <script> tag in the HTML page, and then forwards the Response to the client application.
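Functions b) to d) can be sketched as two hooks, one for requests and one for responses. The Python below is only an illustration of the decision logic and is not AnyProxy code: the three regular expressions, the host `news.example.com`, the function names, the list used as a stand-in for the Redis cache, and the choice to store the response as received (before injection) are all assumptions.

```python
import re

# Hypothetical rules; the patent only says they are configured URL regexes.
BLOCK_RE = re.compile(r"\.(css|js|png|jpe?g|gif|mp3)(\?.*)?$", re.IGNORECASE)
CAPTURE_RE = re.compile(r"^https://news\.example\.com/(article|api)/")
INJECT_RE = re.compile(r"^https://news\.example\.com/article/")


def on_request(url):
    """Return an empty response for filtered URLs, or None to forward."""
    if BLOCK_RE.search(url):
        return {"status": 200, "body": ""}
    return None


def on_response(url, content_type, body, cache, next_task_script):
    """Capture matching responses and inject the next task into HTML pages."""
    if CAPTURE_RE.search(url):
        cache.append({"url": url, "body": body})  # stand-in for the Redis cache
    if content_type.startswith("text/html") and INJECT_RE.search(url):
        script = "<script>" + next_task_script() + "</script>"
        if "</body>" in body:
            # Insert the task just before the closing body tag.
            body = body.replace("</body>", script + "</body>", 1)
        else:
            body += script
    return body
```

Here `next_task_script` is a callable standing in for the request to the task scheduling module. Decryption (function a) is not shown, since it is handled by the proxy tool once the certificate is trusted by the device.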

Further, the parsing module of step 3) takes the content to be parsed out of the Redis data cache and assigns it to different parser instances according to its URL. After parsing is completed, the structured data is stored in MongoDB, and information related to the next collection, such as URLs and Cookies, is stored in the Redis database cache.
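URL-based dispatch to parser instances might look like the following Python sketch. The registry, the decorator name `parser_for`, the URL patterns and the two example parsers are hypothetical; the patent states only that pages are assigned to parsers according to their URL.

```python
import json
import re

PARSERS = []  # list of (compiled URL regex, parser function)


def parser_for(pattern):
    """Decorator that registers a parser for URLs matching the pattern."""
    def register(func):
        PARSERS.append((re.compile(pattern), func))
        return func
    return register


@parser_for(r"/article/\d+$")
def parse_article(body):
    match = re.search(r"<title>(.*?)</title>", body, re.DOTALL)
    return {"type": "article", "title": match.group(1).strip() if match else None}


@parser_for(r"/api/list")
def parse_list(body):
    data = json.loads(body)
    return {"type": "list", "next_urls": data.get("urls", [])}


def dispatch(page):
    """Route a cached page to the first parser whose URL pattern matches."""
    for pattern, func in PARSERS:
        if pattern.search(page["url"]):
            return func(page["body"])
    return None
```

In a fuller system the `article` result would go to the structured data store and the `next_urls` of the `list` result would go back to the cache for the task scheduling module.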

Further, the task scheduling module of step 4) mainly comprises task generation, task scheduling, task deduplication and task recovery.

Further, tasks are generated in two ways: from preconfigured seed information, or from information that has already been collected. Tasks also fall into two main types. One is the simple task of collecting an HTML page. The other collects dynamically loaded data such as JSON (JavaScript Object Notation); this type of task requires related task parameters, Cookies and other information, and executes the related JavaScript code to complete the collection.
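The two generation modes and two task types can be sketched in Python as below. The dictionary layout (`kind`, `url`, `params`, `cookies`) and the keys read from the parsed result (`next_urls`, `api`, `cookies`) are assumptions chosen for the example; the patent does not define a task format.

```python
def tasks_from_seeds(seeds):
    """Generation mode 1: one HTML page task per preconfigured seed URL."""
    return [{"kind": "html", "url": url} for url in seeds]


def tasks_from_result(result):
    """Generation mode 2: derive new tasks from an already parsed result."""
    tasks = [{"kind": "html", "url": url} for url in result.get("next_urls", [])]
    api = result.get("api")
    if api:
        # Dynamic tasks carry the parameters and Cookies the request needs.
        tasks.append({
            "kind": "dynamic",
            "url": api["url"],
            "params": api.get("params", {}),
            "cookies": result.get("cookies", {}),
        })
    return tasks
```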

Further, task scheduling mainly assigns different collection tasks to different applications and controls the collection rate of each, so that the collection is not blocked.
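One simple way to realize per-application assignment with rate control is a queue per application plus a minimum interval between issued tasks. The Python sketch below assumes that policy; the class name, the `min_interval` parameter and the injectable `clock` are illustrative and are not taken from the patent.

```python
import time
from collections import defaultdict, deque


class AppScheduler:
    """Per-application task queues with a minimum interval between tasks."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.queues = defaultdict(deque)
        self.last_issued = {}

    def add(self, app, task):
        self.queues[app].append(task)

    def next_task(self, app):
        """Return the next task for this app, or None if it must wait."""
        now = self.clock()
        last = self.last_issued.get(app)
        if last is not None and now - last < self.min_interval:
            return None  # too soon: throttle this application
        if not self.queues[app]:
            return None
        self.last_issued[app] = now
        return self.queues[app].popleft()
```

When `next_task` returns None the proxy could simply inject nothing, or inject a delayed retry; which of these is appropriate depends on the application being collected.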

Further, task deduplication is mainly based on URLs. Only one unified deduplication queue is needed in the Redis database: each time a task is generated, the queue is queried to check whether the URL has already been visited.
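URL-based deduplication can be sketched as follows. A Python `set` is used here as a stand-in for the single shared structure in Redis; the class and function names are hypothetical.

```python
class UrlDeduplicator:
    """Remembers which task URLs have already been issued."""

    def __init__(self):
        self._seen = set()  # stand-in for one shared structure in Redis

    def is_new(self, url):
        """Record the URL and report whether it had not been seen before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


def generate_unique(urls, dedup):
    """Keep only the URLs that have not been visited yet."""
    return [url for url in urls if dedup.is_new(url)]
```

With a real Redis set, the check and the insert could be done in one step, since the `SADD` command reports how many members were newly added.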

Further, the task recovery function identifies tasks whose collection failed and reschedules them at an appropriate time.
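The patent does not say how failed tasks are detected. One possible approach, shown in the Python sketch below, is to treat a task as failed when no captured page confirms it within a timeout and to requeue it a limited number of times. The timeout policy, the retry limit and all names are assumptions.

```python
class RecoveryTracker:
    """Tracks issued tasks and requeues those not confirmed in time."""

    def __init__(self, timeout, max_retries=3):
        self.timeout = timeout
        self.max_retries = max_retries
        self.in_flight = {}  # url -> (issued_at, retries so far)

    def issued(self, url, now, retries=0):
        self.in_flight[url] = (now, retries)

    def confirmed(self, url):
        """Call when the proxy has captured the page for this task."""
        self.in_flight.pop(url, None)

    def collect_failed(self, now):
        """Return (url, retry count) pairs to reissue; drop exhausted tasks."""
        retry = []
        for url, (issued_at, retries) in list(self.in_flight.items()):
            if now - issued_at < self.timeout:
                continue
            del self.in_flight[url]
            if retries < self.max_retries:
                retry.append((url, retries + 1))
        return retry
```

The caller is expected to pass each returned pair back to `issued` when the task is actually reissued, so that the retry count carries over.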

The invention has the following beneficial effects:

The data acquisition method and system based on a man-in-the-middle attack support data acquisition tasks for applications built around a browser kernel and can be configured flexibly through URL rules. The crawling process is modularized by function, which greatly improves the efficiency of acquiring application data and gives the invention good applicability and generality.

In FIG. 1, the client application device is a device on which an application whose information needs to be collected is installed, and the man-in-the-middle proxy must be configured on this device. Once configured, the man-in-the-middle proxy module proxies all traffic between the client device and the application server and carries out the man-in-the-middle attack at the appropriate time; it also interacts with the task scheduling module and the Redis database. Finally, the collected data is parsed by a parser and stored in the MongoDB database.

The data acquisition system based on a man-in-the-middle attack comprises five modules. The processing is modularized, each module has a single function, and the modules are decoupled through the Redis data cache, which greatly improves the efficiency of the crawler system. The stability of data capture is ensured, and the operation and maintenance cost of the system is greatly reduced.

The following is a system embodiment corresponding to the above method embodiment, and the two can be implemented in cooperation with each other. The technical details mentioned in the above embodiment remain valid in this embodiment and are not repeated here. Correspondingly, the technical details mentioned in this embodiment also apply to the above embodiment.

In the system for acquiring web page information on the internet, the collection task in module 2 is generated as follows: the collection task is generated from preconfigured seed information, or a new collection task is generated from web page collection results that have already been obtained.

In the system for acquiring web page information on the internet, module 2 comprises: intercepting part of the HTTP/HTTPS requests according to the configured URL regular expression and returning empty content, so as to improve collection efficiency.

In the system for acquiring web page information on the internet, the collection task in module 2 is either an HTML page collection task or a dynamic content collection task. The HTML page collection task contains jump code that jumps to the URL to be collected next. The dynamic content collection task contains, in addition to jump code, JavaScript code for obtaining the corresponding interface parameters and collecting the page.

Claims

1. An internet data acquisition method based on a man-in-the-middle is characterized by comprising the following steps:

2. The man-in-the-middle-based internet data acquisition method of claim 1, wherein step 2 comprises: the man-in-the-middle decrypting the encrypted content in the network traffic according to the HTTPS security certificate configured on the web page information acquisition device.

3. The man-in-the-middle-based internet data acquisition method of claim 1, wherein the generation of the collection task in step 2 comprises: generating the collection task from preconfigured seed information, or generating a new collection task from web page collection results that have already been obtained.

4. The man-in-the-middle-based internet data acquisition method of claim 1, wherein step 2 comprises: intercepting part of the HTTP/HTTPS requests according to the configured URL regular expression and returning empty content, so as to improve collection efficiency.

5. The man-in-the-middle-based internet data acquisition method of claim 1, wherein the collection task in step 2 comprises an HTML page collection task and a dynamic content collection task; the HTML page collection task contains jump code that jumps to the URL to be collected next; and the dynamic content collection task contains, in addition to jump code, JavaScript code for obtaining the corresponding interface parameters and collecting the page.

6. An internet data acquisition system based on a man-in-the-middle, comprising:

7. The man-in-the-middle-based internet data acquisition system of claim 6, wherein module 2 comprises: the man-in-the-middle decrypting the encrypted content in the network traffic according to the HTTPS security certificate configured on the web page information acquisition device.

8. The man-in-the-middle-based internet data acquisition system of claim 6, wherein the generation of the collection task in module 2 comprises: generating the collection task from preconfigured seed information, or generating a new collection task from web page collection results that have already been obtained.

9. The man-in-the-middle-based internet data acquisition system of claim 6, wherein module 2 comprises: intercepting part of the HTTP/HTTPS requests according to the configured URL regular expression and returning empty content, so as to improve collection efficiency.

10. The man-in-the-middle-based internet data acquisition system of claim 6, wherein the collection task in module 2 comprises an HTML page collection task and a dynamic content collection task; the HTML page collection task contains jump code that jumps to the URL to be collected next; and the dynamic content collection task contains, in addition to jump code, JavaScript code for obtaining the corresponding interface parameters and collecting the page.