CN115967607A

CN115967607A - Template-based distributed internet big data acquisition system and method

Info

Publication number: CN115967607A
Application number: CN202211669718.0A
Authority: CN
Inventors: 韩红; 朱正强; 张森
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2022-12-25
Filing date: 2022-12-25
Publication date: 2023-04-14

Abstract

The invention discloses a template-based distributed internet big data acquisition system and method, which mainly solve the problems of low system stability, difficult maintenance and inflexible acquisition content in the prior art. The system mainly comprises a system function management unit and a task execution unit. The system function management unit comprises a template management module, a proxy management module, a task management module and a node management module; the task execution unit comprises a main logic processing module, an exception handling module and a post-processing module. These modules respectively perform template testing and nesting, setting of proxy request intervals, generation of data acquisition tasks, monitoring of node information, task execution, handling of task exceptions, and continued acquisition for the affected tasks. The invention enhances the security and stability of the system, improves the flexibility and extensibility of the acquired content, improves the performance and portability of the data acquisition system, reduces the maintenance difficulty of the system, and can be used for artificial intelligence, big data and the Internet of Things.

Description

Template-based distributed internet big data acquisition system and method

Technical Field

The invention belongs to the technical field of big data, and particularly relates to a distributed internet big data acquisition system and method which can be used for artificial intelligence, big data and internet of things.

Background

In recent years, the development of technologies such as artificial intelligence, blockchain and the Internet of Things has required the support of large volumes of data. The internet holds a vast amount of data, so accurately collecting the usable data on it is a challenging problem. Conventional web data collection is based on web crawler technology. A web crawler is an automated program that captures, extracts and collects data on the internet, typically by automatically accessing a website, capturing the data on it and storing that data on a local computer. Web crawler technology mainly implements three functions: fetching web pages to collect data; parsing the page data to extract the required content; and saving the extracted data to a suitable database. The data captured by the crawler is stored locally for subsequent processing and analysis.

Currently, crawler technology focuses mainly on three aspects. The first is distributed crawler technology: a distributed crawler system decomposes one crawler task into a plurality of subtasks and distributes them to different machines for execution, which improves the efficiency and throughput of the crawler and avoids a single point of failure. The second is dynamic web page crawler technology, which emphasizes content acquisition: a dynamic web crawler is a specialized crawler that handles the capture and extraction of dynamic web pages, typically by using a browser simulator to imitate real user behavior. The third is deep crawler technology, which emphasizes the depth of the acquired data: compared with a general web crawler it goes deeper, mines more page information and can collect in depth according to the website structure. When it collects a page, it typically continues to collect all pages linked from that page until the deepest level of the website is reached. Deep crawler technology can therefore acquire more page information and analyze the structure of a website more comprehensively.

Patent document CN110489698A discloses a system and method for automatically collecting web page data, which adds a script engine module and a process control module on top of an embedded browser to collect web page data automatically. However, this system neither limits the access frequency to the target website nor uses proxies. A website's anti-crawling mechanism monitors access frequency and IP addresses; when it detects a large number of requests from the same IP address within a short time, it treats them as machine behavior and blocks access, so that no further data can be collected.

Patent document CN110442766A discloses a method, an apparatus, a device and a storage medium for acquiring web page data, in which different script templates are preset for different data acquisition requirements. However, website structures change frequently; once a page structure changes, all related templates must be modified, which makes later maintenance difficult.

In addition, although the above technical solutions can acquire dynamic web page data and deep web page data, the collected content is limited to the current page or to the links found in the current page, so the flexibility and extensibility of the collected content are low.

Disclosure of Invention

The invention aims to provide a template-based distributed internet big data acquisition system and method, which are used for solving the problems in the prior art, improving the safety and stability of acquired data and improving the flexibility and expandability of acquisition.

In order to achieve the purpose, the technical scheme of the invention is as follows:

1. A template-based distributed internet big data acquisition system, characterized by comprising a system function management unit and a task execution unit.

The system function management unit comprises:

the template management module is used for managing each acquisition template in the system, including adding, deleting, modifying, querying, importing and exporting templates, and for providing a template testing function and a template nesting function;

the proxy management module is used for managing the proxies required by acquisition tasks, including creating, deleting, importing and exporting proxies and setting the request interval of each proxy;

the task management module is used for generating and storing a data acquisition task according to the existing data and related settings of the template management module and the proxy management module;

the node management module is used for setting and managing the whole node, monitoring node information and browsing the acquired data;

the task execution unit includes:

the main logic processing module is used for executing the tasks generated by the task management module according to the relevant settings;

the exception handling module comprises a proxy exception handling sub-module and a template exception handling sub-module, and is used for entering the appropriate exception handling branch according to the exception type when a problem occurs during processing by the main logic processing module;

and the post-processing module is used for the subsequent processing and storage of the acquired data when the main logic processing module executes without exception.

Further, the template management module performs template testing as follows: first, a template is created, imported or modified, and the variables required to run the template are configured; the configured template information is then passed to the main logic processing module, which executes a data acquisition task; when the main logic processing module finishes, it returns the result to the template management module, which displays it briefly: if the execution succeeded, the template management module displays the acquired data; if the execution failed, it displays the exception information and the template is modified again. The whole process is repeated until the template acquires the expected content, and the template is then stored for later use.

Further, the template management module performs template nesting when a template is newly created, that is, an existing template is called from the current template; the two templates share task variable settings, and joint acquisition of content from multiple websites is achieved by configuring different acquisition websites in the templates.

Further, the proxy management module sets the request interval for a proxy by assigning a specific access time interval to each access request according to a given number of requests per second, so that the requests sent in a single acquisition task follow a Poisson distribution and the intervals between requests follow an exponential distribution. Specifically, the average request interval is calculated from the number of requests per second, and an exponential distribution curve is constructed with this average interval as its expectation; random sampling is then performed on this curve to assign an access time interval to each request.

Further, the main logic processing module executes the data acquisition logic of the task, which is the main flow of the acquisition task: according to the template of the task and its variable information, it uses the proxy configured for the task to execute the template task at the set execution time and at the proxy's request frequency.

2. A method for collecting big data by using the above system, characterized by comprising the following steps:

S1) based on Docker technology, running the system deployment script file to deploy the data acquisition system:

deploying the basic Docker service by using a deployment script;

running the data storage node image of the data acquisition system on the basic Docker service to generate a corresponding container, thereby deploying the data storage node;

and running the acquisition node image of the data acquisition system to generate a corresponding container, thereby deploying the acquisition node.

S2) logging in to the data acquisition system: after the data acquisition system is deployed, entering the system management interface and performing the corresponding operations through the modules of the system function management unit:

S2a) in the template management module, creating a new data acquisition template or importing an existing template, and optionally testing the template once it is built;

S2b) in the proxy management module, creating or importing a proxy server and setting the number of requests per second for the proxy;

S2c) in the task management module, selecting the acquisition template to be executed and the proxy to be used, setting the task running time, configuring the task running variables, running the acquisition task once configuration is complete, and sending the task information to the task execution unit;

S3) the task execution unit executes the data acquisition task:

after receiving the acquisition task information, the main logic processing module executes the acquisition task according to the relevant configuration of the task, and performs different operations according to the execution condition during the whole execution process:

if the whole process executes normally, the acquired data is processed and stored by the post-processing module;

otherwise, the exception is handled by the exception handling module, and the data acquisition task is executed again after the exception handling is finished;

S4) the task execution unit transmits the acquired data to the node management module, which displays the task execution log, the node information during task execution and all the acquired data.

Compared with the prior art, the invention has the following advantages:

firstly, the template management module of the system provides a template testing function whose result is displayed directly, which makes it convenient for the relevant personnel to troubleshoot template problems, improves their working efficiency and reduces maintenance cost; meanwhile, since the module provides template nesting when a template is created, the flexibility of the content acquired by a single acquisition task is improved: whereas a single acquisition task with a traditional data acquisition template can only acquire content from one website, the invention enlarges the acquisition range of an acquisition task;

secondly, through the proxy management module the system provides a way of setting request time intervals that follow an exponential distribution, so that network access requests under normal conditions are simulated more accurately, the requests sent are prevented from being detected by the anti-crawling mechanism of the target website, the probability of being blocked by the target website is reduced, and the security and stability of the acquisition system are improved;

thirdly, the data storage nodes and the acquisition nodes are deployed by a deployment script based on Docker technology, so that the system is supported on every platform capable of running Docker, which greatly improves its portability;

fourthly, when the method of the invention deploys the data acquisition system, the data storage nodes and the acquisition nodes are deployed on intranet servers and public network servers respectively, which achieves higher data accessibility at lower cost and also obtains better network access performance.

Drawings

FIG. 1 is a block diagram of a template-based distributed Internet big data acquisition system of the present invention;

FIG. 2 is a flowchart of the template test operation of the template management module of FIG. 1;

FIG. 3 is a flow chart of an implementation of the template-based distributed Internet big data collection method of the present invention;

FIG. 4 is a schematic diagram of a data acquisition system deployed in the method of the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and examples.

Referring to fig. 1, the system of the present invention includes a system function management unit 1 and a task execution unit 2.

The system function management unit 1 is mainly used for managing templates and proxies, configuring data acquisition tasks and displaying node-related information, and comprises a template management module 11, a proxy management module 12, a task management module 13 and a node management module 14.

The template management module 11 is mainly used for managing each acquisition template in the system. It supports the basic operations of creating, deleting, modifying and querying templates, and it supports exporting the template information in a node to a text file; other nodes can import this text file through the template import function, so that templates are easy to migrate and reuse. Because the structure of a website may change frequently, the acquisition template corresponding to that website may occasionally become invalid; the template management module therefore provides a quick template testing function, so that the relevant staff can test the usability of a template more conveniently and work more efficiently. To overcome the limitation that a single run of a traditional data acquisition task can only acquire content from a single website, the module also provides a template nesting function. Template nesting is set up when a template is newly created: an execution step is added to the current template, the template nesting function is selected in the settings of that step, the function presents a list of the currently existing templates, and one template is selected from the list for nesting, so that one template calls another. At run time the current template and the nested template share task variable settings, and by configuring different acquisition websites in the templates, joint acquisition across multiple websites is achieved and the range of acquired content is expanded.
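The nesting behavior described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Step` and `Template` classes, the `run_template` function, the injected `fetch` callable and the example URLs are all hypothetical names chosen for illustration. The sketch only shows one template calling another while both share the same task variables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    # A step either fetches a URL built from the task variables,
    # or calls another existing template by name (template nesting).
    url_pattern: Optional[str] = None
    nested_template: Optional[str] = None


@dataclass
class Template:
    name: str
    steps: List[Step] = field(default_factory=list)


def run_template(name: str,
                 registry: Dict[str, Template],
                 variables: Dict[str, str],
                 fetch: Callable[[str], str],
                 _stack: Tuple[str, ...] = ()) -> List[str]:
    """Run a template; nested templates reuse the same `variables` dict."""
    if name in _stack:
        chain = " -> ".join(_stack + (name,))
        raise ValueError(f"cyclic template nesting: {chain}")
    results: List[str] = []
    for step in registry[name].steps:
        if step.nested_template is not None:
            results.extend(run_template(step.nested_template, registry,
                                        variables, fetch, _stack + (name,)))
        else:
            results.append(fetch(step.url_pattern.format(**variables)))
    return results


# Hypothetical templates for two different websites.
registry = {
    "news": Template("news", [
        Step(url_pattern="https://news.example/search?q={keyword}"),
    ]),
    "forum_plus_news": Template("forum_plus_news", [
        Step(url_pattern="https://forum.example/find?kw={keyword}"),
        Step(nested_template="news"),
    ]),
}


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP request so the sketch runs offline.
    return f"fetched:{url}"


pages = run_template("forum_plus_news", registry,
                     {"keyword": "docker"}, fake_fetch)
print(pages)
```

The cycle check is an addition of this sketch; the source does not say how, or whether, the system guards against a template nesting itself.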

The proxy management module 12 is mainly used for managing the proxies required by acquisition tasks; its operations include creating, deleting, importing and exporting proxies. To make proxy traffic better simulate normal network requests, the proxy management module also supports setting request time intervals that follow an exponential distribution. That is, a specific access time interval is assigned to each access request according to a given number of requests per second, so that the requests sent in a single acquisition task follow a Poisson distribution and the intervals between requests follow an exponential distribution. Specifically, the average request time interval is calculated from the number of requests per second, and an exponential distribution curve is constructed with this average interval as its expectation; random sampling is then performed on this curve to assign an access time interval to each request.
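The interval setting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the system's actual code: the function name `sample_request_intervals` and the `seed` parameter are hypothetical. For a rate of r requests per second the mean interval is 1/r seconds, and the standard-library call `random.expovariate(r)` draws from the exponential distribution with that mean.

```python
import random
from typing import List, Optional


def sample_request_intervals(requests_per_second: float,
                             n_requests: int,
                             seed: Optional[int] = None) -> List[float]:
    """Draw one waiting time (in seconds) per request.

    The intervals are exponentially distributed with mean
    1 / requests_per_second, so the request arrivals form a Poisson process.
    """
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    rng = random.Random(seed)
    # expovariate takes the rate (lambda), not the mean.
    return [rng.expovariate(requests_per_second) for _ in range(n_requests)]


# Example: 2 requests per second, i.e. an average interval of 0.5 s.
intervals = sample_request_intervals(2.0, 10000, seed=42)
mean_interval = sum(intervals) / len(intervals)
print(f"average sampled interval: {mean_interval:.3f} s")
```

In use, a collector would sleep for each sampled interval before sending the next request through the proxy; individual gaps vary widely while the long-run rate stays near the configured requests per second.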

The task management module 13 is mainly used for configuring the acquisition template, selecting the task proxy, specifying template variables and setting the task execution time. It is implemented as follows: the variables required by the template are configured according to the template in the template management module; an available proxy in the proxy management module is selected for the acquisition task; the execution time is set to either single execution or periodic execution; and finally the data acquisition task is generated and stored.
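As a rough illustration of what such a stored task might contain, the Python sketch below bundles the four settings named above. The `AcquisitionTask` class, its field names and the example values are assumptions made for this sketch; the source does not specify the task's data format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AcquisitionTask:
    template_name: str                   # template chosen from template management
    proxy: str                           # proxy chosen from proxy management
    variables: Dict[str, str]            # variables required by the template
    first_run: datetime                  # configured execution time
    period: Optional[timedelta] = None   # None means single execution

    def scheduled_runs(self, count: int) -> List[datetime]:
        """Return the first `count` run times (one entry for single execution)."""
        if self.period is None:
            return [self.first_run]
        return [self.first_run + i * self.period for i in range(count)]


# Hypothetical periodic task: run every 6 hours starting at 08:00.
task = AcquisitionTask(
    template_name="news",
    proxy="http://10.0.0.1:8080",
    variables={"keyword": "docker"},
    first_run=datetime(2023, 1, 1, 8, 0),
    period=timedelta(hours=6),
)
print(task.scheduled_runs(3))
```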

The node management module 14 is mainly used for setting and managing the whole node and monitoring node information. The monitored node information includes disk information, network information, executor information, task execution log information and all acquired data. Specifically, the module displays hardware information such as CPU (central processing unit) usage, memory usage and node bandwidth usage while the node is running, and system software information such as the number of successfully executed tasks and the total amount of data acquired over a period of time. It also provides a browsing function for all acquired data, which supports querying and filtering by information such as time period, acquisition task name and keyword.

The task execution unit 2 is mainly used for executing tasks, and includes a main logic processing module 21, an exception handling module 22, and a post-processing module 23.

The main logic processing module 21 is used for executing the data acquisition task passed from the task management module 13 and constitutes the main flow of the acquisition task: according to the template of the task and its variable information, it uses the proxy configured for the task to execute the template task at the set execution time and at the proxy's request frequency.

The exception handling module 22 includes a proxy exception handling sub-module 221 and a template exception handling sub-module 222, and is used for entering the appropriate exception handling branch according to the exception type when a problem occurs in the main logic processing module 21. When a proxy connection problem occurs, the proxy exception handling sub-module 221 first assigns a new available proxy to the current task and then performs periodic connection tests on the faulty proxy: if the proxy becomes available, its state is updated for subsequent use; if the proxy still cannot be used after prolonged testing, it is discarded and a corresponding prompt is given. When no data can be collected, the template exception handling sub-module 222 determines whether the cause is a proxy problem: if so, the problem is handled by the proxy exception handling sub-module 221; if not, an exception log is output to remind the relevant personnel to modify the template.
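The proxy branch of this logic can be sketched as a small state machine in Python. The `ProxyPool` class, its method names and the `max_failed_tests` threshold are hypothetical; in particular, the source only says the proxy is discarded "after prolonged testing" and gives no concrete limit, so the threshold here is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ProxyPool:
    available: List[str]
    max_failed_tests: int = 3  # assumed limit standing in for "prolonged testing"
    suspect: Dict[str, int] = field(default_factory=dict)  # proxy -> failed tests
    discarded: List[str] = field(default_factory=list)

    def replace(self, broken: str) -> str:
        """Set the broken proxy aside and hand the task another available one."""
        if broken in self.available:
            self.available.remove(broken)
        self.suspect.setdefault(broken, 0)
        if not self.available:
            raise RuntimeError("no available proxy left")
        return self.available[0]

    def retest(self, can_connect: Callable[[str], bool]) -> None:
        """Run one round of the periodic connection test on suspect proxies."""
        for proxy in list(self.suspect):
            if can_connect(proxy):
                # Recovered: update its state so it can be used again.
                del self.suspect[proxy]
                self.available.append(proxy)
            else:
                self.suspect[proxy] += 1
                if self.suspect[proxy] >= self.max_failed_tests:
                    del self.suspect[proxy]
                    self.discarded.append(proxy)
                    print(f"proxy {proxy} discarded after repeated failed tests")


pool = ProxyPool(["p1", "p2", "p3"], max_failed_tests=2)
replacement = pool.replace("p1")        # the task switches from p1 to p2
pool.replace("p2")                      # p2 fails too; the task moves to p3
pool.retest(lambda proxy: proxy == "p1")  # p1 recovers, p2 fails once
pool.retest(lambda proxy: False)          # p2 fails again and is discarded
```

A scheduler would call `retest` at a fixed period with a real connectivity check; here the check is passed in as a callable so the sketch runs without network access.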

The post-processing module 23 is used for subsequent processing of the acquired data, such as data cleaning and data formatting, and for storing the acquired data when the main logic processing module executes without exception.
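The source does not say which cleaning or formatting rules are applied, so the following Python sketch simply assumes three common ones: trimming whitespace, dropping empty records and removing duplicates, with each surviving record formatted as one JSON line. The function name `post_process` and the sample records are hypothetical.

```python
import json
from typing import Dict, List


def post_process(records: List[Dict[str, str]]) -> List[str]:
    """Clean raw records and format each one as a JSON line."""
    cleaned: List[str] = []
    seen = set()
    for record in records:
        # Cleaning: strip surrounding whitespace from string fields.
        row = {k: v.strip() if isinstance(v, str) else v
               for k, v in record.items()}
        if not any(row.values()):
            continue  # drop records with no content at all
        # Formatting: a canonical JSON line, which also serves as the dedup key.
        line = json.dumps(row, sort_keys=True, ensure_ascii=False)
        if line in seen:
            continue
        seen.add(line)
        cleaned.append(line)
    return cleaned


raw_records = [
    {"title": " A ", "url": "u1"},
    {"title": "A", "url": "u1"},    # duplicate after cleaning
    {"title": "  ", "url": ""},     # empty after cleaning
]
json_lines = post_process(raw_records)
print(json_lines)
```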

Referring to fig. 2, the template management module 11 performs the template test as follows:

firstly, a template is created, imported or modified, and the variables required to run the template are configured;

then, the configured template information is passed to the main logic processing module 21, which executes a data acquisition task;

finally, after the main logic processing module 21 finishes, the result is returned to the template management module 11, which displays it briefly according to the execution outcome: if the execution succeeded, the template management module displays the acquired data; if the execution failed, the template management module displays the exception information, and the template can be modified again;

the whole process is repeated until the template acquires the expected content, and the template is then stored for later use.

Referring to fig. 3, the method for acquiring big data by using the system of the present invention includes the following steps:

Step 1: Based on Docker technology, run the system deployment script file to deploy the data acquisition system.

Docker is software implemented in Go, the language introduced by Google. It packages and isolates processes on top of the Linux kernel and belongs to operating-system-level virtualization technology. A Docker image provides a complete runtime environment apart from the kernel, which ensures that the application's runtime environment is consistent: an application running on one platform can easily be migrated to another without concern that a change of environment will stop it from running normally. By customizing application images with Docker, continuous integration, continuous delivery and deployment can be achieved.

Referring to fig. 4, this step is implemented as follows:

1.1 Use the deployment script to deploy the basic Docker service;

1.2 Run the data storage node image of the data acquisition system on the basic Docker service and generate the corresponding container to deploy the data storage node. One data acquisition system may comprise one or more data storage nodes, each data storage node contains the storage services, and the data storage nodes are deployed on intranet servers so as to achieve higher data accessibility at lower cost;

1.3 Run the acquisition node image of the data acquisition system, generate the corresponding container, and deploy the acquisition node. A data acquisition system may include one or more acquisition nodes; the acquisition nodes may run different parts of the same task or run different tasks outright, and they are generally deployed on public network servers such as cloud servers to obtain better network access performance.

The data storage nodes and the acquisition nodes can also be deployed together on intranet machines or on public network machines, which broadens the usage scenarios of the system.
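The order of steps 1.2 and 1.3 can be sketched as a dry-run deployment plan in Python. The source does not publish its deployment script, so everything specific here is an assumption: the image names, container names and port mapping are hypothetical, and the sketch only prints the `docker run` command lines instead of executing them. The flags used (`-d`, `--name`, `-p`) are standard `docker run` options.

```python
from typing import Dict, List, Optional


def docker_run_command(image: str, name: str,
                       ports: Optional[Dict[int, int]] = None) -> List[str]:
    """Build the argv for starting one detached container from an image."""
    cmd = ["docker", "run", "-d", "--name", name]
    for host_port, container_port in (ports or {}).items():
        cmd += ["-p", f"{host_port}:{container_port}"]
    cmd.append(image)
    return cmd


# Hypothetical image names: the storage node first, then the acquisition node.
deployment_plan = [
    docker_run_command("example/storage-node:latest", "storage-node",
                       {5432: 5432}),
    docker_run_command("example/acquisition-node:latest", "acquisition-node"),
]
for argv in deployment_plan:
    print(" ".join(argv))
```

To actually deploy, each argv list could be passed to `subprocess.run(argv, check=True)` on a host where the Docker service from step 1.1 is already running.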

Step 2: Configure a data acquisition task.

After the data acquisition system is deployed, enter the system management interface and perform the corresponding operations through the modules of the system function management unit 1:

2.1 Prepare a data acquisition template: create a new data acquisition template or import an existing template through the template management module 11, and optionally test the template once it is built;

2.2 Prepare the acquisition task proxy: create or import a proxy server through the proxy management module 12, and set the number of requests per second for the proxy;

2.3 Configure the data acquisition task: select the acquisition template to be executed and the proxy to be used through the task management module 13, set the task running time, configure the task running variables, run the acquisition task once configuration is complete, and send the task information to the task execution unit 2.

Step 3: The task execution unit 2 executes the data acquisition task.

After receiving the collection task information, the main logic processing module 21 executes the collection task according to the relevant configuration of the task, and performs different operations according to the execution conditions in the whole execution process:

if the whole process is normally executed, the acquired data is processed and stored through the post-processing module 23;

otherwise, the exception is handled by the exception handling module 22, and the data acquisition task is executed again after the exception handling is finished.
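The control flow of step 3 can be sketched as follows. This is a simplified illustration, not the patented logic: the exception classes `ProxyError` and `TemplateError`, the `execute_task` function and the `max_attempts` cap are hypothetical, and the cap in particular is an assumption added so that the retry loop always terminates.

```python
from typing import Callable, Dict, List, Type


class ProxyError(Exception):
    """Hypothetical: raised when the proxy connection fails."""


class TemplateError(Exception):
    """Hypothetical: raised when the template collects no data."""


def execute_task(collect: Callable[[], List[str]],
                 post_process: Callable[[List[str]], List[str]],
                 handlers: Dict[Type[Exception], Callable[[Exception], None]],
                 max_attempts: int = 3) -> List[str]:
    """Run the main flow; on an exception, handle it and execute again."""
    for _ in range(max_attempts):
        try:
            raw = collect()
        except tuple(handlers) as exc:
            # Enter the handling branch that matches the exception type.
            for exc_type, handler in handlers.items():
                if isinstance(exc, exc_type):
                    handler(exc)
                    break
            continue  # exception handled: execute the task again
        # Normal execution: hand the data to post-processing.
        return post_process(raw)
    raise RuntimeError(f"task still failing after {max_attempts} attempts")


calls = {"n": 0}


def flaky_collect() -> List[str]:
    # Fails once with a proxy problem, then succeeds.
    calls["n"] += 1
    if calls["n"] == 1:
        raise ProxyError("connection refused")
    return [" a ", "b "]


handled: List[Exception] = []
result = execute_task(
    flaky_collect,
    lambda rows: [r.strip() for r in rows],
    {ProxyError: handled.append, TemplateError: handled.append},
)
print(result)
```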

Step 4: View the task execution information.

The task execution unit 2 transmits the collected data to the node management module 14, which displays the task execution log, the node information of the task execution process, and all the collected data.

The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention disclosed herein are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A template-based distributed internet big data acquisition system, characterized in that the system comprises a system function management unit (1) and a task execution unit (2);

the system function management unit (1) comprises:

the template management module (11) is used for managing each acquisition template in the system, including adding, deleting, modifying, importing and exporting templates, and for providing a template testing function and a template nesting function;

the proxy management module (12) is used for managing the proxies required by acquisition tasks, including creating, deleting, importing and exporting proxies and setting the request interval of each proxy;

the task management module (13) is used for generating and storing a data acquisition task according to the existing data and related settings of the template management module and the proxy management module;

the node management module (14) is used for setting and managing the whole node, monitoring node information and browsing the acquired data;

the task execution unit (2) comprises:

a main logic processing module (21) for executing the task generated by the task management module according to the relevant settings;

the exception handling module (22) comprises a proxy exception handling sub-module (221) and a template exception handling sub-module (222), and is used for entering the appropriate exception handling branch according to the exception type when a problem occurs during processing by the main logic processing module;

and the post-processing module (23) is used for the subsequent processing and storage of the acquired data when the main logic processing module executes without exception.

2. The system of claim 1, wherein: the template management module (11) performs template testing as follows: a template is created, imported or modified, and the variables required to run the template are configured; the configured template information is then passed to the main logic processing module, which executes a data acquisition task; when the main logic processing module finishes, the result is returned to the template management module, which displays it briefly:

if the execution is successful, the template management module displays the acquired data information;

if the execution fails, the template management module displays the exception information, the template is modified again, and the whole process is repeated until the expected content is acquired, after which the template is stored for later use.

3. The system of claim 2, wherein: the template management module (11) performs template nesting when a template is newly created, that is, an existing template is called from the current template; the two templates share task variable settings, and joint acquisition of content from multiple websites is achieved by configuring different acquisition websites in the templates.

4. The system according to claim 1, characterized in that the proxy management module (12) sets the request interval for a proxy by assigning a specific access time interval to each access request according to a given number of requests per second, so that the requests issued in a single acquisition task follow a Poisson distribution and the request intervals follow an exponential distribution; the implementation steps are: first, calculating the average request time interval from the number of requests per second and constructing an exponential distribution curve with this interval as its expectation; then performing random sampling on this curve to assign an access time interval to each request.

5. The system of claim 1, wherein: the data acquisition task constructed by the task management module (13) comprises configuration of an acquisition template, specification of a template variable and setting of task execution time.

6. The system of claim 1, wherein: the node information monitored by the node management module (14) comprises disk information, network information, executor information, task execution log information and acquired data.

7. The system according to claim 1, wherein the main logic processing module (21) executes the data acquisition logic of the task, which is the main flow of the acquisition task: according to the template of the task and its variable information, it uses the proxy configured for the task to execute the data acquisition task at the set execution time and at the proxy's request frequency.

8. The system according to claim 1, wherein the proxy exception handling sub-module (221) is configured to, when a proxy connection problem occurs, assign a new available proxy to the current task and then perform periodic connection tests on the faulty proxy:

if the proxy becomes available, its state is updated for subsequent use;

if the proxy still cannot be used after prolonged testing, the proxy is discarded and a corresponding prompt is given.

9. The system of claim 1, wherein the template exception handling sub-module (222), when no data is collected, determines whether the cause is a proxy problem:

if it is a proxy problem, the problem is handled by the proxy exception handling sub-module (221);

if it is not a proxy problem, an exception log is output to remind the relevant personnel to modify the template.

10. A method for acquiring big data by a template-based distributed Internet big data acquisition system is characterized by comprising the following steps:

deploying the basic Docker service by using a deployment script;

running the data storage node image of the data acquisition system on the basic Docker service, and generating a corresponding container to deploy the data storage node;

S2) logging in to the data acquisition system: after the data acquisition system is deployed, entering the system management interface and performing the corresponding operations through the modules of the system function management unit (1):

S2a) in the template management module (11), creating a new data acquisition template or importing an existing template, and optionally testing the template once it is built;

S2b) in the proxy management module (12), creating or importing a proxy server and setting the number of requests per second for the proxy;

S2c) in the task management module (13), selecting the acquisition template to be executed and the proxy to be used, setting the task running time, configuring the task running variables, running the acquisition task once configuration is complete, and sending the task information to the task execution unit (2);

S3) the task execution unit (2) executes the data acquisition task:

after receiving the acquisition task information, the main logic processing module (21) executes the acquisition task according to the relevant configuration of the task, and performs different operations according to the execution condition during the whole execution process:

if the whole process executes normally, the acquired data is processed and stored by the post-processing module (23);

otherwise, the exception is handled by the exception handling module (22), and the data acquisition task is executed again after the exception handling is finished;

S4) the task execution unit (2) transmits the acquired data to the node management module (14), which displays the task execution log, the node information during task execution and all the acquired data.