CN107317724B

CN107317724B - Data acquisition system and method based on cloud computing technology

Info

Publication number: CN107317724B
Application number: CN201710416326.6A
Authority: CN
Inventors: 刘刚; 谭焕云; 姜志刚; 黄元庆; 张振海
Original assignee: China Securities Credit Investment Co Ltd
Current assignee: China Securities Credit Investment Co Ltd
Priority date: 2017-06-06
Filing date: 2017-06-06
Publication date: 2020-12-11
Anticipated expiration: 2037-06-06
Also published as: CN107317724A

Abstract

The invention discloses a data acquisition system and method based on cloud computing technology. The system adopts a distributed, hierarchically cooperating and horizontally scalable asynchronous queue scheme and comprises a task scheduler, a generator, a downloader and a parser; the task scheduler schedules the generator, the downloader and the parser according to each data acquisition task to acquire the data related to that task; the generator generates, according to the scheduling of the task scheduler, the URI of each relevant page of each website to be acquired for the data acquisition task; the downloader downloads, according to the scheduling of the task scheduler, the original data corresponding to each such URI; the parser parses, according to the scheduling of the task scheduler, the original data downloaded by the downloader into structured data. The acquisition system can scale out elastically through rapid deployment according to task volume and load, quickly raising its load capacity.

Description

Data acquisition system and method based on cloud computing technology

Technical Field

The invention relates to the field of network information acquisition, in particular to a data acquisition system and method based on a cloud computing technology.

Background

With the rapid development of internet technology, the amount of information on the internet is growing exponentially, so that people must spend a great deal of effort searching for the information they need, and their desire to obtain that information at any time and in any place grows stronger. Search engines help people find the desired information quickly. However, the results returned by current search engines often contain a large amount of irrelevant content, so in the prior art web crawlers are also used to collect and capture resources. Although web crawler technology has developed rapidly and various web data acquisition platforms have emerged, most websites are continuously revised and their anti-crawling measures continuously updated, so web data acquisition technology still needs further study and improvement. Existing crawler tools are often developed specifically for a particular application environment and particular site characteristics; they respond poorly to changes such as website revisions, environment changes and growing requirements, scale poorly, and cannot meet the differing needs of different users. In particular, for companies that rely on data and have public opinion monitoring requirements, the timeliness of data collection directly determines the effectiveness and value of the data; at the same time, massive numbers of sites must be monitored and the occurrence, progress and outcome of events tracked. Measured against these needs, traditional web crawlers demand highly skilled personnel, are slow to develop, have low general applicability and high development, operation and maintenance costs, and cannot satisfy the requirements of massive monitoring, high timeliness and large data volume.

Disclosure of Invention

Aiming at the defects of the prior art, the technical problem to be solved by the invention is to provide a distributed data acquisition system and method based on cloud computing technology that is configurable and highly stable, can scale system performance horizontally by adding machines within minutes, has a built-in multi-task system, and supports flexible task-level scheduling and one-key deployment, operation and maintenance. In addition, the behavior of the acquisition layer, the analysis layer and the storage layer is monitored and managed through a visual interface, the data acquisition frequency is effectively controlled, and the running state of the data acquisition system is monitored.

In order to solve the above technical problem, the invention adopts the following technical scheme: a distributed data acquisition system based on cloud computing technology, comprising a task scheduler, a generator, a downloader, a parser and a plurality of feature plug-ins;

the task scheduler is used for scheduling the generator, the downloader and the parser according to each data acquisition task so as to acquire data related to each data acquisition task;

the generator is used for generating, according to the scheduling of the task scheduler, the URI of each relevant page of each website to be acquired for the data acquisition task;

the downloader is used for downloading, according to the scheduling of the task scheduler, the original data corresponding to the URI of each relevant page of each website to be acquired;

the parser is used for parsing, according to the scheduling of the task scheduler, the original data downloaded by the downloader into structured data and exporting the structured data to the data storage module.

The feature plug-ins comprise special-purpose functional modules such as verification code recognition, IP proxy, OCR recognition and NLP processing.

Further, the system comprises a generator customization module, with which a user defines in advance a generation rule for the generator according to the characteristics of the website pages to be acquired, so that the generator can generate the URIs of the website to be acquired according to that rule.

Further, the system comprises a downloader customization module, with which a user defines in advance a download rule for the downloader according to the characteristics of the website to be acquired, so that the downloader can download the public data corresponding to the URIs of the website to be acquired according to that rule.

Further, the system comprises a parser customization module, with which a user defines in advance a parsing rule for the parser according to the characteristics of the website pages to be acquired, so that the parser can parse the public data of the website to be acquired according to that rule.

Further, the system comprises a data storage module, which is used for storing the custom rules related to the data acquisition task; for storing the URIs, generated by the generator, of the relevant pages of each website to be acquired for the data acquisition task, for the downloader to call; for storing the original data of each website to be acquired downloaded by the downloader, for the parser to call; and for storing the structured data produced by the parser.

Further, the system comprises an acquisition layer, a structured layer and a storage layer, wherein the storage layer comprises the data storage module, the acquisition layer comprises the generator and the downloader, and the structured layer comprises the parser and a deduplication device.

The system layer is used for managing each data acquisition task and the subtasks corresponding to each website, as well as system management and authority management. System management comprises plug-in management, machine management, proxy management, storage management, alarm management, OCR management and NLP management; authority management comprises role authority management, user management, group authority, authority list and operation log management; and subtask management is used for visually checking data such as the progress and error logs of each subtask, to facilitate monitoring, operation and maintenance.

Further, the task scheduler schedules the generator, the downloader and the parser for the acquisition task of each website in order of the priority of the websites related to the data acquisition task, from highest to lowest, performing the scheduling and control of the whole process and completing the data acquisition of the corresponding websites in that order, thereby realizing the scheduling and data acquisition of the corresponding data acquisition task.
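The priority-ordered scheduling described above can be illustrated with a minimal sketch. It assumes that a lower number means a higher priority and that the generator, downloader and parser are plain callables; the names `SiteTask`, `run_by_priority` and the `demo_*` stubs are hypothetical and do not come from the patent.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class SiteTask:
    priority: int                      # assumption: lower number = higher priority
    site: str = field(compare=False)   # excluded from ordering


def run_by_priority(tasks, generate, download, parse):
    """Run generate -> download -> parse per website, highest priority first."""
    heap = list(tasks)
    heapq.heapify(heap)
    results = []
    while heap:
        task = heapq.heappop(heap)                 # next-highest-priority website
        uris = generate(task.site)                 # generator stage
        raw_pages = [download(u) for u in uris]    # downloader stage
        results.append((task.site, [parse(r) for r in raw_pages]))  # parser stage
    return results


# Stub stages standing in for real workers (illustrative only).
def demo_generate(site):
    return [f"http://{site}/page/{i}" for i in (1, 2)]


def demo_download(uri):
    return f"<raw {uri}>"


def demo_parse(raw):
    return {"length": len(raw)}


tasks = [SiteTask(2, "b.example"), SiteTask(1, "a.example"), SiteTask(3, "c.example")]
results = run_by_priority(tasks, demo_generate, demo_download, demo_parse)
order = [site for site, _ in results]
```

In this sketch the websites are processed strictly one after another; the patent's distributed variant would dispatch the stages to queues instead.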

Further, the task scheduler includes a generation task scheduler, which is configured to obtain the scheduling configuration information for URI generation of each website page to be acquired that is related to the data acquisition task, so as to obtain the scheduling time sequence of each website to be acquired, and to activate distribution work according to the scheduling time sequence, so that the corresponding generator for each website page to be acquired is inserted into a task distribution queue in order, and a process is started to execute the corresponding generator, so that the generator generates the URIs of the corresponding website according to its generation rule.

Further, the task scheduler includes a download task scheduler, which is configured to obtain the scheduling configuration information for downloading each website page to be acquired that is related to the data acquisition task, so as to obtain the scheduling time sequence of each website page to be acquired, and to activate distribution work according to the scheduling time sequence, so that the corresponding downloader for each website page to be acquired is inserted into a download task distribution queue in order, and a process is started to execute the corresponding downloader, so that the downloader downloads the original data related to the URIs of the corresponding website pages.

Further, the task scheduler includes a parsing task scheduler, which is configured to obtain the scheduling configuration information for parsing the downloaded data of each website page to be acquired that is related to the data acquisition task, so as to obtain the scheduling time sequence of each website page to be acquired, to activate distribution work according to the scheduling time sequence, and to start a process to execute the corresponding parser, so that the parser parses the original data of the corresponding website.

In order to solve the technical problems, the invention adopts a technical scheme that: the website public data acquisition method comprises the following steps:

loading a data acquisition task;

scheduling a generator according to the data acquisition task to generate a URI of each website page to be acquired corresponding to the data acquisition task;

scheduling a downloader according to the generated URI of the website pages to be acquired so as to download original data corresponding to the URI of each website page to be acquired;

and scheduling a parser according to the original data, downloaded by the downloader, corresponding to the URI of each website page to be acquired, so as to parse the public data downloaded by the downloader and store it in structured form.

Further, before the step of loading the data acquisition task, the method further includes:

configuring corresponding generator parameters or a generation rule script according to the characteristics of the website to be acquired, so that the generator can generate the URIs of the website pages to be acquired according to the configured parameters or generation rule;

configuring corresponding downloader parameters according to the characteristics of the website to be acquired and the data acquisition requirements, so that the downloader can download the public data corresponding to the URIs of the website pages to be acquired according to the configured parameters;

and configuring corresponding parser parameters or a parsing rule script and a data export configuration script according to the characteristics of the website to be acquired, so that the parser can parse the public data of the website pages to be acquired according to the configured parameters or parsing rule and store the public data in structured form.

The data acquisition system and method based on cloud computing technology are designed around a distributed, hierarchically cooperating and horizontally scalable asynchronous queue scheme. They support many types of data acquisition, multithreaded and multiprocess operation on a single machine, one-key deployment, operation and maintenance, and custom crawler script configuration. The result is a one-stop general-purpose crawler development and data acquisition platform that can support petabyte-level (P-level) data capture and minute-level update granularity, and that allows developers to add special-purpose extension plug-ins. On the one hand, the system helps developers quickly build a powerful data acquisition system; on the other hand, it greatly improves data acquisition efficiency and scalability, and can meet the needs of applications with strict real-time requirements, complicated target websites and large acquisition volumes. In addition, coupling and system complexity are reduced and faults can be localized to a specific layer, which facilitates error tracking and customized alarm prompts and further improves the stability of the system. The system can scale out elastically through rapid deployment according to task volume and load, quickly raising its load capacity.

Drawings

Fig. 1 is a block diagram of an embodiment of a data acquisition system based on cloud computing technology according to the present invention.

Fig. 2-1 is a block diagram of an embodiment of a data acquisition system based on cloud computing technology according to the present invention.

Fig. 2-2 is a hierarchical division diagram of an embodiment of the data acquisition system based on cloud computing technology.

FIG. 3 is a generator design diagram in an embodiment of the data acquisition system based on cloud computing technology.

Fig. 4 is a design diagram of a downloader in an embodiment of the data acquisition system based on the cloud computing technology.

FIG. 5 is a design diagram of a parser in an embodiment of a data collection system based on cloud computing technology.

FIG. 6 is a diagram of a memory design in an embodiment of a data acquisition system based on cloud computing technology.

FIG. 7 is a flowchart illustrating a website data collection method according to an embodiment of the present invention.

FIG. 8 is a flowchart illustrating a website data collection method according to another embodiment of the present invention.

FIG. 9 is a flowchart illustrating generator scheduling in an embodiment of the website data collection method of the present invention.

FIG. 10 is a flowchart of URI generation for a website data collection method of the present invention.

FIG. 11 is a flowchart illustrating scheduling of a downloader in an embodiment of the website data collection method of the present invention.

FIG. 12 is a flowchart illustrating downloading according to an embodiment of the website data collection method of the present invention.

FIG. 13 is a flow chart of an early warning mechanism in an embodiment of a website data collection method of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, the data acquisition system based on cloud computing technology of the present invention includes a task scheduler, a generator, a downloader and a parser. It can optionally include feature plug-ins such as verification code recognition, IP proxy, OCR (Optical Character Recognition) and NLP (Natural Language Processing). Users (enterprises or individuals) of the data acquisition system can also choose to customize other feature plug-ins according to their own requirements. Wherein:

The task scheduler is used for scheduling and controlling the whole process of generation, downloading and parsing for each data acquisition task; specifically, it schedules the generator, the downloader and the parser according to each data acquisition task so as to acquire the data related to that task.

Each data acquisition task (JOB) corresponds to at least one website to be acquired, and each website to be acquired corresponds to at least one sub-acquisition task.
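The JOB, website and sub-task relationship can be pictured as a small data model. The sketch below is only one possible representation; the class names `Job` and `Website`, the field names and the `problems` check are hypothetical and are not specified in the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Website:
    domain: str
    subtasks: List[str] = field(default_factory=list)   # sub-acquisition task names


@dataclass
class Job:
    name: str
    websites: List[Website] = field(default_factory=list)

    def problems(self):
        """List violations of the 'at least one' rules stated in the text."""
        issues = []
        if not self.websites:
            issues.append(f"job {self.name!r} has no website")
        for site in self.websites:
            if not site.subtasks:
                issues.append(f"website {site.domain!r} has no sub-task")
        return issues


job = Job("bond-news", [
    Website("a.example", ["list-pages", "detail-pages"]),
    Website("b.example"),            # deliberately missing its sub-tasks
])
issues = job.problems()
```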

The generator is used for generating, according to the scheduling of the task scheduler, the URI of each relevant page of each website to be acquired for the data acquisition task. The generator can include, but is not limited to, a Python generator, a Shell generator and the like, and also supports importing a URI list file; it is designed mainly according to the acquisition requirements and the characteristics of different websites. It is emphasized that the term "website characteristics" as used herein includes the page request method (GET, POST, etc.), verification codes, encoding format, whether IP addresses are blocked, JS, Cookies and the like.
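As an illustration of what a Python generator script for a paginated listing might look like, here is a minimal sketch. The function name `paged_uris`, the `page` query parameter and the example domain are assumptions made for the example; the patent does not prescribe a script interface.

```python
from urllib.parse import urlencode


def paged_uris(base, pages, **params):
    """Yield one listing-page URI per page number (hypothetical generation rule)."""
    for page in range(1, pages + 1):
        # Fixed query parameters first, then the page counter.
        query = urlencode({**params, "page": page})
        yield f"{base}?{query}"


uris = list(paged_uris("http://news.example.com/list", 3, category="finance"))
```

A site that uses POST requests or cookies would need a richer rule than a URI template; this sketch covers only the simple GET case.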

The downloader is used for downloading, according to the scheduling of the task scheduler, the public data corresponding to the URI of each relevant page of each website to be acquired. The downloader obtains and stores the website public data according to the URIs generated by the generator. Corresponding to the types of current mainstream websites, the downloader includes but is not limited to: an HTTP downloader, an HTTPS downloader, a dynamic page downloader and an IP masquerading downloader.

The parser is used for parsing, according to the scheduling of the task scheduler, the public data downloaded by the downloader into structured data. Specifically, the parser parses the original data downloaded by the downloader, converts the unstructured data into formatted data and stores it, and supports exporting the data to the data storage module. The parser includes but is not limited to standard parsers for data sources of file types such as HTML, JSON, PDF, Image and Binary.

According to the embodiment of the invention, the whole process and the function definitions of website data acquisition and crawling are modularized; the independent modules cooperate with one another with a clear division of labor, and the task scheduler monitors the whole process of generation, downloading and parsing.

Referring to fig. 2-1, which is a block diagram of an embodiment of a data acquisition system based on cloud computing technology according to the present invention, the data acquisition system of this embodiment comprises a task scheduler, a generator, a downloader and a parser with the same or similar structures or functions as in the above embodiment. The embodiment further includes a data storage module, which is configured to store the URIs, generated by the generator, of the relevant pages of each website to be acquired for the data acquisition task, for the downloader to call; to store the public data of each website to be acquired downloaded by the downloader, for the parser to call; and to store the structured data produced by the parser.

The system architecture consists of 3 parts: scheduler, Worker and database, as shown in the following table:

In this embodiment, the data storage module includes a MongoDB cluster service, a MySQL cluster service and NFS (Network File System). It stores both large and small files, partitions storage by business and by shard, supports horizontal expansion with capacity expandable to the petabyte (P) level, and records file locations in detail to facilitate retrieval.

After generating the URIs, the generator stores them in the database: task distribution is driven by crontab, the generator task function is distributed into the asynchronous queue according to the scheduling strategy, and a worker of the asynchronous queue (the generator, the downloader and the parser are collectively called workers; the same applies hereinafter) grabs the task, executes the generator script, applies deduplication filtering to the URIs produced, and then stores the final URIs in the database. The logs of this process are stored in a generation log, an error log and an alarm log respectively.
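The dispatch-and-worker flow for the generator can be sketched in a single process. In this sketch `queue.Queue` stands in for the asynchronous queue, a dictionary stands in for the database, and `dispatch` plays the role of the crontab-triggered distributor; all names are hypothetical, and writing an alarm entry on every error is an assumption.

```python
import queue
import threading

task_queue = queue.Queue()                      # stand-in for the asynchronous queue
database = {"uris": []}                         # stand-in for the URI store
logs = {"generation": [], "error": [], "alarm": []}
db_lock = threading.Lock()


def dispatch(generator_funcs):
    """What a crontab-triggered dispatcher might do: enqueue generator tasks."""
    for func in generator_funcs:
        task_queue.put(func)


def worker():
    """Grab tasks, run the generator script, deduplicate, store, and log."""
    while True:
        func = task_queue.get()
        if func is None:                        # sentinel: shut this worker down
            task_queue.task_done()
            break
        try:
            for uri in func():
                with db_lock:
                    if uri not in database["uris"]:      # deduplication filter
                        database["uris"].append(uri)
                        logs["generation"].append(f"stored {uri}")
        except Exception as exc:
            with db_lock:
                logs["error"].append(repr(exc))
                logs["alarm"].append(f"generator {func.__name__} failed")
        finally:
            task_queue.task_done()


def site_a():
    return ["http://a.example/1", "http://a.example/2", "http://a.example/1"]


def site_b():
    raise RuntimeError("generator script failed")


threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
dispatch([site_a, site_b])
for _ in threads:
    task_queue.put(None)                        # one sentinel per worker
task_queue.join()
for t in threads:
    t.join()
```

The failing `site_b` task is logged and the workers carry on with the remaining tasks.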

The downloader is driven by a crontab timed command to obtain the URIs newly generated by the generator; the downloader task function and the URIs are distributed together into the asynchronous queue according to the scheduling strategy, a worker grabs a task from the asynchronous queue and executes it, and the execution result is stored in the database. The logs of this process are stored in a generation log, an error log and an alarm log respectively. When the downloader works, it obtains from the database the URIs stored there by the generator.

When the parser parses the downloaded original data, it structurally parses the original data that the downloader stored in the database, extracts the target data, and stores the parsing result in the database.

The invention adopts an asynchronous queue implementation, and the result data produced by the generator, the downloader and the parser are all stored in the database. This reduces the coupling between the functional modules; the task scheduler is responsible for resource allocation in the distributed multi-task system. The result is a crawler framework with distributed hierarchical cooperation, horizontal expansion and asynchronous queues, in which the system can rapidly deploy functional modules through cloud computing technology according to task volume and load, achieving distributed elastic expansion and quickly raising the load capacity of the system. Even if the worker on one node fails, the work of the other workers is not affected.

In another embodiment, the invention also discloses a data acquisition system based on cloud computing technology. In this embodiment, the data acquisition system includes a task scheduler, a generator, a downloader and a parser that are the same as or similar in structure or function to those of the above embodiment, optionally includes a database that is the same as or similar in structure or function to that of the above embodiment, and further includes a generator customization module, a parser customization module and/or a downloader customization module. Specifically:

The generator customization module is used for a user to define in advance a generation rule for the generator according to the characteristics of the website pages to be acquired, so that the generator can generate the URIs of the website to be acquired according to that rule.

The generator customization module lets each user (enterprise, individual and the like) of the data acquisition system configure and thus customize the generator according to the user's own circumstances and the characteristics of the websites to be acquired. In the traditional approach the crawler cannot be modified, and a separate generation script must be developed for each website to be crawled, which is time-consuming and labor-intensive. With this scheme, only some variables, parameters or custom generation rule scripts in the generator need to be adjusted, and no additional large-scale development is required.

In this embodiment, the data storage module is further configured to store a custom rule related to the data collection task.

FIG. 3 is a design diagram of the generator, which supports a Python generator, a Shell generator, manually entered URIs and the like. The generator customization module can take the form of the Python generator, Shell generator and so on shown in the figure; a crawler developer selects the appropriate generator type according to the development requirements and the characteristics of the website, with reference to the development documentation and development examples. According to the characteristics of the chosen generator, the developer modifies and configures its parameters or writes a custom generation rule script, yielding a generator tailored to the developer's own situation. If manual URI entry is selected, only the URI file needs to be uploaded. When a user (crawler developer) configures generator parameters or customizes a generator, the user first investigates the target website to be crawled, mainly to determine the characteristics of the target website and the format of its data, and then configures the generator, that is, defines the generator rule, so that the URIs of the target website pages can be generated according to the rule in the subsequent generation process and output in a fixed format. The common format is as follows:

The URIs manually entered by the user, or the uploaded Python and Shell files, are transmitted to the MongoDB database; task distribution is driven by crontab, the generator task function is distributed into the asynchronous queue according to the scheduling strategy, and a worker of the asynchronous queue grabs the task and executes the generator script. In this embodiment, the generator further includes a validity checking module, a first deduplication device, a download node management module and the like. The validity checking module is used for checking the validity of the custom URI generation script and determining whether the generated URIs conform to the rules, for example whether a URI carries the HTTP scheme. The first deduplication device is used for filtering out repeated URIs, so that two identical URIs are not generated and system resources are not wasted. The download node management module manages the download nodes to which URIs are assigned, that is, the nodes that carry the downloader function; it is in effect a system-level function.
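The validity check and the first duplicate filter could be combined roughly as follows. This is a sketch under the assumption that "valid" means an `http` or `https` scheme plus a host; the function names `is_valid_uri` and `filter_generated` are hypothetical.

```python
from urllib.parse import urlparse


def is_valid_uri(uri):
    """Assumed validity rule: http/https scheme and a non-empty host."""
    parts = urlparse(uri)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def filter_generated(uris):
    """Reject invalid URIs, then drop repeats within one generator run."""
    seen = set()
    accepted, rejected = [], []
    for uri in uris:
        if not is_valid_uri(uri):
            rejected.append(uri)          # would go to the error log
        elif uri not in seen:
            seen.add(uri)
            accepted.append(uri)          # first occurrence is kept
    return accepted, rejected


accepted, rejected = filter_generated([
    "http://a.example/1",
    "a.example/2",                        # missing scheme
    "http://a.example/1",                 # duplicate
    "https://a.example/3",
])
```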

In this embodiment, the generator is further configured to store all logs in the process of generating the URI, including but not limited to a generation log, an error log, an alarm log, and the like.

For ordinary website acquisition tasks, the downloader can download data normally with the default configuration or a simple configuration. For some special websites the default downloader cannot download the data, and the downloader module needs to be customized or further developed. The downloader customization module is used for a user to customize in advance a corresponding downloader according to the characteristics of the website to be acquired, so that the customized downloader can download the data corresponding to the URIs of the website to be acquired according to the corresponding rule. The downloaders include but are not limited to an HTTP downloader, an HTTPS downloader, a dynamic page downloader, an IP masquerading downloader and a customized downloader, and typically require no configuration. Customized configuration is generally done by clicking one of the downloaders in fig. 4 as required and modifying its configuration parameters; for some complex websites, the customized downloader must be selected and deeper secondary development carried out with reference to the development documentation and development examples.

In this embodiment, the downloader further includes a second deduplication device, which is configured to deduplicate identical URIs in the database a second time. The reason for this design is as follows: because the embodiment adopts a distributed, cooperative and horizontally scalable asynchronous queue scheme, the first deduplication device in the generator only handles repeated URIs within the task corresponding to one website, whereas the tasks stored in the database often correspond to more than one website, so several tasks may contain the same URI of the same website, and the downloader needs to deduplicate identical URIs again when downloading.
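The need for a second, cross-task duplicate filter at download time can be shown with a small sketch. Hashing the URI with SHA-1 to form the lookup key and keeping the keys in an in-memory set are assumptions made for the example, as is the class name `DownloadDeduplicator`; a distributed deployment would need a store shared by all workers.

```python
import hashlib


class DownloadDeduplicator:
    """Remembers every URI already handed to a downloader, across all jobs."""

    def __init__(self):
        self._seen = set()

    def should_download(self, uri):
        key = hashlib.sha1(uri.encode("utf-8")).hexdigest()   # fixed-size key
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


# URIs as stored by two different tasks; each task was deduplicated on its own,
# yet the same page still appears twice across tasks.
stored = [
    ("job-1", "http://a.example/x"),
    ("job-2", "http://a.example/x"),
    ("job-2", "http://b.example/y"),
]
dedup = DownloadDeduplicator()
to_download = [uri for _, uri in stored if dedup.should_download(uri)]
```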

The parser customization module is used for a user to define in advance a parsing rule for the parser according to the data characteristics of the website pages to be acquired, so that the parser can parse the public data of the website to be acquired according to that rule. In this embodiment, the parsers include an HTML parser, an NLP parser, a PDF parser, a JSON parser, an OCR parser, a customized parser and the like. After clicking the required parser, the user clicks the customize-parser function button to configure the parser parameters or define custom parsing rules. In other embodiments, the user may also click the corresponding parser to go directly to configuring the parser parameters.

The parser is dedicated to parsing the original data downloaded by the downloader, converting the unstructured data into formatted data and storing it in the database; the parser also supports exporting the data to files in other formats.

The parser parses content from the download results of the downloader. In essence, the page source code downloaded by the downloader is parsed using Python's Beautiful Soup package, regular expression matching or similar methods, and the required structured data is output. Users can configure the parser according to their own needs and the characteristics of the website, so that the parser outputs structured data according to the configured parsing rules, for example rules that parse the downloaded data of HTML-type websites into a particular format. In general, JSON-like data requires no rules, while HTML data requires certain parsing rules. Configuration of the export script: the parsing result of the parser is exported to a database; the configuration mainly comprises the target database settings, the configuration information of the corresponding table, and the mapping between the data and the database table fields.
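A parsing rule set and an export configuration of the kind described above might look like the following sketch. It uses regular expression matching, one of the two methods the text names, and substitutes an in-memory SQLite database for the target database so that the example is self-contained. The rule names, the `articles` table, the column names and the sample page are all hypothetical.

```python
import re
import sqlite3

# Hypothetical parsing rules: one regular expression per output field.
PARSE_RULES = {
    "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "date": re.compile(r'<span class="date">(.*?)</span>', re.S),
}

# Hypothetical export configuration: target table and field-to-column mapping.
EXPORT_CONFIG = {
    "table": "articles",
    "field_map": {"title": "headline", "date": "published_on"},
}


def parse_page(html):
    """Apply every rule to the page source; missing fields become None."""
    record = {}
    for name, pattern in PARSE_RULES.items():
        match = pattern.search(html)
        record[name] = match.group(1).strip() if match else None
    return record


def export_record(conn, record, config):
    """Insert one parsed record using the configured column mapping."""
    columns = [config["field_map"][name] for name in record]
    placeholders = ", ".join("?" for _ in columns)
    sql = f'INSERT INTO {config["table"]} ({", ".join(columns)}) VALUES ({placeholders})'
    conn.execute(sql, list(record.values()))


conn = sqlite3.connect(":memory:")     # SQLite stands in for the target database
conn.execute("CREATE TABLE articles (headline TEXT, published_on TEXT)")
page = '<html><h1> Bond ratings update </h1><span class="date">2017-06-06</span></html>'
record = parse_page(page)
export_record(conn, record, EXPORT_CONFIG)
rows = conn.execute("SELECT headline, published_on FROM articles").fetchall()
```

Regular expressions are brittle against markup changes; for real HTML a proper parser such as Beautiful Soup, mentioned in the text, is usually the more robust choice.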

As an alternative, referring to fig. 2-2, the data acquisition system of this embodiment can be divided into an acquisition layer, a structured layer, a storage layer and a system layer, and further includes a display layer. The acquisition layer is used for acquiring source data and providing it to the structured layer for further processing. Combined with cloud computing technology, the acquisition layer can support expansion to more than 1000 machines, supports mainstream URI schemes (such as HTTP and HTTPS) and extension to new URI protocols, has a comprehensive module for countering anti-crawler measures (dynamic page capture, verification code recognition, automatic proxy switching, random request header masquerading and the like), and supports functions such as error tracking, re-capture after errors and failures, alarm prompts and threshold configuration. The acquisition layer mainly comprises the generator and the downloader:

Generator: as shown in fig. 3, URIs entered manually by the user, or generated by uploaded files such as Python or Shell scripts, are first stored in MongoDB. Task distribution is driven by crontab: generator task functions are distributed into the asynchronous queue according to the scheduling policy, a worker of the asynchronous queue picks up a task, executes the generator script, deduplicates the URIs produced by the execution, and then stores the final URIs in the database. The logs produced in this process are stored in the generation log, the error log and the alarm log respectively.

Downloader: as shown in fig. 4, the downloader consists of an HTTP/HTTPS downloader, a customized downloader, an IP masquerading downloader and a dynamic page downloader.
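As a non-limiting illustration of the generator step in the acquisition layer, the following sketch runs a simple generation rule and deduplicates its output before storing it. The URI template, the `example.com` addresses and the use of an in-memory set in place of the database are assumptions made for illustration only.

```python
def generate_uris(template, pages):
    """Run a simple generation rule: fill a page number into a URI template."""
    return [template.format(page=page) for page in pages]


def store_new_uris(candidates, stored):
    """Deduplicate candidates against already stored URIs, then store the rest."""
    fresh = []
    for uri in candidates:
        if uri not in stored:
            stored.add(uri)
            fresh.append(uri)
    return fresh


# A set stands in for the URI collection kept in the database.
stored_uris = {"https://example.com/list?page=1"}
candidates = generate_uris("https://example.com/list?page={page}", range(1, 4))
fresh_uris = store_new_uris(candidates, stored_uris)
print(fresh_uris)
```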

The structured layer supports source data in most formats (including HTML, JSON, PDF, pictures and the like), natural language processing and the structuring of picture files (OCR); it supports horizontal scaling of parser nodes and implements data deduplication. A plug-in interface is reserved so that developers can conveniently write plug-ins against the interface to parse specific source data. The layer provides parsing result logs, error logs and alarm logs, as well as parsing rule configuration, log configuration, scheduling configuration and fault-tolerance policy configuration. Its core comprises a parser and a second deduplication module:

Parser: as shown in fig. 5, the parser comprises system plug-ins and third-party plug-ins. The system plug-ins provide standard parsers for different data sources such as HTML, JSON, PDF, Image and Binary files. The third-party plug-in mechanism provides an interface through which operators can develop customized parsers, and standard plug-ins can also be developed for a specific type of parser according to the data source. The parser acts as a worker of the monitor and is assigned a specific parsing task, for example extracting the values of certain fields from an HTML page and storing them in a target database; if an error occurs, it records an error log and notifies the monitor before it crashes.

Deduplicator: the deduplicator solves the problem that structuring the same data source multiple times wastes system performance. Data deduplication is implemented by configuring deduplication rules: for example, MD5 or SHA-256 values can be configured to determine whether two pieces of data are the same, and an expiration time can be configured for identical data so that the data is refreshed.
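As a non-limiting illustration, a deduplication rule based on a hash fingerprint and an expiration time might be sketched as follows. The class name, its parameters and the in-memory dictionary are hypothetical; the embodiment does not specify where fingerprints are kept.

```python
import hashlib
import time


class Deduplicator:
    """Fingerprint-based deduplication with an optional expiration time."""

    def __init__(self, algorithm="sha256", ttl_seconds=None):
        self.algorithm = algorithm      # for example "md5" or "sha256"
        self.ttl_seconds = ttl_seconds  # None means fingerprints never expire
        self._seen = {}                 # fingerprint -> time it was last accepted

    def fingerprint(self, data: bytes) -> str:
        return hashlib.new(self.algorithm, data).hexdigest()

    def should_process(self, data: bytes, now=None) -> bool:
        """Return True if the data is new or its earlier record has expired."""
        now = time.time() if now is None else now
        key = self.fingerprint(data)
        seen_at = self._seen.get(key)
        if seen_at is not None and (
            self.ttl_seconds is None or now - seen_at < self.ttl_seconds
        ):
            return False  # still-fresh duplicate: skip structuring it again
        self._seen[key] = now
        return True


dedup = Deduplicator(algorithm="sha256", ttl_seconds=60)
first = dedup.should_process(b"<html>same page</html>", now=0)
second = dedup.should_process(b"<html>same page</html>", now=30)
third = dedup.should_process(b"<html>same page</html>", now=90)
print(first, second, third)
```

The second call is rejected because the same content was accepted 30 seconds earlier; the third is accepted again because the 60-second expiration time has passed, which lets the data be refreshed.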

Referring to fig. 6, the storage layer holds the above databases and includes a MongoDB cluster service, a MySQL cluster service and NFS. It stores both large and small files, partitions storage by business and by shard, supports horizontal scaling with capacity expandable to the petabyte (PB) level, and records file locations in detail to facilitate retrieval.

The system layer mainly modularizes the main functions of the system and turns them into plug-ins, which makes the system easy to clone and port. System management comprises plug-in management, machine management, proxy management, storage management, alarm management, OCR management, NLP management and the like; it is simple to operate, each module can be extended independently and in depth, and development cost is greatly reduced. Authority management comprises five content modules, namely role authority management, user management, group authority, the authority list and the operation log, and can display operable functions and viewable data according to authority. Subtask management allows the progress, error logs and other data of each subtask to be checked visually, which facilitates monitoring, operation and maintenance; it mainly comprises a configuration table, a subtask cluster list and the content of individual subtasks.

A practical application scenario of this embodiment may be as follows: the provider of the data acquisition system supplies the modelled (or standardized) data acquisition system and a deployment scheme to each user (including enterprises, individuals and the like) that needs to collect public data. After deploying the system, each user uses it to collect data. Before collection, the user can define custom rules, or custom generator, downloader and/or parser script parameters, for the generator, the downloader, the parser and so on of the data acquisition system according to the user's own situation or requirements (mainly according to the characteristics of the websites to be collected). The data of the websites to be collected is then crawled according to the corresponding custom rules.

In this embodiment, a distributed, hierarchically cooperating and horizontally scalable asynchronous queue scheme is designed. It supports the acquisition of multiple types of data, multithreaded and multiprocess operation on a single machine, one-click deployment, operation and maintenance, and custom crawler script configuration. The platform is a one-stop general-purpose crawler development and data acquisition platform that can support petabyte-level data capture and minute-level update granularity, and it allows developers to add special-purpose extension plug-ins. On the one hand, the system helps developers quickly build a powerful data acquisition system; on the other hand, it greatly improves the efficiency and scalability of data acquisition, and it can meet application requirements involving strict real-time constraints, complex target websites and large data volumes. The layered design also reduces coupling and system complexity and allows a fault to be located in a specific layer, which facilitates error tracking and customized alarm prompts and further improves system stability. The system can scale elastically in a distributed manner through its quick deployment function according to task volume and load, rapidly increasing its load capacity.

Preferably, when a data acquisition task covers a plurality of websites to be collected, the websites are prioritized so that the task scheduler schedules them in order of priority. Specifically, the task scheduler schedules the generator, the downloader and the parser for the acquisition task of each website in descending order of the priority of the websites related to the data acquisition task, performs the scheduling and control of the whole process, and completes the data acquisition of the websites one after another in that order, thereby achieving the scheduling and data acquisition of the numerous data acquisition tasks. The task scheduler may be subdivided into a generation task scheduler, a download task scheduler, a parsing task scheduler and the like, wherein:

The generation task scheduler is used for acquiring the scheduling configuration information for URI generation of each website page to be collected related to the data acquisition task, so as to obtain the scheduling timing of each website to be collected, and for activating distribution work according to the scheduling timing, so that the corresponding generator for each website page to be collected is inserted into the task distribution queue in order of priority and a process is started to execute the corresponding generator, which generates the URIs of the corresponding website according to the generation rule.

The download task scheduler is used for acquiring the scheduling configuration information for downloading each website page to be collected related to the data acquisition task, so as to obtain the scheduling timing of each website page to be collected, and for activating distribution work according to the scheduling timing, so that the corresponding downloader for each website page to be collected is inserted into the download task distribution queue in order of priority and a process is started to execute the corresponding downloader, which downloads the public data related to the URIs of the corresponding website.

The parsing task scheduler is used for acquiring the scheduling configuration information for parsing the downloaded data of each website page to be collected related to the data acquisition task, so as to obtain the scheduling timing of each website page to be collected, for activating distribution work according to the scheduling timing, and for starting a process to execute the corresponding parser, which parses the public data of the corresponding website.
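As a non-limiting illustration, the priority-ordered scheduling of the generator, the downloader and the parser for several websites might be sketched as follows. The site names, the convention that a smaller number means a higher priority, and the single in-process heap are assumptions made for illustration; the embodiment distributes the stages through separate asynchronous queues.

```python
import heapq

STAGES = ("generate", "download", "parse")


def plan(sites):
    """Return (site, stage) pairs ordered by website priority.

    `sites` is a list of (name, priority) pairs; a smaller number is assumed
    to mean a higher priority. Ties keep their original order.
    """
    heap = [(priority, index, name) for index, (name, priority) in enumerate(sites)]
    heapq.heapify(heap)
    ordered = []
    while heap:
        _, _, name = heapq.heappop(heap)
        ordered.extend((name, stage) for stage in STAGES)
    return ordered


schedule = plan([("site-b", 2), ("site-a", 1), ("site-c", 2)])
print(schedule)
```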

Please refer to fig. 9 to 12 for the work flow of each task scheduler, which is not described in detail herein.

Specifically, in this example, each module of the data acquisition system based on cloud computing technology resides in the cloud and is developed and deployed by the provider. When a user (an enterprise or an individual) needs to collect or capture web page data with the data acquisition system, the user first surveys the website to be collected, then configures the corresponding scheduling configuration information for the website (including but not limited to the subtask ID of the website within the data acquisition task, the scheduling timing of the generator, the downloader and the parser, and the priority of the website within the data acquisition task), configures the URI generator and parser parameters according to the type of the website and the user's own requirements, and, if necessary, configures the downloader parameters. In this way the general-purpose data acquisition system supplied by the provider becomes a customized data acquisition system that meets the user's data acquisition requirements.

Referring to fig. 7, the present invention also discloses a website public data collecting method, which includes the following steps:

S101, loading a data acquisition task;

The data acquisition task may refer to a data acquisition task related to the search content, which is generated by the user through various search keywords.

S102, scheduling a generator according to a data acquisition task to generate a URI (Uniform resource identifier) of each website page to be acquired corresponding to the data acquisition task;

In this step, corresponding to the above embodiment, the generator is scheduled by the generation task scheduler to generate the URI of each website page to be collected, and the generated URIs are stored in the corresponding database for subsequent downloading by the downloader. Referring to fig. 9 and 10, the step of scheduling generation includes the following sub-steps:

First, the scheduling process:

S1021, the Master reads in the scheduling configuration information, which comprises: number of nodes, generator type, generation rules, scheduling frequency, database information, execution timing, etc.;

S1022, the generation task scheduler activates distribution work according to the scheduling timing;

S1023, the generation task scheduler inserts the generator into the distribution queue according to the distribution priority;

S1024, judging whether an error or an exception occurs; if so, proceeding to step S1025, otherwise proceeding to step S1026;

S1025, generating error information, adding it to the error log, and returning to step S1022;

Second, the generation process:

S1026, the Slave acquires the generator from the distribution queue and executes it so as to generate the URIs;

S1027, the Slave starts a process to execute the generator script;

S1028, judging whether the run reports an error; if so, proceeding to step S1029, otherwise proceeding to step S102A;

S1029, adding the error information to the error log and reclaiming the generator;

S102A, saving the resulting URIs in the database, and writing the run log to the generation log.
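As a non-limiting illustration, the scheduling process and the generation process listed above might be sketched in a single process as follows. The function names, the sample generators and the in-memory lists that stand in for the database and the logs are hypothetical; reclaiming a failed generator and returning to the activation step are omitted for brevity.

```python
import queue


def master_dispatch(generators, dispatch_queue, error_log):
    """Scheduling process: insert generators into the queue by priority."""
    for priority, name, script in generators:
        try:
            dispatch_queue.put((priority, name, script))
        except Exception as exc:  # error during insertion: record it
            error_log.append(f"dispatch {name}: {exc}")


def slave_run(dispatch_queue, database, generation_log, error_log):
    """Generation process: take generators from the queue and execute them."""
    while not dispatch_queue.empty():
        _, name, script = dispatch_queue.get()
        try:
            uris = script()
        except Exception as exc:  # error during the run: record it
            error_log.append(f"run {name}: {exc}")
            continue
        database.extend(uris)  # save the resulting URIs
        generation_log.append(f"{name}: {len(uris)} URIs")


def news_generator():
    return [f"https://example.com/news?page={n}" for n in (1, 2)]


def broken_generator():
    raise ValueError("invalid generation rule")


dispatch_queue = queue.PriorityQueue()
database, generation_log, error_log = [], [], []
master_dispatch(
    [(2, "broken", broken_generator), (1, "news", news_generator)],
    dispatch_queue,
    error_log,
)
slave_run(dispatch_queue, database, generation_log, error_log)
print(database, generation_log, error_log)
```

Because a failing generator only adds an entry to the error log, the remaining generators in the queue still run.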

S103, scheduling a downloader according to the generated URI of the website pages to be acquired so as to download original data corresponding to the URI of each website page to be acquired;

In the scheduling process of the generator, the main server loads the scheduling configuration information, triggers tasks periodically via crontab according to the scheduling timing, and inserts task functions into the asynchronous queue according to the priority of the subtasks (a subtask being the generation, downloading, parsing and other processes of a website to be collected within a data acquisition task). If the insertion fails abnormally, the exception information is added to the alarm log and the error information is added to the error log. If the insertion succeeds, a worker picks up the task from the asynchronous queue and performs the next operation.

Referring to fig. 11 and 12, the step of scheduling download includes the following sub-steps:

First, the scheduling step:

S1031, the Master reads in the scheduling configuration information, which comprises: number of nodes, downloader type, crawl frequency, plug-in configuration information, database information, execution timing, access timeout, and the like;

S1032, the download task scheduler activates distribution work according to the scheduling timing;

S1033, checking the priority queue limit; if the check passes, proceeding to step S1034, otherwise ending;

S1034, inserting the downloader into the downloader distribution queue according to the distribution priority;

S1035, judging whether an error or an exception occurs; if so, proceeding to step S1036, otherwise proceeding to step S1037;

S1036, adding the error information to the error log, reactivating the scheduling timing, and returning to step S1034;

Second, the downloading step:

S1037, the Slave acquires the downloader from the downloader distribution queue and executes it;

S1038, the Slave reads in the download configuration information (which can be download configuration parameters defined by the user) of the website corresponding to the subtask of the data acquisition task;

S1039, the Slave reads the URIs under the subtask from the database (the URIs stored in the database in step S102) and marks them as read;

S103A, according to the download configuration information, the Slave runs the downloader to crawl the URI source data;

S103B, judging whether an error occurs; if so, proceeding to step S103C, otherwise proceeding to step S103D;

S103C, adding the error information to the error log and reclaiming the URI and the downloader;

S103D, storing the downloaded source data in the database, and adding the run information to the download log.

In the scheduling process of the downloader (fig. 11), the main server on which the program runs loads the scheduling configuration information of the downloader and activates distribution work according to the crontab timed task. The program pushes the downloader's task function and the URIs into the asynchronous queue according to the priority of the JOB; if the insertion fails abnormally, the exception information is added to the alarm log, the error information is added to the error log, and the failed URIs are reinserted into the next round of download tasks. If the insertion succeeds, a worker picks up the task from the asynchronous queue and performs the next operation. In the execution process of the downloader (fig. 12), the downloader worker fetches a JOB from the downloader asynchronous queue and executes it locally: it reads the download configuration information of the JOB, updates the mark state of the URIs, loads the proxy IP pool and the verification code cracking tool according to the configuration, performs a breadth-first or depth-first traversal of the HTML pages, and runs the downloader to crawl the URI source data. If an error occurs, the error information is added to the error log and the URI and the downloader are reclaimed; otherwise, the downloaded source data is stored in the database and the run information is added to the download log.
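As a non-limiting illustration, the execution process of a downloader worker might be sketched as follows. The function names, the state labels and the `fake_fetch` stand-in (used so that the sketch runs without network access) are hypothetical; proxy pools, verification code handling and page traversal are omitted.

```python
def run_download_job(uris, fetch, uri_state, database, download_log, error_log):
    """Download every URI of one JOB with the supplied fetch callable."""
    for uri in uris:
        uri_state[uri] = "read"  # mark the URI as taken by this worker
        try:
            source = fetch(uri)
        except Exception as exc:
            error_log.append(f"{uri}: {exc}")
            uri_state[uri] = "pending"  # release the URI for the next round
            continue
        database[uri] = source  # store the downloaded source data
        uri_state[uri] = "done"
        download_log.append(f"{uri}: {len(source)} bytes")


def fake_fetch(uri):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    if uri.endswith("/missing"):
        raise TimeoutError("access timeout")
    return b"<html>ok</html>"


uri_state, source_db, download_log, download_errors = {}, {}, [], []
run_download_job(
    ["https://example.com/a", "https://example.com/missing"],
    fake_fetch,
    uri_state,
    source_db,
    download_log,
    download_errors,
)
print(uri_state, download_errors)
```

Passing the fetch function as a parameter mirrors the idea of interchangeable downloader types: an HTTP/HTTPS, proxy-based or dynamic-page downloader could be supplied in its place.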

S104, scheduling a parser according to the public data, downloaded by the downloader, that corresponds to the URI of each website page to be collected, so as to parse the public data downloaded by the downloader and store it in structured form.

In this embodiment, referring to fig. 13, an early warning mechanism is further provided. The main program loads the alarm configuration items and starts a sub-process that monitors the run log, the download log and the error log in real time. Warnings can be raised based on the most recent or most frequent information from the workers and on the number of keyword triggers; if the statistics reach the warning threshold, the warning information is sent by email to the user of the data acquisition system.
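As a non-limiting illustration, the keyword-count part of the early warning mechanism might be sketched as follows. The function name, the sample log lines, the keywords and the threshold are hypothetical, and the sending of the email is omitted.

```python
from collections import Counter


def collect_warnings(log_lines, keywords, threshold):
    """Count keyword hits in log lines and return those reaching the threshold."""
    counts = Counter()
    for line in log_lines:
        for keyword in keywords:
            if keyword in line:
                counts[keyword] += 1
    return {keyword: n for keyword, n in counts.items() if n >= threshold}


error_log_lines = [
    "worker-1 timeout fetching https://example.com/a",
    "worker-2 timeout fetching https://example.com/b",
    "worker-1 captcha required",
    "worker-3 timeout fetching https://example.com/c",
]
warnings = collect_warnings(error_log_lines, ["timeout", "captcha"], threshold=3)
print(warnings)
```

A monitoring sub-process could call such a function periodically on newly appended log lines and notify the user whenever the returned dictionary is not empty.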

Referring to fig. 8, fig. 8 is a flowchart illustrating a website public data collection method according to another embodiment of the present invention. The website public data acquisition method of the embodiment comprises the following substeps:

S201, analyzing a website to be collected, and creating a subtask (JOB) of the website corresponding to each data acquisition task together with its scheduling configuration information, wherein the scheduling configuration information includes but is not limited to: scheduling configuration information of the generator, the downloader and the parser.

S202, configuring corresponding generator parameters or generating rule scripts according to the characteristics of the website to be acquired, so that the generator can generate the URI of the website page to be acquired according to the configured parameters or the generating rules.

S203, configuring corresponding downloader parameters according to the characteristics of the website to be acquired and the data acquisition requirements, so that the downloader can download the public data corresponding to the URI of the website page to be acquired according to the configured parameters.

S204, configuring corresponding parser parameters or parsing rule scripts and data export configuration scripts according to the characteristics of the website to be collected, so that the parser can parse the public data of the website page to be collected according to the configured parameters or parsing rules and store the public data in structured form.

S205, loading a data acquisition task;

S206, scheduling the generator according to the data acquisition task to generate the URI of each website page to be collected corresponding to the data acquisition task;

S207, scheduling a downloader according to the generated URIs of the website pages to be collected so as to download the public data corresponding to the URI of each website page to be collected;

S208, scheduling a parser according to the public data, downloaded by the downloader, that corresponds to the URI of each website page to be collected, so as to parse the public data downloaded by the downloader and store it in structured form.

According to the embodiment of the invention, a distributed multi-task credit data acquisition scheme based on cloud computing technology is constructed. The system adopts a layered architecture, which reduces coupling and system complexity and allows faults to be located in a specific layer, facilitating error tracking and customized alarm prompts and further improving system stability. Companies that rely on data and have public opinion monitoring requirements demand high timeliness, must collect from complicated websites, and handle large data volumes. The system therefore aims to build a crawler framework with distributed hierarchical cooperation, horizontal scalability and asynchronous queues that reduces the coupling between functional modules; through its quick deployment function, the system can scale elastically in a distributed manner according to task volume and load, rapidly increasing its load capacity. By comparison, traditional web crawlers place high demands on developers, are slow to develop, have low general applicability and high development, operation and maintenance costs, and cannot adequately meet the requirements of mass monitoring, high real-time performance and large data volumes. The invention addresses these problems of traditional crawlers, helps developers quickly build a powerful data acquisition system, supports custom crawler script configuration, and greatly increases the speed at which new crawlers are developed. The intelligent credit data acquisition system based on cloud computing technology thus supports custom configuration.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A data acquisition system based on a cloud computing technology comprises a task scheduler, a generator, a downloader and a parser;

characterized in that: the task scheduler is used for scheduling the generator, the downloader and the parser according to each data acquisition task so as to acquire data related to each data acquisition task;

the parser is used for parsing the original data downloaded by the downloader into structured data according to the schedule of the task scheduler,

the system also comprises a generator self-defining module which is used for a user to self-define a corresponding generator generating rule in advance according to the characteristics of the website page to be acquired, so that the generator can generate the URI of the website to be acquired according to the corresponding rule;

the system also comprises a downloader self-defining module which is used for a user to self-define a corresponding downloader rule in advance according to the characteristics of the website to be acquired, so that the downloader can download the original data corresponding to the URI of the website to be acquired according to the corresponding rule;

the system further comprises a parser self-defining module which is used for a user to self-define a corresponding parser parsing rule in advance according to the data characteristics of the website page to be collected, so that the parser can parse the original data of the website to be collected downloaded by the downloader according to the corresponding rule;

the data storage module is used for storing the configuration information and the user-defined rules related to the data acquisition task; it is also used for storing the URIs, generated by the generator, of the related pages of each website to be collected for the data acquisition task, for the downloader to call; it is also used for storing the original data of each website to be collected downloaded by the downloader, for the parser to call; and it is also used for storing the structured data parsed by the parser;

the system further comprises an acquisition layer, a structural layer and a storage layer, wherein the storage layer comprises the data storage module and a duplication preventer, the acquisition layer comprises the generator and the downloader, and the structural layer comprises the parser;

the system layer is used for managing each data acquisition task and subtasks corresponding to each website, managing the system and managing the authority, wherein the system management comprises plug-in management, storage management, alarm management, anti-duplication management and machine management, the plug-in management comprises a generator, a downloader, a resolver, verification code identification, an IP agent, an OCR module and an NLP module management, the authority management comprises role authority management, user management, group authority, an authority list and operation log management, the subtask management is used for visually checking the progress and error log data of each subtask so as to conveniently monitor operation and maintenance,

the task scheduler schedules the generator, the downloader and the parser for the corresponding website acquisition tasks from first to last according to the website priority order related to the data acquisition tasks, finishes the scheduling and control of the whole process, and finishes the data acquisition of the corresponding websites from first to last in sequence, thereby realizing the scheduling and data acquisition of a plurality of acquisition tasks;

the task scheduler comprises a generated task scheduler, a task distribution queue and a task scheduling queue, wherein the generated task scheduler is used for acquiring scheduling configuration information generated by a URI (Uniform resource identifier) of each website page to be acquired related to the data acquisition task so as to acquire a scheduling time sequence of each website to be acquired, and activating distribution work according to the scheduling time sequence, so that a corresponding generator is inserted into each website page to be acquired from beginning to end in sequence to the task distribution queue, and a process is started to execute the corresponding generator so as to generate the URI of the corresponding website page according to a corresponding rule;

the task scheduler comprises a download task scheduler, which is used for acquiring scheduling configuration information downloaded by each website page to be acquired related to the data acquisition task so as to acquire a scheduling time sequence of each website page to be acquired, and activating distribution work according to the scheduling time sequence, so that a corresponding downloader is sequentially inserted into each website page to be acquired from beginning to end to a download task distribution queue, and a process is started to execute the corresponding downloader, so that the original data related to the corresponding website page URI is downloaded;

the task scheduler comprises an analysis task scheduler which is used for acquiring scheduling configuration information for analyzing the original data of each website page to be acquired related to the data acquisition task so as to acquire the scheduling time sequence of each website page to be acquired, activating distribution work according to the scheduling time sequence, and starting a process to execute a corresponding analyzer so as to analyze the original data related to the corresponding website into structured data.

2. A website public data acquisition method applying the data acquisition system based on the cloud computing technology according to claim 1, comprising the following steps:

loading a data acquisition task;

scheduling a parser according to the original data corresponding to the URI corresponding to each website page to be acquired downloaded by the downloader to parse and structurally store the original data downloaded by the downloader;

the method is characterized by further comprising the following steps before the step of acquiring the data acquisition task:

configuring corresponding generator parameters or generating rule scripts according to the characteristics of the website to be acquired, so that the generator can generate a URI of the website page to be acquired according to the configured parameters or the generating rules;

before the step of acquiring the data collection task, the method further comprises the following steps:

configuring corresponding downloader parameters according to the characteristics of the website to be acquired and data acquisition requirements, and configuring and activating related plug-ins, so that the downloader can download public data corresponding to the URI of the website page to be acquired according to the configured parameters;

and configuring corresponding resolver parameters, resolution rule scripts and data export configuration scripts according to the characteristics of the website to be acquired, so that the resolver can resolve the original data of the website page to be acquired according to the configured parameters or resolution rules and store the original data in the data storage module in a structured manner.