CN111753169A - Data acquisition system based on internet - Google Patents
- Publication number
- CN111753169A (application number CN202010604543.XA)
- Authority
- CN
- China
- Prior art keywords
- module
- configuration
- acquisition
- task
- interface
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/972—Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
Abstract
The invention relates to an Internet-based data acquisition system comprising an acquisition server and an acquisition client, wherein the acquisition client comprises a plurality of acquisition components, a plurality of pool components and a plurality of interface components; the acquisition components comprise a configuration parser, a task scheduler, a flow controller, a flow executor and a flow returner; the pool components comprise an acquisition task pool, a configuration pool and an acquisition thread resource pool; the interface components comprise an acquisition task interface, an acquisition configuration interface and a data return interface; wherein at least one of said acquisition components invokes at least one of said pool components via at least one of said interface components. The Internet-based data acquisition system offers advantages such as loose coupling, high robustness, fast flow of acquired data and a simple implementation, and is applicable to fields such as finance and manufacturing.
Description
Technical Field
The invention belongs to the field of databases for collecting and retrieving data from the Internet, and in particular relates to an Internet-based data acquisition system.
Background
A known data acquisition system comprises an acquisition client and a server. When such a system is built, its performance varies considerably depending on the functions, the data flow and the main implementation method chosen for the system.
Disclosure of Invention
In order to solve the above problems, the invention provides an Internet-based data acquisition system comprising an acquisition server and an acquisition client, wherein the acquisition client comprises a plurality of acquisition components, a plurality of pool components and a plurality of interface components; the acquisition components comprise a configuration parser, a task scheduler, a flow controller, a flow executor and a flow returner; the pool components comprise an acquisition task pool, a configuration pool and an acquisition thread resource pool; the interface components comprise an acquisition task interface, an acquisition configuration interface and a data return interface; wherein at least one of said acquisition components invokes at least one of said pool components via at least one of said interface components.
The Internet-based data acquisition system offers advantages such as loose coupling, high robustness, fast flow of acquired data and a simple implementation, and is applicable to fields such as finance and manufacturing.
Drawings
FIG. 1 is a schematic diagram of a system architecture;
FIG. 2 is a schematic diagram of task execution management;
FIG. 3 is a schematic diagram of acquisition tasks;
FIG. 4 is a schematic diagram of an acquisition configuration;
FIG. 5 is a schematic diagram of a data return configuration;
FIG. 6 is a schematic diagram of an acquisition agent configuration;
FIG. 7 is a schematic diagram of the code printing (captcha-solving) configuration.
Detailed Description
The system architecture of the Internet-based data acquisition system of the present invention can be implemented in various ways, and the following exemplary system architecture should not be taken as a specific limitation on the scope of the present invention. In these embodiments, referring to FIG. 1, the system architecture of the data acquisition system comprises: a DB storage layer (SDB/PostgreSQL, SDB API) and MySQL; a service logic layer (task queues, task scheduling, task generators, secondary task creation, data cleaning, agent management, log management, verification code scheduling, acquisition client updating, acquisition client state management, and receiving of returned acquisition data); user management (acquisition configuration, task management, task monitoring and acquisition client management); third-party components (ActiveMQ™, Couchbase); an interface layer (Mina for real-time connections, WebService for call-and-return); and clients (a server acquisition point (Python), Windows acquisition client software (Java, C#), and an IE browser plug-in (C#)).
In some embodiments of the Internet-based data acquisition system of the present invention, the system comprises an acquisition server and an acquisition client, wherein the acquisition client comprises a plurality of acquisition components, a plurality of pool components and a plurality of interface components; the acquisition components comprise a configuration parser, a task scheduler, a flow controller, a flow executor and a flow returner; the pool components comprise an acquisition task pool, a configuration pool and an acquisition thread resource pool; the interface components comprise an acquisition task interface, an acquisition configuration interface and a data return interface; wherein at least one of said acquisition components invokes at least one of said pool components via at least one of said interface components.
In some embodiments of the present invention, referring to fig. 2, when the collection client executes a collection task, the collection component is configured as follows:
the configuration parser is a configuration parsing main module whose secondary modules comprise a configuration instantiation module, a configuration version comparison module and a configuration pool generation module, wherein the configuration instantiation module is configured to instantiate a configuration on the local disk as a JSON-format file (the configuration comprises the parameters required for task execution, such as the request method, request headers, URL template, whether a verification code is required, data extraction rules and flow nodes); the configuration version comparison module is configured to compare the configuration of tasks returned from the acquisition server; and the configuration pool generation module is configured to place the configurations of unexecuted tasks into the configuration pool;
the task scheduler is a task scheduling main module configured such that: through the task scheduling main module, an acquisition thread obtains the acquisition configuration and flow configuration of the corresponding task from the configuration pool, places them into the acquisition thread resource pool, and starts the flow controller; the secondary modules of the task scheduling main module comprise a scheduling algorithm module, an acquisition thread starting module, a task interface obtaining module, an acquisition thread resource pool generating module, an acquisition thread resource pool clearing module, an acquisition task pool clearing module and a configuration pool clearing module;
the flow controller is a flow controller main module configured to: obtain the flow configuration from the acquisition thread resource pool, instruct the flow executor to execute the flow nodes in the order given in the flow configuration, and, when an execution result is abnormal, handle it according to the exception handling conditions of the flow configuration; the secondary modules of the flow controller main module comprise a flow node execution module, a configuration version comparison module and a configuration pool generation module;
the flow executor is a flow executor main module configured to: execute node services and send the execution result to the flow controller, wherein the node services include but are not limited to login, page fetching, standardized processing and extraction; the secondary modules of the flow executor main module comprise a flow node execution module, an acquisition agent module, a code printing module, a login module, a page fetching module, a standardization module and a digital extraction module, wherein the flow node execution module is configured to take the flow node configuration and the return value of the previous flow node as input, execute the code logic defined by the node, and output the return value of the current flow node;
the flow returner is a flow return main module configured to: receive the last-node information sent by the flow controller and send flow end information to the task scheduler; the secondary modules of the flow return main module comprise a return data generating module and a module for calling the data return interface.
In some embodiments of the invention, the configuration parser, when executed, calls the acquisition configuration interface to obtain the task to be configured and first compares it with the local configuration file: if the local configuration version is the same as the flow configuration version, the local configuration is put into the configuration pool; if the versions differ, the corresponding configuration is downloaded from the acquisition server and then put into the configuration pool. The configuration files include, but are not limited to: acquisition client ID, acquisition client version, maximum number of tasks to fetch, threshold for the number of tasks in the acquisition task pool, maximum number of threads allowed, and acquisition client computer configuration information.
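The version check described above can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the field names (`config_guid`, `config_version`, `task_guid`, `version`), the one-file-per-configuration layout and the `download` callable are all assumptions introduced for the example.

```python
import json
from pathlib import Path


def resolve_config(task, local_dir, download, config_pool):
    """Put the configuration for `task` into `config_pool`, preferring the local copy.

    `download` is a hypothetical callable that fetches a configuration
    (a dict) from the acquisition server by its GUID.
    """
    path = Path(local_dir) / f"{task['config_guid']}.json"
    local = json.loads(path.read_text(encoding="utf-8")) if path.exists() else None

    if local is not None and local.get("version") == task["config_version"]:
        # Same version: reuse the configuration already on the local disk.
        config = local
    else:
        # Missing or different version: download and re-instantiate as JSON.
        config = download(task["config_guid"])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(config), encoding="utf-8")

    config_pool[task["task_guid"]] = config
    return config
```

With this layout, a second task that references the same configuration version is served from disk and does not trigger another download.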
In some embodiments of the invention, the factors configured for the task scheduler include, but are not limited to, the latest execution time of the task (an absolute value; in most embodiments of the invention a task has a validity period, for example a task containing session authentication information typically expires after 30 minutes and is only meaningful if executed within that period), the task priority, the website collection frequency, and the maximum number of threads allowed by the acquisition client. The scheduling algorithm module takes the scheduling factors and the acquisition task pool as input and outputs the ID of the task to be executed; the task acquisition interface module takes the acquisition client ID as input and outputs the queue of tasks to be executed; the acquisition thread resource pool generation module is configured to read the configuration information corresponding to the current acquisition task from the configuration pool and generate the acquisition thread resource pool.
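One way the scheduling factors above could be combined is sketched below. The patent does not specify the algorithm, so the ordering rule (highest priority first, then earliest deadline), the per-site minimum interval used to model collection frequency, and all names in the code are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    site: str
    priority: int            # assumption: a larger value means more urgent
    latest_exec_time: float  # absolute deadline, seconds since the epoch


def pick_next_task(task_pool, now, last_hit, min_interval, running, max_threads):
    """Return the ID of the next task to run, or None if nothing is eligible."""
    if running >= max_threads:
        return None  # the client's thread limit is already reached

    candidates = [
        t for t in task_pool
        # Drop tasks whose validity period has passed.
        if t.latest_exec_time >= now
        # Respect the per-site collection frequency.
        and now - last_hit.get(t.site, float("-inf")) >= min_interval.get(t.site, 0.0)
    ]
    if not candidates:
        return None

    best = min(candidates, key=lambda t: (-t.priority, t.latest_exec_time))
    return best.task_id
```

The input mirrors the description (scheduling factors plus the task pool) and the output is a single task ID, matching the stated input and output of the scheduling algorithm module.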
In some embodiments of the present invention, the flow controller determines from the returned execution result whether exception handling is required; if so, it executes exception handling scheduling, and if not, it determines whether the current node is the last node. During exception handling scheduling, if the failure is partial, the results of the exception handling scheduling and of normal sequential scheduling are sent to the flow executor; if the failure is complete, they are sent to the flow returner. When determining whether the node is the last node, if it is, the flow returner is notified; if it is not, normal sequential scheduling continues;
when the flow node execution module is executed, its input is the flow exception configuration and the return value of the previous flow node, and its output is the name of the next node; when the configuration version comparison module is executed, the task configuration returned from the acquisition server is compared and, if it has been updated, the task configuration is updated; and when the configuration pool generation module is executed, the configurations of unexecuted tasks are placed into the configuration pool.
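The control loop described above might look like the following sketch. It interprets a "partial" failure as one that can still be retried and a "complete" failure as one whose retries are exhausted; that reading, the `max_retries` field and the shape of the flow configuration are assumptions, since the source does not define them.

```python
def run_flow(flow_config, executor, resources):
    """Run the flow nodes in configured order and return what the flow returner would receive.

    `executor(name, node_config, previous, resources)` is a hypothetical
    callable standing in for the flow executor.
    """
    previous = None
    for node in flow_config["nodes"]:
        attempts = 0
        while True:
            try:
                previous = executor(node["name"], node, previous, resources)
                break  # normal result: continue with sequential scheduling
            except Exception as exc:
                attempts += 1
                if attempts > node.get("max_retries", 0):
                    # Complete failure: hand over to the flow returner.
                    return {"status": "failed", "node": node["name"], "error": str(exc)}
                # Partial failure: send the node back to the executor.

    # The last node has finished: notify the flow returner.
    return {"status": "done", "result": previous}
```

Each node receives the return value of the previous node, as stated for the flow node execution module.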
In some embodiments of the present invention, the flow executor executes the program of each flow node; all service resources required by the program are predefined in the acquisition thread resource pool, and the processing results are written back to the acquisition thread resource pool. The flow nodes included in the acquisition client include, but are not limited to, login, page fetching, standardization and data extraction; each flow node defines a different entry address and executes a different code block.
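A small sketch of flow nodes with separate entry points that read from and write to a shared resource pool is given below. The registry, the node names and the resource keys (`url`, `pages`, `pattern`, `results`) are hypothetical, and the page-fetching node reads from an in-memory stub instead of issuing a real HTTP request.

```python
import re

NODE_REGISTRY = {}


def node(name):
    """Register a function as the entry point of the flow node `name`."""
    def register(func):
        NODE_REGISTRY[name] = func
        return func
    return register


@node("fetch_page")
def fetch_page(resources, previous):
    # Stub: a real node would request resources["url"] over the network.
    return resources["pages"][resources["url"]]


@node("normalize")
def normalize(resources, previous):
    # Collapse runs of whitespace into single spaces.
    return " ".join(previous.split())


@node("extract")
def extract(resources, previous):
    return re.findall(resources["pattern"], previous)


def execute_node(name, resources, previous=None):
    """Run one node and write its result back into the resource pool."""
    result = NODE_REGISTRY[name](resources, previous)
    resources.setdefault("results", {})[name] = result
    return result
```

For example, running `fetch_page`, `normalize` and `extract` in sequence on a stub page leaves each intermediate result under `resources["results"]`.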
In some embodiments of the present invention, each time the acquisition client finishes executing an acquisition task, the interface component is configured as follows:
the get collection task interface is configured to: check the number of tasks in the collection task pool and, when that number falls below a given threshold, call the get collection task interface to fetch more tasks; the number of tasks returned is set by the acquisition server;
the pool components are configured as follows:
the collection task pool is configured to: store tasks that have been obtained from the acquisition server but not yet collected;
the configuration pool is configured to: store all configurations of tasks not yet collected, including acquisition configurations and flow configurations;
the collection thread resource pool is configured to: store all resources required for collection; a collection task reads the content it needs from the pool and writes its results into the pool; the collection thread resource pool is initialized before each single collection task begins.
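The threshold-triggered refill of the collection task pool can be sketched as below. The class and its `fetch_tasks` callable are hypothetical stand-ins for the pool and the get collection task interface; the batch size is decided by whatever the callable returns, in line with the statement that the server sets the number of returned tasks.

```python
from collections import deque


class TaskPool:
    """Collection task pool that refills itself when it runs low."""

    def __init__(self, client_id, threshold, fetch_tasks):
        self.client_id = client_id
        self.threshold = threshold
        self._fetch_tasks = fetch_tasks  # hypothetical call to the server interface
        self._tasks = deque()

    def refill_if_needed(self):
        # Only contact the server when fewer than `threshold` tasks remain.
        if len(self._tasks) < self.threshold:
            self._tasks.extend(self._fetch_tasks(self.client_id))

    def next_task(self):
        self.refill_if_needed()
        return self._tasks.popleft() if self._tasks else None
```

Checking the threshold before each task is handed out keeps the number of interface calls low while avoiding an empty pool.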
In some embodiments of the present invention, the collection server includes a server task scheduling main module, a server collected return data processing main module, a server agent verification main module, a server code printing service main module, and a plurality of server interface components;
at least one server interface is configured for obtaining acquisition tasks; referring to FIG. 3, the process is: receive the get-task request; parse the interface data (acquisition client ID); read the acquisition client configuration (acquisition client ID, acquisition client version, maximum number of tasks to fetch, threshold for the number of tasks in the task pool, maximum number of threads allowed, acquisition client computer configuration information); schedule tasks (factors influencing scheduling include the task's absolute time, whether an acquisition client is specified, the priority, and the number of tasks the acquisition client may fetch); assemble the return data and return it through the interface (task format and interface format); and write back the task state;
at least one server interface is configured for obtaining configurations; as shown in FIG. 4, the process is: receive the get-configuration request; parse the interface data (configuration ID, task GUID); read the configuration table; and assemble the return data and return it through the interface (task format and interface format);
at least one server interface is configured for data return; as shown in FIG. 5, the process is: receive the returned data; parse the interface data; store the collected data; and, if successful, generate subtasks (acquisition configuration, list subtasks and detail subtasks);
at least one server interface is configured for obtaining agents (proxies); as shown in FIG. 6, the process is: receive the get-agent request; parse the interface data; take an agent; assemble the return data and return it through the interface; and update the agent information;
at least one server interface is configured for code printing (captcha solving); as shown in FIG. 7, the process is: receive the code printing request; parse the interface data; call the code printing interface (by code printing type); assemble the return data and return it through the interface; and record the code printing information (code printing information, time, task id, code printing type, code printing picture, returned value).
In some embodiments of the invention, the server task scheduling main module is configured to: exercise overall control over the acquisition tasks of the acquisition clients, scheduling according to the latest execution time of each task, whether it is bound to an acquisition client, its priority, and the maximum number of tasks the acquisition client may execute; the secondary modules of the server task scheduling main module comprise a scheduling algorithm module, an acquisition client configuration module and a task state write-back module, wherein the acquisition client configuration handled by the acquisition client configuration module includes the number of tasks returned;
the server collected return data processing main module is configured such that: after acquiring data, the acquisition client returns the result to the acquisition server; upon receiving the data, the acquisition server processes it in this module and, as appropriate, stores it or generates subtasks; the secondary modules of the server collected return data processing main module comprise a collected data analysis module, a subtask generation module, a subtask storage module, a collected data storage module and an abnormal data storage module; when the collected data analysis module is executed, the collected result data indicates success or failure and, on success, the data type; when the subtask generation module is executed, the subtask is a list task or a detail task according to the information in the configuration file; and when the abnormal data storage module is executed, collected data whose storage failed is placed into an abnormal database.
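The handling of returned data described above can be sketched as follows. The payload fields (`success`, `data`, `links`, `task_guid`), the `subtask_type` configuration key and the two storage callables are assumptions; the sketch only shows the branching between normal storage, the abnormal database and subtask generation.

```python
def process_return(payload, config, store, abnormal_store):
    """Process one returned result and return the subtasks to create.

    `store` and `abnormal_store` are hypothetical callables that write to
    the normal database and to the abnormal database respectively.
    """
    if not payload.get("success"):
        return []  # failed collection: nothing to store, no subtasks

    try:
        store(payload["data"])
    except Exception as exc:
        # Storage failed: keep the data in the abnormal database instead.
        abnormal_store({
            "task_guid": payload["task_guid"],
            "data": payload["data"],
            "error": str(exc),
        })
        return []

    kind = config.get("subtask_type")  # assumed values: "list", "detail" or absent
    if kind is None:
        return []
    return [
        {"type": kind, "url": url, "parent": payload["task_guid"]}
        for url in payload["data"].get("links", [])
    ]
```

Subtasks are generated only after the collected data has been stored successfully, which is one reading of the "(if successful)" condition in the data return process.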
In some embodiments of the invention, the server agent validation main module is configured to: verify the validity of agents and remove invalid agents. The agent maintenance main module comprises an agent verification module, an agent deletion module and an agent information change module, wherein the agent verification module takes an agent as input and outputs whether the agent is valid or invalid, the agent deletion module removes invalid agents, and the agent information change module records the use of agents by tasks;
the server code printing service main module is configured such that: the code printing service calls an external interface to process the verification code uploaded by the acquisition client; the secondary modules of the code printing service main module comprise a code printing type definition module, an external interface calling module and a code printing information storage module, wherein when the external interface calling module is executed, an invalid agent is removed, and when the code printing information storage module is executed, the task GUID, the verification code picture and the code printing result are obtained;
the server data update main module is configured to: periodically update the acquired data, since data on the Internet is updated at irregular intervals; updates are divided into light updates and full updates, wherein a light update only compares URLs (and is therefore faster) and a full update re-collects the data of the specified website; the secondary modules of the server data update main module comprise a timer module, a light update module and a full update module, wherein the timer module triggers update operations at regular intervals, the light update module compares only URLs, and the full update module collects the website data again.
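The difference between the two update modes can be illustrated with the short sketch below. The function names and the `fetch` callable are hypothetical, and the timer that would trigger these functions periodically is left out.

```python
def light_update(known_urls, current_urls):
    """Compare URL sets only; no page content is fetched."""
    known, current = set(known_urls), set(current_urls)
    return {
        "added": sorted(current - known),
        "removed": sorted(known - current),
    }


def full_update(urls, fetch):
    """Collect every page of the site again using the hypothetical `fetch` callable."""
    return {url: fetch(url) for url in urls}
```

A light update costs one set comparison per site, whereas a full update costs one fetch per URL, which is why the description calls the light update faster.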
In some embodiments of the invention, the interface components are configured as follows:
Get tasks
The acquisition client obtains acquisition tasks from the acquisition server through this interface;
parameters sent by the acquisition client: acquisition client ID;
parameters returned by the acquisition server: the number of tasks and the task set;
Data return
The data collected by the acquisition client is returned to the acquisition server through this interface;
parameters sent: task GUID, return identification, data packet (JSON), source file name, file type;
Get configuration
All configuration information is obtained from this interface, including login configuration and acquisition configuration (the acquisition configuration includes the flow configuration);
parameters sent by the acquisition client: configuration GUID, configuration type (login, acquisition);
parameters returned by the acquisition server: configuration GUID, configuration content and configuration type;
Get agent
parameters sent by the acquisition client: acquisition client ID and task GUID;
parameters returned by the acquisition server: proxy type, proxy IP, port, user name and password;
Code printing request
parameters sent by the acquisition client: task id, code printing type and code printing picture;
parameters returned by the acquisition server: task id, code printing value.
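The parameter lists above can be written down as message types, as in the sketch below. Only the parameters themselves come from the description; the class names, field names, field types and the JSON wire encoding are assumptions chosen for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GetTasksRequest:
    client_id: str


@dataclass
class GetTasksResponse:
    task_count: int
    tasks: list


@dataclass
class DataReturnRequest:
    task_guid: str
    return_flag: str
    data_packet: str  # JSON-encoded payload
    source_file_name: str
    file_type: str


@dataclass
class ProxyResponse:
    proxy_type: str
    ip: str
    port: int
    username: str
    password: str


def to_wire(message):
    """Serialize a message to JSON (the wire format is an assumption)."""
    return json.dumps(asdict(message), sort_keys=True)
```

For instance, `to_wire(GetTasksRequest("c1"))` yields a JSON object with a single `client_id` field, matching the one parameter the client sends when fetching tasks.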
Implementations and functional operations of the subject matter described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware, including the structures disclosed in this specification and their structural equivalents, or combinations of more than one of the foregoing. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on one or more tangible, non-transitory program carriers, for execution by, or to control the operation of, data processing apparatus.
Alternatively or in addition, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution with a data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of the foregoing.
A computer program (which may also be referred to or described as a program, software application, module, software module, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in: in a markup language document; in a single file dedicated to the relevant program; or in multiple coordinated files, such as files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having: a display device, for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to a user; and a keyboard and a pointing device, such as a mouse or trackball, by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with the user; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. In addition, the computer may interact with the user by sending documents to a device used by the user and receiving documents from the device; for example, by sending a web page to a web browser on the user's acquisition client device in response to a request received from the web browser.
Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., a data acquisition server, or that includes a middleware component, e.g., an application acquisition server, or that includes a front-end component, e.g., an acquisition client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components in the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), e.g., the Internet. The computing system may include an acquisition client and an acquisition server. The acquisition client and acquisition server are typically remote from each other and typically interact through a communication network. The relationship between the acquisition client and the acquisition server arises by virtue of computer programs running on the respective computers and having an acquisition client-acquisition server relationship to each other.
Claims (10)
1. A data acquisition system based on the Internet comprises an acquisition server and an acquisition client, and is characterized in that the acquisition client comprises a plurality of acquisition components, a plurality of pool components and a plurality of interface components; the acquisition assembly comprises a configuration analyzer, a task scheduler, a flow controller, a flow executor and a flow returner; the pool component comprises an acquisition task pool, a configuration pool and an acquisition thread resource pool; the interface component comprises an acquisition task interface, an acquisition configuration interface and a data return interface; wherein at least one of said acquisition components invokes at least one of said pool components via at least one of said interface components.
2. The system of claim 1, wherein, when the acquisition client performs an acquisition task, the acquisition components are configured as follows:
the configuration parser is a configuration parsing main module whose secondary modules comprise a configuration instantiation module, a configuration version comparison module and a configuration pool generation module, wherein the configuration instantiation module is configured to persist a configuration to the local disk as a JSON file; the configuration version comparison module is configured to compare the configurations of tasks returned from the acquisition server; and the configuration pool generation module is configured to place the configurations of unexecuted tasks into the configuration pool;
the task scheduler is a task scheduling main module configured such that an acquisition thread obtains, through the task scheduling main module, the acquisition configuration and flow configuration of the corresponding task from the configuration pool, places them into the acquisition thread resource pool, and starts the flow controller; the secondary modules of the task scheduling main module comprise a scheduling algorithm module, an acquisition thread starting module, a task acquisition interface module, an acquisition thread resource pool generating module, an acquisition thread resource pool clearing module, an acquisition task pool clearing module and a configuration pool clearing module;
the flow controller is a flow controller main module configured to: obtain the flow configuration from the acquisition thread resource pool, instruct the flow executor to invoke the flow nodes in the execution order given in the flow configuration, and, when an execution result is abnormal, handle it according to the exception handling settings of the flow configuration; the secondary modules of the flow controller main module comprise a flow node execution module, a configuration version comparison module and a configuration pool generation module;
the flow executor is a flow executor main module configured to: execute node services and send the execution result to the flow controller, wherein the node services include but are not limited to login, page fetching, standardization and extraction; the secondary modules of the flow executor main module comprise a flow node execution module, an acquisition agent module, a coding module, a login module, a page fetching module, a standardization module and a data extraction module, wherein the flow node execution module is configured to take the flow node configuration and the previous flow node's return value as input, execute the code logic defined by the node, and output the current flow node's return value;
the flow returner is a flow return main module configured to: receive the last-node information sent by the flow controller and send flow-end information to the task scheduler; the secondary modules of the flow return main module comprise a return data generating module and a data return interface calling module.
3. The system of claim 1, wherein the configuration parser, when executed: calls the acquisition configuration interface to obtain the tasks to be configured, and first compares each task against the local configuration file; if the local configuration version is the same as the flow configuration version, the local configuration is placed into the configuration pool; if the versions differ, the corresponding configuration is downloaded from the acquisition server and then placed into the configuration pool; the configuration file includes, but is not limited to: the acquisition client ID, the acquisition client version, the maximum number of tasks to fetch, the task-count threshold of the acquisition task pool, the maximum allowed number of threads, and the acquisition client computer's configuration information.
4. The system of claim 1, wherein the configuration of the task scheduler includes, but is not limited to: the task's latest execution time (an absolute value), the task priority, the website acquisition frequency, and the maximum number of threads allowed by the acquisition client; in most embodiments of the invention a task has an expiration time, e.g., a task containing session authentication information typically expires after 30 minutes and is meaningful only if executed within its validity period; the scheduling algorithm module is configured to take the scheduling factors and the acquisition task pool as input and to output the ID of the task to be executed; the task acquisition interface module is configured to take the acquisition client ID as input and to output the queue of tasks to be executed; the acquisition thread resource pool generating module is configured to read the configuration information corresponding to the current acquisition task from the configuration pool and generate the acquisition thread resource pool.
5. The system of claim 1, wherein the flow controller is configured to determine, from the returned execution status, whether exception handling is required; if so, to execute exception handling scheduling; and if not, to determine whether the node is the last node; during exception handling scheduling, if the failure is partial, the results of the exception handling scheduling and of normal sequential scheduling are sent to the flow executor, and if the failure is complete, they are sent to the flow returner; when determining whether the node is the last node, if it is, the flow returner is notified, and if it is not, normal sequential scheduling is performed;
when the flow node execution module is executed, its inputs are the flow exception configuration and the previous flow node's return value, and its output is the name of the next node; when the configuration version comparison module is executed, the task configuration returned from the acquisition server is compared, and the local task configuration is updated if it has changed; when the configuration pool generation module is executed, the configurations of unexecuted tasks are placed into the configuration pool.
6. The system according to claim 1, wherein, during execution, the flow executor runs the program of each flow node; all business resources required by the program are predefined in the acquisition thread resource pool, and processing results are written back to the acquisition thread resource pool; the flow nodes of the acquisition client include, but are not limited to: login, page fetching, standardization and data extraction, wherein each flow node defines a different entry address and executes a different code block.
7. The system of claim 1, wherein, each time the acquisition client performs an acquisition task, the interface components are configured as follows:
the acquisition task interface is configured to: check the number of tasks in the acquisition task pool and, when that number falls below a set threshold, invoke the task-obtaining function to call the acquisition task interface; the number of tasks returned is set by the acquisition server;
the pool components are configured as follows:
the acquisition task pool is configured to: store tasks obtained from the acquisition server that have not yet been acquired;
the configuration pool is configured to: store all configurations of tasks not yet acquired, including acquisition configurations and flow configurations;
the acquisition thread resource pool is configured to: store all resources required for acquisition, the acquisition task reading the content it needs from the pool and writing its results back to it; the acquisition thread resource pool is initialized before each individual acquisition task begins.
8. The system of claim 1, wherein the acquisition server comprises a server task scheduling main module, a server acquired-return-data processing main module, a server agent verification main module, a server coding service main module and a plurality of server interface components;
at least one server interface is configured for obtaining acquisition tasks; referring to fig. 3, the process is: obtain acquisition task; parse interface data (acquisition client ID); read the acquisition client configuration (acquisition client ID, acquisition client version, maximum number of tasks to fetch, task-count threshold of the acquisition task pool, maximum allowed number of threads, acquisition client computer configuration information); task scheduling (factors influencing scheduling, such as the task's absolute time, whether an acquisition client is specified, priority, and the number of tasks the acquisition client fetches); assemble return data and return via the interface (task format and interface format); and write back the task state;
at least one server interface is configured for obtaining configurations; as shown in fig. 4, the process is: obtain configuration; parse interface data (configuration ID, task GUID); read the configuration table; assemble return data and return via the interface (task format and interface format);
at least one server interface is configured for data return; as shown in fig. 5, the process is: data return; parse interface data; store the acquired data; and, if successful, generate subtasks (acquisition configuration, list subtasks and detail subtasks);
at least one server interface is configured for obtaining agents; as shown in fig. 6, the process is: obtain agent; parse interface data; fetch an agent; assemble return data and return via the interface; and update the agent information;
at least one server interface is configured for coding; as shown in fig. 7, the process is: coding request; parse interface data; call the coding interface (coding type); assemble return data or return via the interface; and record the coding information (coding information, time, task ID, coding type, coding picture, returned value).
9. The system of claim 1, wherein the server task scheduling main module is configured to: exercise overall control over the acquisition tasks of the acquisition clients, scheduling according to each task's latest execution time, whether it is bound to an acquisition client, its priority, and the acquisition client's maximum execution number; the secondary modules of the server task scheduling main module comprise a scheduling algorithm module, an acquisition client configuration module and a task state write-back module, wherein the acquisition client configuration includes the number of tasks to return;
the server acquired-return-data processing main module is configured such that: after acquiring data, the acquisition client returns the result to the acquisition server, which processes the data in this function upon receipt and, depending on the situation, stores the data or generates subtasks; the secondary modules of the server acquired-return-data processing main module comprise an acquired data analysis module, a subtask generation module, a subtask storage module, an acquired data storage module and an abnormal data storage module; when the acquired data analysis module is executed, the acquisition result data comprises a success or failure status and, on success, a data type; when the subtask generation module is executed, the subtask is a list task or a detail task according to the information in the configuration file; when the abnormal data storage module is executed, acquired data whose storage failed is placed into an abnormal database.
10. The system of claim 1, wherein the server agent verification main module is configured to: verify the validity of agents and remove invalid agents; the agent maintenance main module comprises an agent verification module, an agent deletion module and an agent information change module, wherein, when the agent verification module is executed, its input is the agent to be verified and its output is whether the agent is valid or invalid; when the agent deletion module is executed, invalid agents are removed; and when the agent information change module is executed, the use of agents by tasks is recorded;
the server coding service main module is configured such that: the coding service calls an external interface to process the verification code uploaded by the acquisition client; the secondary modules of the coding service main module comprise a coding type definition module, an external interface calling module and a coding information storage module, wherein, when the external interface calling module is executed, an invalid agent is removed, and when the coding information storage module is executed, the task GUID, the verification code picture and the coding result are obtained;
the server data update main module is configured to: periodically update the acquired data, since data on the Internet is updated at irregular intervals; updates are divided into light updates and complete updates, wherein a light update only compares URLs (and is faster) and a complete update re-acquires the data of the specified website; the secondary modules of the server data update main module comprise a timer module, a light update module and a complete update module, wherein, when the timer module is executed, the update operation is performed periodically; when the light update module is executed, only URLs are compared; and when the complete update module is executed, the website data is acquired again.
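By way of non-limiting illustration, the version comparison recited in claim 3 may be sketched as follows. The names `TaskConfig`, `resolve_config`, `local_configs` and `download` are hypothetical and do not appear in the claims; refreshing the local copy after a download is likewise an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical container for one task configuration."""
    config_id: str
    version: str
    body: dict


def resolve_config(
    config_id: str,
    required_version: str,
    local_configs: Dict[str, TaskConfig],
    download: Callable[[str], TaskConfig],
    config_pool: Dict[str, TaskConfig],
) -> TaskConfig:
    """Prefer the local copy; download only when the versions differ."""
    local = local_configs.get(config_id)
    if local is not None and local.version == required_version:
        chosen = local
    else:
        # Version mismatch (or no local copy): fetch from the acquisition server.
        chosen = download(config_id)
        # Assumption: the downloaded configuration replaces the local copy.
        local_configs[config_id] = chosen
    # In both branches the chosen configuration ends up in the configuration pool.
    config_pool[config_id] = chosen
    return chosen
```

In this sketch the server is contacted only on a mismatch, which is consistent with the claim's statement that the local configuration file is compared preferentially.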
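The flow control recited in claim 5 may, under stated assumptions, be sketched as a loop over the configured node order. The claims do not specify how exception handling is configured; the mapping `on_error` from a node name to a handler node name, the `(ok, value)` return convention and the function name `run_flow` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical node signature: takes the previous node's return value,
# returns (ok, value).
Node = Callable[[object], Tuple[bool, object]]


def run_flow(
    order: List[str],
    nodes: Dict[str, Node],
    on_error: Dict[str, str],
    initial: object = None,
) -> Tuple[str, object]:
    """Run nodes in the configured order with per-node exception handling.

    Returns ("done", value) once the last node has succeeded, or
    ("failed", value) when a node fails and either no handler is configured
    or the handler itself fails. Either result would go to the flow returner.
    """
    value = initial
    for name in order:
        ok, value = nodes[name](value)
        if not ok:
            handler = on_error.get(name)
            if handler is None:
                return ("failed", value)
            # Exception handling scheduling: run the configured handler node.
            ok, value = nodes[handler](value)
            if not ok:
                return ("failed", value)
        # Otherwise continue with normal sequential scheduling.
    return ("done", value)
```

For example, with nodes `login`, `fetch` and `extract` and `on_error = {"fetch": "retry_fetch"}`, a failed `fetch` is retried once by `retry_fetch` before `extract` runs; without that entry the flow ends at the failed node.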
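The threshold check on the acquisition task pool recited in claim 7 may be sketched as follows. The function name `refill_task_pool` and the callable `fetch_tasks`, which stands in for the acquisition task interface, are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List


def refill_task_pool(
    task_pool: Deque[str],
    threshold: int,
    fetch_tasks: Callable[[], List[str]],
) -> int:
    """Call the task interface only when the pool holds fewer than `threshold` tasks.

    Returns the number of tasks added to the pool.
    """
    if len(task_pool) >= threshold:
        return 0
    # The acquisition server, not the client, decides how many tasks come back.
    new_tasks = fetch_tasks()
    task_pool.extend(new_tasks)
    return len(new_tasks)
```

Because the client only decides when to ask and the server decides how many tasks to return, the sketch passes no count to `fetch_tasks`.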
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010604543.XA CN111753169B (en) | 2020-06-29 | 2020-06-29 | Data acquisition system based on internet |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111753169A (en) | 2020-10-09 |
CN111753169B (en) | 2021-10-19 |
Family
ID=72677950
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010604543.XA (Active; CN111753169B (en)) | Data acquisition system based on internet | 2020-06-29 | 2020-06-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111753169B (en) |
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2503733A1 (en) * | 2009-12-30 | 2012-09-26 | ZTE Corporation | Data collecting method, data collecting apparatus and network management device |
US20120297393A1 (en) * | 2009-12-30 | 2012-11-22 | Zte Corporation | Data Collecting Method, Data Collecting Apparatus and Network Management Device |
CN104683390A (en) * | 2013-11-27 | 2015-06-03 | 上海墨芋电子科技有限公司 | New technology for improving real estate industry resource sharing technology through cloud computing technology |
CN104063756A (en) * | 2014-05-23 | 2014-09-24 | 国网辽宁省电力有限公司本溪供电公司 | Electric power utilization information remote control system |
CN104298550A (en) * | 2014-10-09 | 2015-01-21 | 南通大学 | Hadoop-oriented dynamic scheduling method |
CN104468212A (en) * | 2014-12-03 | 2015-03-25 | 中国科学院计算技术研究所 | Cloud computing data center network intelligent linkage configuration method and system |
CN105447088A (en) * | 2015-11-06 | 2016-03-30 | 杭州掘数科技有限公司 | Volunteer computing based multi-tenant professional cloud crawler |
CN106936660A (en) * | 2015-12-31 | 2017-07-07 | 华为软件技术有限公司 | Collecting method and device |
CN107239558A (en) * | 2017-06-09 | 2017-10-10 | 成都布林特信息技术有限公司 | Common interconnection network collecting method |
CN107239563A (en) * | 2017-06-13 | 2017-10-10 | 成都布林特信息技术有限公司 | Public feelings information dynamic monitoring and controlling method |
CN108304498A (en) * | 2018-01-12 | 2018-07-20 | 深圳壹账通智能科技有限公司 | Webpage data acquiring method, device, computer equipment and storage medium |
CN109299069A (en) * | 2018-09-07 | 2019-02-01 | 安徽恒科信息技术有限公司 | A kind of big data acquisition management platform based on internet data acquisition |
CN110765337A (en) * | 2019-11-15 | 2020-02-07 | 中科院计算技术研究所大数据研究院 | Service providing method based on internet big data |
Non-Patent Citations (1)
Title |
---|
王正宏: "区域医疗数据采集方法优化", 《电子技术与软件工程》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112381503A (en) * | 2020-11-06 | 2021-02-19 | 上海瀚银信息技术有限公司 | Project online optimization management system and method |
CN113132383A (en) * | 2021-04-19 | 2021-07-16 | 烟台中科网络技术研究所 | Network data acquisition method and system |
CN113132383B (en) * | 2021-04-19 | 2022-03-25 | 烟台中科网络技术研究所 | Network data acquisition method and system |
CN114567621A (en) * | 2022-04-29 | 2022-05-31 | 成都瑞华康源科技有限公司 | Client-side adaptive response content control system, method and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111753169B (en) | 2021-10-19 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 2023-10-17; Address after: Rooms 205-37, 2nd Floor, Building 2, No.1 and No.3, Qinglong Hutong A, Dongcheng District, Beijing, 100007; Patentee after: Beijing Zhongfa zhitou Technology Co.,Ltd.; Address before: 100000 floor 21, building a, Chaowai SOHO, No. 6, Chaowai Street, Chaoyang District, Beijing; Patentee before: 3GOLDEN (BEIJING) INFORMATION TECHNOLOGY CO.,LTD. |