CN111753169A - Data acquisition system based on internet - Google Patents

Info

Publication number
CN111753169A
Authority
CN
China
Prior art keywords
module
configuration
acquisition
task
interface
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010604543.XA
Other languages
Chinese (zh)
Other versions
CN111753169B (en)
Inventor
范晓忻
文章
吴广良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongfa Zhitou Technology Co ltd
Original Assignee
3golden Beijing Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 3golden Beijing Information Technology Co ltd
Priority to CN202010604543.XA
Publication of CN111753169A
Application granted
Publication of CN111753169B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols

Abstract

The invention relates to an Internet-based data acquisition system comprising an acquisition server and an acquisition client, wherein the acquisition client comprises a plurality of acquisition components, a plurality of pool components and a plurality of interface components; the acquisition components comprise a configuration parser, a task scheduler, a flow controller, a flow executor and a flow returner; the pool components comprise an acquisition task pool, a configuration pool and an acquisition thread resource pool; the interface components comprise an acquisition task interface, an acquisition configuration interface and a data return interface; wherein at least one of said acquisition components invokes at least one of said pool components via at least one of said interface components. The Internet-based data acquisition system has the advantages of loose coupling, high robustness, fast flow of acquired data and a simple implementation method, and is well suited to fields such as finance and manufacturing.

Description

Data acquisition system based on internet
Technical Field
The invention belongs to the field of collecting and retrieving data from the Internet, and particularly relates to an Internet-based data acquisition system.
Background
A known data acquisition system comprises an acquisition client and a server. When such a system is built, its performance varies considerably with the functions, data flow and main implementation methods chosen for the acquisition system.
Disclosure of Invention
In order to solve the above problems, the invention provides an Internet-based data acquisition system comprising an acquisition server and an acquisition client, wherein the acquisition client comprises a plurality of acquisition components, a plurality of pool components and a plurality of interface components; the acquisition components comprise a configuration parser, a task scheduler, a flow controller, a flow executor and a flow returner; the pool components comprise an acquisition task pool, a configuration pool and an acquisition thread resource pool; the interface components comprise an acquisition task interface, an acquisition configuration interface and a data return interface; wherein at least one of said acquisition components invokes at least one of said pool components via at least one of said interface components.
The Internet-based data acquisition system has the advantages of loose coupling, high robustness, fast flow of acquired data and a simple implementation method, and is well suited to fields such as finance and manufacturing.
Drawings
FIG. 1 is a schematic diagram of a system architecture;
FIG. 2 is a schematic diagram of task execution management;
FIG. 3 is a schematic diagram of acquisition tasks;
FIG. 4 is a schematic diagram of an acquisition configuration;
FIG. 5 is a schematic diagram of a data return configuration;
FIG. 6 is a schematic diagram of an acquisition agent configuration;
FIG. 7 is a schematic diagram of a code printing configuration.
Detailed Description
The system architecture of the internet-based data acquisition system of the present invention can be implemented in various ways, and the following exemplary system architecture should not be taken as a specific limitation on the scope of the present invention. In these embodiments, referring to FIG. 1, the system architecture of the data acquisition system comprises:
a DB storage layer (SDB / PostgreSQL, SDB API, and MySQL);
a service logic layer (task queues, task scheduling, task generators, secondary task creation, data cleaning, agent management, log management, verification code scheduling, acquisition client updating, acquisition client state management, and receiving of returned acquisition data);
user management (acquisition configuration, task management, task monitoring and acquisition client management);
third-party components (ActiveMQ™, Couchbase);
an interface layer (Mina (real-time connection) and WebService (call return)); and
a client layer (a server acquisition point (Python), Windows acquisition client software (Java, C#), and an IE browser plug-in (C#)).
In some embodiments of the internet-based data acquisition system of the present invention, the system comprises an acquisition server and an acquisition client, wherein the acquisition client comprises a plurality of acquisition components, a plurality of pool components and a plurality of interface components; the acquisition components comprise a configuration parser, a task scheduler, a flow controller, a flow executor and a flow returner; the pool components comprise an acquisition task pool, a configuration pool and an acquisition thread resource pool; the interface components comprise an acquisition task interface, an acquisition configuration interface and a data return interface; wherein at least one of said acquisition components invokes at least one of said pool components via at least one of said interface components.
In some embodiments of the present invention, referring to fig. 2, when the collection client executes a collection task, the collection component is configured as follows:
the configuration parser is a configuration parsing main module whose secondary modules comprise a configuration instantiation module, a configuration version comparison module and a configuration pool generation module, wherein the configuration instantiation module is configured to instantiate a configuration on the local disk as a JSON format file (the configuration comprises the parameters required for task execution, such as the request method, request headers, URL template, whether a verification code is required, data extraction rules, flow nodes and the like); the configuration version comparison module is configured to compare the configuration of tasks returned from the acquisition server, and the configuration pool generation module is configured to place the configurations of unexecuted tasks into the configuration pool;
the task scheduler is a task scheduling main module configured to: have an acquisition thread obtain the acquisition configuration and flow configuration of the corresponding task from the configuration pool, put them into the acquisition thread resource pool, and start the flow controller; the secondary modules of the task scheduling main module comprise a scheduling algorithm module, an acquisition thread starting module, a task interface obtaining module, an acquisition thread resource pool generating module, an acquisition thread resource pool clearing module, an acquisition task pool clearing module and a configuration pool clearing module;
the flow controller is a flow controller main module configured to: obtain the flow configuration from the acquisition thread resource pool, instruct the flow executor to execute the flow nodes in the order given in the flow configuration, and, when an execution result is abnormal, handle it according to the exception handling settings in the flow configuration; the secondary modules of the flow controller main module comprise a flow node execution module, a configuration version comparison module and a configuration pool generation module;
the flow executor is a flow executor main module configured to: execute node services and send the execution result to the flow controller, wherein the node services include but are not limited to login, page fetching, standardization and extraction; the secondary modules of the flow executor main module comprise a flow node execution module, an acquisition agent module, a code printing module, a login module, a page fetching module, a standardization module and a data extraction module, wherein the flow node execution module is configured to take the flow node configuration and the return value of the previous flow node as input, execute the code logic defined by the node, and output the return value of the current flow node;
the flow returner is a flow return main module configured to: receive the last-node information sent by the flow controller and send the flow end information to the task scheduler; the secondary modules of the flow return main module comprise a return data generating module and a module for calling the data return interface.
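The way a flow node takes the previous node's return value as input and produces its own return value can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the patented implementation: the node functions, the `NODE_SERVICES` registry, the `run_flow` helper and the shape of the flow configuration (an ordered list of node names) are all hypothetical, and the page content is stubbed instead of fetched.

```python
from typing import Any, Callable, Dict, List

# Hypothetical node services. Each takes (node_config, previous_return_value)
# and returns the current node's return value.
def login(cfg: Dict[str, Any], prev: Any) -> Any:
    return {"session": "session-for-" + cfg["user"]}

def fetch_page(cfg: Dict[str, Any], prev: Any) -> Any:
    # A real node would issue an HTTP request; the page is stubbed here.
    return {"session": prev["session"], "html": "<p> 42 </p>"}

def standardize(cfg: Dict[str, Any], prev: Any) -> Any:
    return {**prev, "html": prev["html"].replace(" ", "")}

def extract(cfg: Dict[str, Any], prev: Any) -> Any:
    html = prev["html"]
    return {"value": html[html.index(">") + 1 : html.rindex("<")]}

NODE_SERVICES: Dict[str, Callable[[Dict[str, Any], Any], Any]] = {
    "login": login,
    "fetch_page": fetch_page,
    "standardize": standardize,
    "extract": extract,
}

def run_flow(flow_config: List[Dict[str, Any]]) -> Any:
    """Controller role: run the executor's nodes in the configured order."""
    result: Any = None
    for node in flow_config:
        service = NODE_SERVICES[node["name"]]
        result = service(node.get("config", {}), result)
    # After the last node, the result would be handed to the flow returner.
    return result

flow = [
    {"name": "login", "config": {"user": "demo"}},
    {"name": "fetch_page"},
    {"name": "standardize"},
    {"name": "extract"},
]
final = run_flow(flow)
```

Keeping the node order in configuration rather than in code is what lets the same executor serve different websites; exception handling is left out of this sketch.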
In some embodiments of the invention, when the configuration parser is executed, it calls the acquisition configuration interface, obtains the task to be configured, and first compares it with the local configuration file: if the local configuration version is the same as the flow configuration version, the local configuration is put into the configuration pool; if the versions differ, the corresponding configuration is downloaded from the acquisition server and then put into the configuration pool. The configuration files include, but are not limited to: acquisition client ID, acquisition client version, maximum task number, task number threshold in the acquisition task pool, maximum number of threads allowed, and acquisition client computer configuration information.
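The local-first version comparison can be sketched in Python as follows. This is a minimal sketch under assumptions: the `resolve_config` function, the `version` field, the one-file-per-configuration layout and the `fake_download` stand-in for the acquisition configuration interface are all hypothetical, and treating a missing local file like a version mismatch is an added assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict

def resolve_config(
    config_id: str,
    required_version: str,
    local_dir: Path,
    download: Callable[[str], dict],
    config_pool: Dict[str, dict],
) -> str:
    """Put the matching configuration into the pool and report its origin."""
    path = local_dir / f"{config_id}.json"
    if path.exists():
        local = json.loads(path.read_text(encoding="utf-8"))
        if local.get("version") == required_version:
            config_pool[config_id] = local
            return "local"
    # Versions differ (or, by assumption, no local copy exists): download
    # from the server and instantiate the configuration as a JSON file.
    fresh = download(config_id)
    path.write_text(json.dumps(fresh), encoding="utf-8")
    config_pool[config_id] = fresh
    return "downloaded"

def fake_download(config_id: str) -> dict:
    # Hypothetical stand-in for the acquisition configuration interface.
    return {"version": "2", "method": "GET",
            "url_template": "https://example.com/{page}"}

pool: Dict[str, dict] = {}
with tempfile.TemporaryDirectory() as tmp:
    first = resolve_config("cfg-1", "2", Path(tmp), fake_download, pool)
    second = resolve_config("cfg-1", "2", Path(tmp), fake_download, pool)
```

The second call finds the file written by the first call and no longer contacts the server, which is the saving the local comparison is meant to provide.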
In some embodiments of the invention, the configuration of the task scheduler includes, but is not limited to, the latest execution time of the task (an absolute value), the task priority, the website collection frequency, and the maximum number of threads allowed by the acquisition client. In most embodiments of the invention a task has a validity period: a task containing session authentication information, for example, usually expires after 30 minutes and is only meaningful if executed within that period. The scheduling algorithm module takes the scheduling factors and the acquisition task pool as input and outputs the ID of the task to be executed; the task interface obtaining module takes the acquisition client ID as input and outputs the queue of tasks to be executed; the acquisition thread resource pool generating module is configured to read the configuration information corresponding to the current acquisition task from the configuration pool and generate the acquisition thread resource pool.
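A scheduling algorithm that respects the latest execution time, the priority and the thread limit might look like the Python sketch below. The text names the scheduling factors but not how they are combined, so the ordering rule here (drop expired tasks, then higher priority first, then earliest deadline) is an assumption, as are the `Task` fields and the `schedule` function; website collection frequency is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Task:
    task_id: str
    latest_execution_time: float  # absolute deadline, seconds since epoch
    priority: int                 # assumption: larger means more urgent

def schedule(task_pool: List[Task], now: float, free_threads: int) -> List[str]:
    """Return the IDs of the tasks to execute next."""
    # A task past its latest execution time is no longer worth running.
    live = [t for t in task_pool if t.latest_execution_time >= now]
    # Assumed ordering: higher priority first, then earliest deadline.
    live.sort(key=lambda t: (-t.priority, t.latest_execution_time))
    return [t.task_id for t in live[:free_threads]]

tasks = [
    Task("a", latest_execution_time=1000.0, priority=1),
    Task("b", latest_execution_time=500.0, priority=1),
    Task("c", latest_execution_time=100.0, priority=9),  # already expired
    Task("d", latest_execution_time=2000.0, priority=5),
]
chosen = schedule(tasks, now=200.0, free_threads=2)
```

Task `c` has the highest priority but is skipped because its deadline has passed, mirroring the session-expiry example in the text.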
In some embodiments of the present invention, the flow controller determines from the returned execution result whether exception handling is required. If it is, exception handling scheduling is executed; if it is not, the flow controller determines whether the current node is the last node. During exception handling scheduling, if the failure is partial, the results of the exception handling scheduling and of the normal sequential scheduling are sent to the flow executor; if the failure is complete, they are sent to the flow returner. When checking for the last node, the flow controller notifies the flow returner if the node is the last one, and otherwise continues with normal sequential scheduling;
the flow node execution module takes the flow exception configuration and the return value of the previous flow node as input, and outputs the name of the next node; the configuration version comparison module compares the task configuration returned from the acquisition server and updates the local task configuration if it has changed; and the configuration pool generation module places the configurations of unexecuted tasks into the configuration pool.
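The controller's routing decision can be written as a small pure function. The Python sketch below is one possible reading, with loudly stated assumptions: the three-valued `Status`, the `next_step` function, and the idea that a partial failure is routed to an exception node named in the flow's exception configuration (falling back to retrying the current node) are hypothetical and not taken from the source.

```python
from enum import Enum
from typing import List, Optional, Tuple

class Status(Enum):
    OK = "ok"
    PARTIAL_FAILURE = "partial_failure"
    TOTAL_FAILURE = "total_failure"

def next_step(
    nodes: List[str],
    current: str,
    status: Status,
    on_error: Optional[str],
) -> Tuple[str, Optional[str]]:
    """Return (target, node): target is 'executor' or 'returner'."""
    if status is Status.TOTAL_FAILURE:
        # Complete failure: hand over to the flow returner.
        return ("returner", None)
    if status is Status.PARTIAL_FAILURE:
        # Assumption: the exception configuration names the node to run next;
        # without one, the current node is retried.
        return ("executor", on_error if on_error is not None else current)
    index = nodes.index(current)
    if index == len(nodes) - 1:
        # Last node finished normally: notify the flow returner.
        return ("returner", None)
    # Normal sequential scheduling: continue with the next node.
    return ("executor", nodes[index + 1])

nodes = ["login", "fetch_page", "extract"]
step_normal = next_step(nodes, "login", Status.OK, None)
step_last = next_step(nodes, "extract", Status.OK, None)
step_partial = next_step(nodes, "fetch_page", Status.PARTIAL_FAILURE, "login")
step_total = next_step(nodes, "fetch_page", Status.TOTAL_FAILURE, "login")
```

Returning the name of the next node matches the stated output of the flow node execution module; everything else about the exception policy would come from the flow configuration in a real system.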
In some embodiments of the present invention, during execution the flow executor runs the program of each flow node; all the service resources required by the program are predefined in the acquisition thread resource pool, and the processing results are written back to the acquisition thread resource pool. The flow nodes of the acquisition client include, but are not limited to: login, page fetching, standardization, and data extraction, wherein each flow node defines a different entry address and executes a different code block.
In some embodiments of the present invention, each time the acquisition client finishes executing an acquisition task, the interface component is configured as follows:
the get acquisition task interface is configured to: check the number of tasks in the acquisition task pool and, when that number falls below a set threshold, call the get acquisition task interface to obtain more tasks; the number of tasks returned is set by the acquisition server;
the pool components are configured as follows:
the acquisition task pool is configured to: store the tasks obtained from the acquisition server that have not yet been collected;
the configuration pool is configured to: store all configurations of the tasks that have not yet been collected, including acquisition configurations and flow configurations;
the acquisition thread resource pool is configured to: store all resources required for acquisition; an acquisition task reads what it needs from the pool and writes its results back into the pool; the acquisition thread resource pool is initialized before each single acquisition task begins.
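The threshold-triggered refill of the task pool can be sketched in a few lines of Python. This is a hedged illustration: `refill_task_pool`, the use of a `deque` as the task pool, and the `fake_get_tasks` stand-in for the server interface are hypothetical names chosen for the example.

```python
from collections import deque
from typing import Callable, Deque, List

def refill_task_pool(
    task_pool: Deque[str],
    threshold: int,
    get_tasks: Callable[[str], List[str]],
    client_id: str,
) -> int:
    """Call the get-task interface only when the pool runs low."""
    if len(task_pool) >= threshold:
        return 0
    # The server, not the client, decides how many tasks come back.
    new_tasks = get_tasks(client_id)
    task_pool.extend(new_tasks)
    return len(new_tasks)

def fake_get_tasks(client_id: str) -> List[str]:
    # Hypothetical stand-in for the server's get-task interface.
    return [f"{client_id}-task-{i}" for i in range(3)]

task_pool: Deque[str] = deque(["t1", "t2"])
added_when_low = refill_task_pool(task_pool, 3, fake_get_tasks, "c1")
added_when_full = refill_task_pool(task_pool, 3, fake_get_tasks, "c1")
```

Checking the threshold after each finished task keeps the client supplied with work without polling the server on every iteration.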
In some embodiments of the present invention, the collection server includes a server task scheduling main module, a server collected return data processing main module, a server agent verification main module, a server code printing service main module, and a plurality of server interface components;
at least one server interface is configured to obtain acquisition tasks; referring to fig. 3, the process is: get acquisition task, parse the interface data (acquisition client ID), read the acquisition client configuration (acquisition client ID, acquisition client version, maximum number of tasks to take, task number threshold in the acquisition task pool, maximum number of threads allowed, acquisition client computer configuration information), schedule tasks (factors influencing scheduling include the absolute time of the task, whether an acquisition client is specified, the priority, and the number of tasks the acquisition client may take), assemble the return data and return through the interface (task format and interface format), and write back the task state;
at least one server interface is configured to obtain configurations; as shown in fig. 4, the process is: get configuration, parse the interface data (configuration ID, task GUID), read the configuration table, assemble the return data and return through the interface (task format and interface format);
at least one server interface is configured for data return; as shown in fig. 5, the process is: data return, parse the interface data, store the collected data, and (if successful) generate subtasks (acquisition configuration, list subtask and detail subtask);
at least one server interface is configured to obtain agents; as shown in fig. 6, the process is: get agent, parse the interface data, take an agent, assemble the return data and return through the interface, and update the agent information;
at least one server interface is configured for code printing; as shown in fig. 7, the process is: code printing request, parse the interface data, call the code printing interface (code printing type), assemble the return data and return through the interface, and record the code printing information (code printing information, time, task id, code printing type, code printing picture, return value).
In some embodiments of the invention, the server task scheduling main module is configured to: exercise overall control over the acquisition tasks of the acquisition clients, scheduling according to the latest execution time of each task, whether the task is bound to an acquisition client, the priority, and the maximum number of tasks an acquisition client may execute; the secondary modules of the server task scheduling main module comprise a scheduling algorithm module, an acquisition client configuration module and a task state write-back module, wherein the acquisition client configuration handled by the acquisition client configuration module includes the number of tasks to return;
the server collected return data processing main module is configured to: receive the result that the acquisition client returns to the acquisition server after collecting data, process that data, and, as the case requires, store it or generate subtasks; the secondary modules of the server collected return data processing main module comprise a collected data analysis module, a subtask generation module, a subtask storage module, a collected data storage module and an abnormal data storage module; the collected data analysis module determines whether the collected result is a success or a failure and, on success, the data type; the subtask generation module creates list subtasks or detail subtasks according to the information in the configuration file; and the abnormal data storage module places the collected data into an abnormal database when storing the collected data fails.
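The server-side handling of returned data (store it, fall back to the abnormal database when storage fails, and otherwise derive list or detail subtasks) can be sketched in Python. The packet layout, the `subtask_type` and `links` fields, and the `process_returned_data` function are assumptions made for illustration only.

```python
from typing import Any, Callable, Dict, List

def process_returned_data(
    packet: Dict[str, Any],
    store: Callable[[Dict[str, Any]], None],
    abnormal_db: List[Dict[str, Any]],
) -> List[Dict[str, str]]:
    """Store returned data and derive subtasks; return the subtasks."""
    if not packet.get("success"):
        return []
    try:
        store(packet["data"])
    except Exception:
        # Storage failed: keep the data in the abnormal database instead.
        abnormal_db.append(packet["data"])
        return []
    # Assumption: the configuration says whether follow-up work is a
    # "list" subtask or a "detail" subtask.
    kind = packet["config"].get("subtask_type")
    if kind not in ("list", "detail"):
        return []
    return [{"type": kind, "url": url} for url in packet["data"].get("links", [])]

def failing_store(data: Dict[str, Any]) -> None:
    raise IOError("database unavailable")

stored: List[Dict[str, Any]] = []
abnormal: List[Dict[str, Any]] = []
ok_packet = {
    "success": True,
    "config": {"subtask_type": "detail"},
    "data": {"links": ["https://example.com/1", "https://example.com/2"]},
}
subtasks = process_returned_data(ok_packet, stored.append, abnormal)
after_failure = process_returned_data(ok_packet, failing_store, abnormal)
```

Subtasks are generated only after the data has been stored successfully, so a storage failure never spawns follow-up work for data that was not saved.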
In some embodiments of the invention, the server agent verification main module is configured to: verify the validity of agents and remove invalid agents. This main module comprises an agent verification module, an agent deletion module and an agent information change module, wherein the agent verification module takes an agent to be verified as input and outputs whether the agent is valid or invalid, the agent deletion module removes invalid agents, and the agent information change module records the use of agents by tasks;
the server code printing service main module is configured to: call an external interface to process the verification code uploaded by the acquisition client; the secondary modules of the code printing service main module comprise a code printing type definition module, an external interface calling module and a code printing information storage module, wherein when the external interface calling module is executed, an invalid agent is removed, and when the code printing information storage module is executed, the task GUID, the verification code picture and the code printing result are obtained;
the server data update main module is configured to: periodically update the collected data, because data on the Internet is updated at irregular intervals; updates are divided into light updates and complete updates, wherein a light update only compares URLs (which is faster) and a complete update re-collects the data of the specified website; the secondary modules of the server data update main module comprise a timer module, a light update module and a complete update module, wherein the timer module triggers update operations at regular intervals, the light update module only compares URLs, and the complete update module re-collects the website data.
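The difference between the two update modes can be made concrete with a short Python sketch. The function names `light_update` and `full_update`, the set of already-known URLs, and the `fetch` callback are hypothetical; the timer that would trigger these functions is left out.

```python
from typing import Callable, Dict, Iterable, List, Set

def light_update(known_urls: Set[str], current_urls: Iterable[str]) -> List[str]:
    """Compare URLs only; return the ones that have not been collected yet."""
    return [u for u in current_urls if u not in known_urls]

def full_update(current_urls: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Re-collect every page of the specified website."""
    return {u: fetch(u) for u in current_urls}

known = {"https://example.com/a"}
current = ["https://example.com/a", "https://example.com/b"]

new_urls = light_update(known, current)
pages = full_update(current, lambda u: "content of " + u)
```

The light update never fetches page content, which is why it is the faster of the two; it cannot, however, detect a page whose content changed under an unchanged URL.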
In some embodiments of the invention, the interface component is configured to:
get tasks
The acquisition client obtains acquisition tasks from the acquisition server through this interface;
parameters sent by the acquisition client: acquisition client ID;
parameters returned by the acquisition server: the number of tasks and the task set;
data return
The data collected by the acquisition client is returned to the acquisition server through this interface;
parameters sent by the acquisition client: task GUID, return identification, data packet (JSON), source file name, file type;
get configuration
All configuration information is obtained through this interface, including login configuration and acquisition configuration (the acquisition configuration includes the flow configuration);
parameters sent by the acquisition client: configuration GUID, configuration type (login or acquisition);
parameters returned by the acquisition server: configuration GUID, configuration content and configuration type;
get agent
parameters sent by the acquisition client: acquisition client ID and task GUID;
parameters returned by the acquisition server: agent type, agent IP, port, user name and password;
code printing request
parameters sent by the acquisition client: task ID, code printing type and code printing picture;
parameters returned by the acquisition server: task ID, code value.
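The parameter lists above map naturally onto typed message records. The Python sketch below models two of them as dataclasses and round-trips one through JSON; the class names, field names and field types are assumptions chosen to mirror the listed parameters, and the wire format of the real interfaces is not specified in the source.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict

@dataclass
class DataReturnRequest:
    task_guid: str
    return_flag: str              # the "return identification" parameter
    data_packet: Dict[str, Any]   # JSON data packet
    source_file_name: str
    file_type: str

@dataclass
class AgentResponse:
    agent_type: str
    agent_ip: str
    port: int
    user_name: str
    password: str

request = DataReturnRequest(
    task_guid="0000-demo",
    return_flag="success",
    data_packet={"title": "example"},
    source_file_name="page.html",
    file_type="html",
)

# Serialize for transport and rebuild the record on the receiving side.
wire = json.dumps(asdict(request))
decoded = DataReturnRequest(**json.loads(wire))
```

Declaring the messages once and sharing the definitions between client and server is one way to keep the two sides of each interface in step.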
Implementations and functional operations of the subject matter described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware, including the structures disclosed in this specification and their structural equivalents, or combinations of more than one of the foregoing. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on one or more tangible, non-transitory program carriers, for execution by, or to control the operation of, data processing apparatus.
Alternatively or in addition, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution with a data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of the foregoing.
A computer program (which may also be referred to or described as a program, software application, module, software module, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in: in a markup language document; in a single file dedicated to the relevant program; or in multiple coordinated files, such as files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having: a display device, for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to a user; and a keyboard and a pointing device, such as a mouse or trackball, by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with the user; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. In addition, the computer may interact with the user by sending documents to a device used by the user and receiving documents from the device; for example, by sending a web page to a web browser on the user's acquisition client device in response to a request received from the web browser.
Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., a data acquisition server, or that includes a middleware component, e.g., an application acquisition server, or that includes a front-end component, e.g., an acquisition client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components in the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), e.g., the Internet. The computing system may include an acquisition client and an acquisition server. The acquisition client and acquisition server are typically remote from each other and typically interact through a communication network. The relationship between the acquisition client and the acquisition server arises by virtue of computer programs running on the respective computers and having an acquisition client-acquisition server relationship to each other.

Claims (10)

1. A data acquisition system based on the Internet, comprising an acquisition server and an acquisition client, characterized in that the acquisition client comprises a plurality of acquisition components, a plurality of pool components and a plurality of interface components; the acquisition components comprise a configuration parser, a task scheduler, a flow controller, a flow executor and a flow returner; the pool components comprise an acquisition task pool, a configuration pool and an acquisition thread resource pool; the interface components comprise an acquisition task interface, an acquisition configuration interface and a data return interface; wherein at least one of said acquisition components invokes at least one of said pool components via at least one of said interface components.
2. The system of claim 1, wherein the acquisition components are configured as follows when the acquisition client performs an acquisition task:
the configuration parser is a configuration parsing main module whose secondary modules comprise a configuration instantiation module, a configuration version comparison module and a configuration pool generation module, wherein the configuration instantiation module is configured to persist a configuration to the local disk as a JSON-format file; the configuration version comparison module is configured to compare the configurations of tasks returned from the acquisition server; and the configuration pool generation module is configured to place the configurations of unexecuted tasks into a configuration pool;
the task scheduler is a task scheduling main module configured to: obtain, on behalf of an acquisition thread, the acquisition configuration and flow configuration of the corresponding task from the configuration pool, put them into the acquisition thread resource pool, and start the flow controller; the secondary modules of the task scheduling main module comprise a scheduling algorithm module, an acquisition thread starting module, a task acquisition interface module, an acquisition thread resource pool generating module, an acquisition thread resource pool clearing module, an acquisition task pool clearing module and a configuration pool clearing module;
the flow controller is a flow controller main module configured to: obtain the flow configuration from the acquisition thread resource pool, instruct the flow executor to execute the flow nodes in the order given in the flow configuration, and, when an execution result is abnormal, handle it according to the exception handling settings in the flow configuration; the secondary modules of the flow controller main module comprise a flow node execution module, a configuration version comparison module and a configuration pool generation module;
the flow executor is a flow executor main module configured to: execute node services and send the execution result to the flow controller, wherein the node services include but are not limited to login, page fetching, standardization and data extraction; the secondary modules of the flow executor main module comprise a flow node execution module, an acquisition agent module, a CAPTCHA-solving ("code printing") module, a login module, a page fetching module, a standardization module and a data extraction module, wherein the flow node execution module is configured to take the flow node configuration and the return value of the previous flow node as input, execute the code logic defined by the node, and output the return value of the current flow node;
the flow returner is a flow return main module configured to: receive the last-node information sent by the flow controller and send flow-end information to the task scheduler; the secondary modules of the flow return main module comprise a return data generating module and a module for calling the data return interface.
3. The system of claim 1, wherein the configuration parser, when executed: calls the acquisition configuration interface to obtain the tasks to be configured and first compares each task with the local configuration file; if the local configuration version is the same as the flow configuration version, the local configuration is put into the configuration pool, and if the versions differ, the corresponding configuration is downloaded from the acquisition server and then put into the configuration pool; the configuration files include, but are not limited to: acquisition client ID, acquisition client version, maximum number of tasks, task count threshold in the acquisition task pool, maximum number of threads allowed, and acquisition client computer configuration information.
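The version check described in claim 3 can be sketched as follows. This is only an illustration under stated assumptions: the function name `load_task_config`, the dictionary keys (`task_id`, `config_id`, `config_version`, `version`), the one-JSON-file-per-configuration layout and the `download` callback are all hypothetical and do not come from the patent.

```python
import json
from pathlib import Path
from typing import Callable


def load_task_config(task: dict, local_dir: Path,
                     download: Callable[[str], dict],
                     config_pool: dict) -> dict:
    """Put the task's configuration into the pool, preferring the local copy."""
    # Assumption: each configuration is stored locally as <config_id>.json.
    path = local_dir / f"{task['config_id']}.json"
    config = None
    if path.exists():
        local = json.loads(path.read_text(encoding="utf-8"))
        if local.get("version") == task["config_version"]:
            config = local  # versions match: reuse the local configuration
    if config is None:
        # Missing or outdated locally: download from the acquisition server
        # (stand-in callback) and refresh the local JSON file.
        config = download(task["config_id"])
        path.write_text(json.dumps(config), encoding="utf-8")
    config_pool[task["task_id"]] = config
    return config
```

With this shape, a second task that references the same configuration version is served from disk without contacting the server.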
4. The system of claim 1, wherein the configuration of the task scheduler includes, but is not limited to: the latest execution time of the task, as an absolute value (in most embodiments of the invention a task has an expiration time; for example, a task containing session authentication information typically expires after 30 minutes and is only meaningful if executed within its validity period), the task priority, the website collection frequency, and the maximum number of threads allowed by the acquisition client; the scheduling algorithm module is configured to take the scheduling factors and the acquisition task pool as input and to output the ID of the task to be executed; the task acquisition interface module is configured to take the acquisition client ID as input and to output the queue of tasks to be executed; and the acquisition thread resource pool generating module is configured to read the configuration information corresponding to the current acquisition task from the configuration pool and generate the acquisition thread resource pool.
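A minimal sketch of one way a scheduling algorithm module could combine these factors is shown below. The claim names the inputs (latest execution time, priority, maximum thread count) but not the rule, so the ordering used here (drop expired tasks, then highest priority, then earliest deadline) is an assumption, as are the `Task` class and `pick_next_task`. The website collection frequency is not modeled.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    latest_exec_time: float  # absolute deadline, seconds since the epoch
    priority: int            # assumption: a larger value is more urgent


def pick_next_task(task_pool: list, running_threads: int,
                   max_threads: int,
                   now: Optional[float] = None) -> Optional[str]:
    """Return the ID of the next task to run, or None if nothing can run."""
    now = time.time() if now is None else now
    if running_threads >= max_threads:
        return None  # the client is already at its thread limit
    # A task past its latest execution time is no longer meaningful.
    live = [t for t in task_pool if t.latest_exec_time > now]
    if not live:
        return None
    # Assumed ordering: highest priority first, earliest deadline breaks ties.
    best = min(live, key=lambda t: (-t.priority, t.latest_exec_time))
    return best.task_id
```

Passing `now` explicitly keeps the function deterministic, which makes the expiry rule easy to check in isolation.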
5. The system of claim 1, wherein the flow controller is configured to determine, from the returned execution result, whether exception handling is required, to execute exception handling scheduling if it is required, and otherwise to determine whether the node is the last node; during exception handling scheduling, if the failure is partial, the result of the exception handling scheduling, like that of normal sequential scheduling, is sent to the flow executor, and if the failure is complete, it is sent to the flow returner; when determining whether the node is the last node, the flow returner is notified if it is the last node, and normal sequential scheduling is carried out if it is not;
the flow node execution module, when executed, takes the flow exception configuration and the return value of the previous flow node as input and outputs the name of the next node; the configuration version comparison module, when executed, compares the task configuration returned from the acquisition server and updates the task configuration if it has changed; and the configuration pool generation module, when executed, places the configurations of unexecuted tasks into the configuration pool.
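The control loop of claim 5 can be illustrated with a short sketch: nodes run in the configured order, each receives the previous node's return value, and a per-node exception setting decides what happens on failure. The policy names `"retry"`, `"skip"` and `"abort"`, the `run_flow` function and the shape of the returned dictionary are hypothetical; the patent does not specify the exception handling options.

```python
from typing import Any, Callable

Node = Callable[[Any], Any]


def run_flow(order: list, nodes: dict, on_error: dict,
             max_retries: int = 1) -> dict:
    """Run nodes in the configured order, consulting on_error when one fails."""
    value: Any = None
    for name in order:
        attempts = 0
        while True:
            try:
                # Each node receives the return value of the previous node.
                value = nodes[name](value)
                break
            except Exception as exc:
                policy = on_error.get(name, "abort")  # assumed default
                if policy == "retry" and attempts < max_retries:
                    attempts += 1
                    continue  # partial failure: hand the node back for a rerun
                if policy == "skip":
                    break  # keep the previous value and move to the next node
                # Complete failure: stop and report, as a flow returner would.
                return {"ok": False, "failed_node": name, "error": str(exc)}
    return {"ok": True, "result": value}
```

A flow such as login, page fetching and extraction would then be expressed as `order=["login", "fetch", "extract"]` with one callable per node name.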
6. The system according to claim 1, wherein, during execution, the flow executor runs the program of each flow node, all business resources required by the program are predefined in the acquisition thread resource pool, and processing results are written to the acquisition thread resource pool; the flow nodes included in the acquisition client include, but are not limited to: login, page acquisition, standardization and data extraction, wherein each flow node defines a different entry address and executes a different code block.
7. The system of claim 1, wherein, each time the acquisition client performs an acquisition task, the interface components are configured as follows:
the acquisition task obtaining interface is configured to: check the number of tasks in the acquisition task pool and, when that number is less than a given threshold, call the interface to obtain acquisition tasks; the number of tasks returned is set by the acquisition server;
the pool components are configured as follows:
the acquisition task pool is configured to: store tasks obtained from the acquisition server that have not yet been collected;
the configuration pool is configured to: store all configurations of the tasks that have not yet been collected, including acquisition configurations and flow configurations;
the acquisition thread resource pool is configured to: store all resources required for acquisition, such that an acquisition task reads the contents it needs from the pool and writes its results back to the pool; the acquisition thread resource pool is initialized before a single acquisition task begins.
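The threshold rule for the task pool in claim 7 is simple enough to sketch directly. The function name `refill_task_pool` and the `fetch_tasks` callback, which stands in for the call to the acquisition server, are hypothetical.

```python
from collections import deque
from typing import Callable


def refill_task_pool(pool: deque, threshold: int,
                     fetch_tasks: Callable[[], list]) -> int:
    """Top up the pool from the server when it runs low; return tasks added."""
    if len(pool) >= threshold:
        return 0  # enough pending tasks, no call to the server
    # The server, not the client, decides how many tasks come back.
    new_tasks = fetch_tasks()
    pool.extend(new_tasks)
    return len(new_tasks)
```

Calling this before each scheduling round keeps the client supplied with work without polling the server on every task.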
8. The system of claim 1, wherein the collection server comprises a server task scheduling main module, a server collected return data processing main module, a server agent verification main module, a server coding service main module and a plurality of server interface components;
the at least one server interface is configured to provide acquisition tasks; referring to fig. 3, the process is: obtain acquisition task, parse the interface data (acquisition client ID), read the acquisition client configuration (namely the acquisition client ID, acquisition client version, maximum number of tasks to fetch, task count threshold in the task pool, maximum number of threads allowed, and acquisition client computer configuration information), schedule tasks (factors affecting scheduling include the absolute task time, whether an acquisition client is specified, the priority, and the number of tasks the acquisition client fetches), assemble the return data and return through the interface (task format and interface format), and write back the task state;
the at least one server interface is configured to provide configurations; as shown in fig. 4, the process is: obtain configuration, parse the interface data (configuration ID, task GUID), read the configuration table, assemble the return data and return through the interface (task format and interface format);
the at least one server interface is configured to receive returned data; as shown in fig. 5, the process is: data return, parse the interface data, store the collected data, and, if storage succeeds, generate subtasks (acquisition configuration, list subtask and detail subtask);
the at least one server interface is configured to provide agents; as shown in fig. 6, the process is: obtain agent, parse the interface data, take an agent, assemble the return data and return through the interface, and update the agent information;
the at least one server interface is configured to provide CAPTCHA solving ("code printing"); as shown in fig. 7, the process is: CAPTCHA solving, parse the interface data, call the CAPTCHA-solving interface (by CAPTCHA type), assemble the return data and return through the interface, and record the CAPTCHA-solving information (solving information, time, task ID, CAPTCHA type, CAPTCHA picture, returned value).
9. The system of claim 1, wherein the server task scheduling main module is configured to: exercise overall control over the acquisition tasks of the acquisition clients, scheduling according to the latest execution time of each task, whether an acquisition client is bound, the priority, and the maximum number of tasks an acquisition client may execute; the secondary modules of the server task scheduling main module comprise a scheduling algorithm module, an acquisition client configuration module and a task state write-back module, wherein the acquisition client configuration, when applied, includes the number of tasks to return;
the server collected return data processing main module is configured such that: after collecting data, the acquisition client returns the result to the acquisition server, and the acquisition server, on receiving the data, processes it in this module, storing the data or generating subtasks as appropriate; the secondary modules of the server collected return data processing main module comprise a collected data analysis module, a subtask generation module, a subtask storage module, a collected data storage module and an abnormal data storage module; when the collected data analysis module is executed, the collected result data indicates success or failure and, on success, the data type; when the subtask generation module is executed, the subtask is a list task or a detail task according to the information in the configuration file; and when the abnormal data storage module is executed, collected data whose storage failed is placed into an abnormal database.
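The server-side handling of returned data can be sketched as below. The result fields (`success`, `data_type`, `data`, `detail_urls`), the `store` callback and the rule that a list-type result spawns detail subtasks are assumptions made for illustration; the claim only states that subtasks are list or detail tasks according to the configuration file, and that data whose storage fails goes to an abnormal database.

```python
from typing import Callable


def handle_returned_data(result: dict, store: Callable[[dict], None],
                         abnormal_db: list, subtask_queue: list) -> str:
    """Store returned data, divert storage failures, and queue subtasks."""
    if not result.get("success"):
        return "failed"  # the client reported that collection failed
    try:
        store(result["data"])
    except Exception:
        # Storage failed: keep the data in the abnormal database instead.
        abnormal_db.append(result["data"])
        return "abnormal"
    # Assumption: a list page yields one detail subtask per linked URL.
    if result.get("data_type") == "list":
        for url in result["data"].get("detail_urls", []):
            subtask_queue.append({"type": "detail", "url": url})
    return "stored"
```

Subtasks are generated only after storage succeeds, matching the "if successful" condition on the data return interface.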
10. The system of claim 1, wherein the server agent verification main module is configured to verify the validity of agents (proxies) and remove invalid agents; the agent maintenance main module comprises an agent verification module, an agent deletion module and an agent information change module, wherein the agent verification module, when executed, takes an agent to be verified as input and outputs whether the agent is valid or invalid; the agent deletion module, when executed, removes invalid agents; and the agent information change module, when executed, records the use of agents by tasks;
the server coding service main module is configured such that: the CAPTCHA-solving ("code printing") service calls an external interface to process the verification code uploaded by the acquisition client; the secondary modules of the CAPTCHA-solving service main module comprise a CAPTCHA type definition module, an external interface calling module and a CAPTCHA-solving information storage module, wherein, when the external interface calling module is executed, an invalid agent is removed, and when the CAPTCHA-solving information storage module is executed, the task GUID, the verification code picture and the solving result are recorded;
the server data update main module is configured to: periodically update the collected data, because data on the Internet is updated at irregular intervals; updates are divided into light updates and full updates, wherein a light update only compares URLs (which is faster) and a full update re-collects the data of the specified website; the secondary modules of the server data update main module comprise a timer module, a light update module and a full update module, wherein the timer module, when executed, triggers the update operation periodically, the light update module, when executed, only compares URLs, and the full update module, when executed, collects the website data again.
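The difference between the URL-only update and the complete re-collection in the data update module can be shown in a few lines. The function names `light_update` and `full_update` and the `collect` callback are hypothetical, and the timer that would trigger them periodically is omitted.

```python
from typing import Callable, Iterable


def light_update(known_urls: set, listed_urls: Iterable) -> list:
    """URL-only update: return listed URLs that have not been collected yet."""
    # No page content is fetched or compared, which is why this is faster.
    return [u for u in listed_urls if u not in known_urls]


def full_update(site_urls: Iterable, collect: Callable[[str], dict]) -> dict:
    """Complete update: collect every page of the specified website again."""
    return {u: collect(u) for u in site_urls}
```

A light update therefore misses pages whose content changed under an unchanged URL, which is the case the complete update covers.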
CN202010604543.XA 2020-06-29 2020-06-29 Data acquisition system based on internet Active CN111753169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010604543.XA CN111753169B (en) 2020-06-29 2020-06-29 Data acquisition system based on internet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010604543.XA CN111753169B (en) 2020-06-29 2020-06-29 Data acquisition system based on internet

Publications (2)

Publication Number Publication Date
CN111753169A CN111753169A (en) 2020-10-09
CN111753169B CN111753169B (en) 2021-10-19

Family

ID=72677950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010604543.XA Active CN111753169B (en) 2020-06-29 2020-06-29 Data acquisition system based on internet

Country Status (1)

Country Link
CN (1) CN111753169B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381503A (en) * 2020-11-06 2021-02-19 上海瀚银信息技术有限公司 Project online optimization management system and method
CN113132383A (en) * 2021-04-19 2021-07-16 烟台中科网络技术研究所 Network data acquisition method and system
CN114567621A (en) * 2022-04-29 2022-05-31 成都瑞华康源科技有限公司 Client-side adaptive response content control system, method and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2503733A1 (en) * 2009-12-30 2012-09-26 ZTE Corporation Data collecting method, data collecting apparatus and network management device
CN104063756A (en) * 2014-05-23 2014-09-24 国网辽宁省电力有限公司本溪供电公司 Electric power utilization information remote control system
CN104298550A (en) * 2014-10-09 2015-01-21 南通大学 Hadoop-oriented dynamic scheduling method
CN104468212A (en) * 2014-12-03 2015-03-25 中国科学院计算技术研究所 Cloud computing data center network intelligent linkage configuration method and system
CN104683390A (en) * 2013-11-27 2015-06-03 上海墨芋电子科技有限公司 New technology for improving real estate industry resource sharing technology through cloud computing technology
CN105447088A (en) * 2015-11-06 2016-03-30 杭州掘数科技有限公司 Volunteer computing based multi-tenant professional cloud crawler
CN106936660A (en) * 2015-12-31 2017-07-07 华为软件技术有限公司 Collecting method and device
CN107239558A (en) * 2017-06-09 2017-10-10 成都布林特信息技术有限公司 Common interconnection network collecting method
CN107239563A (en) * 2017-06-13 2017-10-10 成都布林特信息技术有限公司 Public feelings information dynamic monitoring and controlling method
CN108304498A (en) * 2018-01-12 2018-07-20 深圳壹账通智能科技有限公司 Webpage data acquiring method, device, computer equipment and storage medium
CN109299069A (en) * 2018-09-07 2019-02-01 安徽恒科信息技术有限公司 A kind of big data acquisition management platform based on internet data acquisition
CN110765337A (en) * 2019-11-15 2020-02-07 中科院计算技术研究所大数据研究院 Service providing method based on internet big data

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2503733A1 (en) * 2009-12-30 2012-09-26 ZTE Corporation Data collecting method, data collecting apparatus and network management device
US20120297393A1 (en) * 2009-12-30 2012-11-22 Zte Corporation Data Collecting Method, Data Collecting Apparatus and Network Management Device
CN104683390A (en) * 2013-11-27 2015-06-03 上海墨芋电子科技有限公司 New technology for improving real estate industry resource sharing technology through cloud computing technology
CN104063756A (en) * 2014-05-23 2014-09-24 国网辽宁省电力有限公司本溪供电公司 Electric power utilization information remote control system
CN104298550A (en) * 2014-10-09 2015-01-21 南通大学 Hadoop-oriented dynamic scheduling method
CN104468212A (en) * 2014-12-03 2015-03-25 中国科学院计算技术研究所 Cloud computing data center network intelligent linkage configuration method and system
CN105447088A (en) * 2015-11-06 2016-03-30 杭州掘数科技有限公司 Volunteer computing based multi-tenant professional cloud crawler
CN106936660A (en) * 2015-12-31 2017-07-07 华为软件技术有限公司 Collecting method and device
CN107239558A (en) * 2017-06-09 2017-10-10 成都布林特信息技术有限公司 Common interconnection network collecting method
CN107239563A (en) * 2017-06-13 2017-10-10 成都布林特信息技术有限公司 Public feelings information dynamic monitoring and controlling method
CN108304498A (en) * 2018-01-12 2018-07-20 深圳壹账通智能科技有限公司 Webpage data acquiring method, device, computer equipment and storage medium
CN109299069A (en) * 2018-09-07 2019-02-01 安徽恒科信息技术有限公司 A kind of big data acquisition management platform based on internet data acquisition
CN110765337A (en) * 2019-11-15 2020-02-07 中科院计算技术研究所大数据研究院 Service providing method based on internet big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王正宏: "区域医疗数据采集方法优化", 《电子技术与软件工程》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381503A (en) * 2020-11-06 2021-02-19 上海瀚银信息技术有限公司 Project online optimization management system and method
CN113132383A (en) * 2021-04-19 2021-07-16 烟台中科网络技术研究所 Network data acquisition method and system
CN113132383B (en) * 2021-04-19 2022-03-25 烟台中科网络技术研究所 Network data acquisition method and system
CN114567621A (en) * 2022-04-29 2022-05-31 成都瑞华康源科技有限公司 Client-side adaptive response content control system, method and storage medium

Also Published As

Publication number Publication date
CN111753169B (en) 2021-10-19

Similar Documents

Publication Publication Date Title
JP7197675B2 (en) System and method for real-time processing of data streams
CN111753169B (en) Data acquisition system based on internet
US10761913B2 (en) System and method for real-time asynchronous multitenant gateway security
US8938533B1 (en) Automatic capture of diagnostic data based on transaction behavior learning
US8108234B2 (en) System and method for deriving business processes
US11734008B1 (en) Reusable sets of instructions for responding to incidents in information technology environments
US10552293B2 (en) Logging as a service
US20230168955A1 (en) Method and system for processing a stream of incoming messages sent from a specific input message source and validating each incoming message of that stream before sending them to a specific target system
US11030384B2 (en) Identification of sequential browsing operations
US20150088772A1 (en) Enhancing it service management ontology using crowdsourcing
US11294740B2 (en) Event to serverless function workflow instance mapping mechanism
US10372572B1 (en) Prediction model testing framework
CN112416708B (en) Asynchronous call link monitoring method and system
US20230259647A1 (en) Systems and methods for automated discovery and analysis of privileged access across multiple computing platforms
US8984124B2 (en) System and method for adaptive data monitoring
CN112084179A (en) Data processing method, device, equipment and storage medium
US20220398239A1 (en) Intelligent support bundle collection
US11897527B2 (en) Automated positive train control event data extraction and analysis engine and method therefor
US20230229659A1 (en) Estimating query execution performance using a sampled counter
CN116233101A (en) Data acquisition task framework based on HTTP interface hot deployment and use method
US11829283B2 (en) REST Api validation
US10698749B1 (en) System and a method for automated resolution of configuration item issues
CN114172749B (en) Test paper downloading method, device, equipment and storage medium
US20230138805A1 (en) System and Method For Telemetry Data Based Event Occurrence Analysis With Rule Engine
US20230091903A1 (en) Iterative generation of hypertext transfer protocol traffic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231017

Address after: Rooms 205-37, 2nd Floor, Building 2, No.1 and No.3, Qinglong Hutong A, Dongcheng District, Beijing, 100007

Patentee after: Beijing Zhongfa zhitou Technology Co.,Ltd.

Address before: 100000 floor 21, building a, Chaowai SOHO, No. 6, Chaowai Street, Chaoyang District, Beijing

Patentee before: 3GOLDEN (BEIJING) INFORMATION TECHNOLOGY CO.,LTD.
