CN112307046A

CN112307046A - Data acquisition method and device, computer readable storage medium and electronic equipment

Info

Publication number: CN112307046A
Application number: CN202011345192.1A
Authority: CN
Inventors: 徐鸣辉
Original assignee: Beijing Jindi Credit Service Co ltd
Current assignee: Beijing Jindi Credit Service Co ltd
Priority date: 2020-11-26
Filing date: 2020-11-26
Publication date: 2021-02-02

Abstract

The embodiment of the disclosure discloses a data acquisition method and device, a computer readable storage medium and an electronic device, wherein the method comprises the following steps: determining at least one task to be processed which needs to be updated, and storing the at least one task to be processed into at least one cache queue; acquiring n to-be-processed tasks from the at least one cache queue, and processing the acquired n to-be-processed tasks to obtain updated data; wherein n is an integer greater than or equal to 1; analyzing the obtained updated data to obtain structured data and storing the structured data; by processing the n to-be-processed tasks simultaneously, the data acquisition efficiency is improved.

Description

Data acquisition method and device, computer readable storage medium and electronic equipment

Technical Field

The present disclosure relates to data acquisition technologies, and in particular, to a data acquisition method and apparatus, a computer-readable storage medium, and an electronic device.

Background

Architecture design is important when capturing data at large scale. Existing architectures are either too specialized or too general, and they raise many further problems, such as handling anti-crawling strategies, designing proxy usage, and filtering repeated captures. Open-source distributed crawlers are also common, but they are hard to customize; in particular, their cluster management and their use of server resources are weak, so they cannot readily meet changing crawling requirements or support the processing of large amounts of data.

Disclosure of Invention

The present disclosure is proposed to solve the above technical problems. The embodiment of the disclosure provides a data acquisition method and device, a computer-readable storage medium and electronic equipment.

According to an aspect of an embodiment of the present disclosure, there is provided a data acquisition method including:

determining at least one task to be processed which needs to be updated, and storing the at least one task to be processed into at least one cache queue;

acquiring n to-be-processed tasks from the at least one cache queue, and processing the acquired n to-be-processed tasks to obtain updated data; wherein n is an integer greater than or equal to 1;

and analyzing the obtained updated data to obtain structured data and storing the structured data.

Optionally, the determining at least one to-be-processed task that needs to be updated, and storing the at least one to-be-processed task in at least one buffer queue includes:

determining at least one task meeting an updating condition from a task set as the task to be processed; wherein the task set comprises a plurality of tasks;

distributing the at least one task to be processed to the at least one buffer queue according to the task type of the at least one task to be processed; and each buffer queue corresponds to one task type.

Optionally, the update condition includes at least one of:

in response to receiving an external update request to update the at least one task;

in response to the time difference between the current time and the historical update time of the task reaching a preset time difference.

Optionally, each task to be processed corresponds to a different preset priority;

the allocating the at least one task to be processed to the at least one buffer queue according to the task type of the at least one task to be processed includes:

sequencing the at least one task to be processed belonging to the same task type according to the preset priority;

and storing the at least one task to be processed into the cache queue corresponding to the task type according to the sorting sequence.

Optionally, the obtaining n to-be-processed tasks from the at least one buffer queue, and processing the obtained n to-be-processed tasks to obtain update data includes:

synchronously acquiring n tasks to be processed;

acquiring n proxy services from a proxy pool based on the n to-be-processed tasks; wherein the proxy pool at least comprises n proxy services;

and respectively acquiring the updating data corresponding to the n tasks to be processed from different paths based on the n proxy services.

Optionally, the synchronously acquiring n to-be-processed tasks includes:

n working units are established in a working pool, and the n working units are scanned at intervals of set time so that the n working units in the working pool can work normally;

and simultaneously acquiring n tasks to be processed through n working units in the working pool.

Optionally, the obtaining, based on the n proxy services, update data corresponding to the n to-be-processed tasks from different paths respectively includes:

determining at least one keyword based on the n tasks to be processed; each task to be processed corresponds to at least one keyword;

and each proxy service of the n proxy services respectively acquires update data based on the at least one keyword corresponding to the task to be processed.

Optionally, before obtaining the update data corresponding to the n to-be-processed tasks from different paths based on the n proxy services, the method further includes:

acquiring the validity period of each proxy service in the n proxy services;

determining whether each of the n proxy services is valid based on the validity period;

in response to each of the n proxy services being valid, obtaining update data with the n proxy services;

and in response to at least one proxy service of the n proxy services being invalid, acquiring a new proxy service to replace the invalid proxy service, so that the update data is acquired through the updated n proxy services.

Optionally, before performing parsing processing on the obtained update data to obtain structured data and storing the structured data, the method further includes:

storing the updated data into a first database;

analyzing the obtained updated data to obtain structured data and storing the structured data, wherein the analyzing comprises the following steps:

and carrying out structuralization processing on the updated data in the first database, and storing the obtained structuralization data into a second database.

Optionally, the method further comprises:

obtaining an updating result of the at least one task to be processed; wherein the updating result comprises updating success, updating failure and updating exception;

and counting the times of successful updating, the times of failed updating and the times of abnormal updating, and storing the counting result.

Optionally, the method further comprises:

and displaying the statistical result, and sending out alarm information when the abnormal updating times in the statistical result exceed a set proportion in the number of the tasks to be processed.

According to another aspect of the embodiments of the present disclosure, there is provided a data acquisition apparatus including:

the scheduling module is used for determining at least one task to be processed which needs to be updated and storing the at least one task to be processed into at least one cache queue;

the acquisition module is used for acquiring n to-be-processed tasks from the at least one cache queue and processing the acquired n to-be-processed tasks to obtain updated data; wherein n is an integer greater than or equal to 1;

and the analysis module is used for analyzing the obtained updated data to obtain structured data and storing the structured data.

Optionally, the scheduling module, the acquisition module and the analysis module are each executed in at least one container;

further comprising:

and the cluster management module is used for managing the containers.

Optionally, the scheduling module is specifically configured to determine, from a task set, at least one task that meets an update condition as the to-be-processed task; wherein the task set comprises a plurality of tasks; distributing the at least one task to be processed to the at least one buffer queue according to the task type of the at least one task to be processed; and each buffer queue corresponds to one task type.

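Optionally, the update condition includes at least one of:

in response to receiving an external update request to update the at least one task;

in response to the time difference between the current time and the historical update time of the task reaching a preset time difference.

Optionally, each task to be processed corresponds to a different preset priority;

the scheduling module is configured to, when allocating the at least one task to be processed to the at least one cache queue according to the task type of the at least one task to be processed, sort the at least one task to be processed belonging to the same task type according to the preset priority, and store the at least one task to be processed into the cache queue corresponding to the task type according to the sorted order.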

Optionally, the acquisition module is specifically configured to acquire n to-be-processed tasks synchronously; acquiring n proxy services from a proxy pool based on the n to-be-processed tasks; wherein the proxy pool at least comprises n proxy services; and respectively acquiring the updating data corresponding to the n tasks to be processed from different paths based on the n proxy services.

Optionally, the acquisition module is configured to establish n working units in a working pool when n to-be-processed tasks are synchronously acquired, and scan the n working units at set intervals, so that the n working units in the working pool can work normally; and simultaneously acquiring n tasks to be processed through n working units in the working pool.

Optionally, the acquisition module is configured to, when acquiring the update data corresponding to the n to-be-processed tasks from different paths based on the n proxy services, determine at least one keyword based on the n to-be-processed tasks, each task to be processed corresponding to at least one keyword; each proxy service of the n proxy services then acquires update data based on the at least one keyword corresponding to the task to be processed.

Optionally, the acquisition module is further configured to acquire the validity period of each proxy service of the n proxy services; determine whether each of the n proxy services is valid based on the validity period; in response to each of the n proxy services being valid, obtain update data with the n proxy services; and in response to at least one proxy service of the n proxy services being invalid, acquire a new proxy service to replace the invalid proxy service, so that the update data is acquired through the updated n proxy services.

Optionally, the apparatus further comprises:

the storage module is used for storing the updated data into a first database and storing the structured data obtained by the analysis module into a second database;

the analysis module is specifically configured to perform structured processing on the updated data in the first database to obtain the structured data.

Optionally, the apparatus further comprises:

the statistical module is used for acquiring an updating result of the at least one task to be processed; wherein the updating result comprises updating success, updating failure and updating exception; and counting the times of successful updating, the times of failed updating and the times of abnormal updating, and storing the counting result.

Optionally, the apparatus further comprises:

and the monitoring module is used for displaying the statistical result and sending out alarm information when the abnormal updating times in the statistical result exceed a set proportion in the number of the tasks to be processed.

According to yet another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the data acquisition method of any of the embodiments.

According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:

a processor;

a memory for storing the processor-executable instructions;

the processor is configured to read the executable instruction from the memory and execute the instruction to implement the data acquisition method according to any of the above embodiments.

Based on the data acquisition method and device, the computer-readable storage medium and the electronic device provided by the above embodiments of the present disclosure, at least one to-be-processed task that needs to be updated is determined, and the at least one to-be-processed task is stored in at least one cache queue; n to-be-processed tasks are acquired from the at least one cache queue and processed to obtain update data, where n is an integer greater than or equal to 1; and the obtained update data is analyzed to obtain structured data, which is stored. The to-be-processed tasks are distributed to the at least one cache queue by means of container technology, and the at least one cache queue is processed at the same time, so that n to-be-processed tasks are processed simultaneously, the data acquisition efficiency is improved, and conflicting resource usage requirements among the cache queues are balanced.

The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.

Fig. 1 is a schematic flow chart diagram of a data acquisition method according to an exemplary embodiment of the present disclosure.

Fig. 2 is a schematic flow chart of step 102 in the embodiment shown in Fig. 1 of the present disclosure.

Fig. 3 is a schematic flow chart of step 104 in the embodiment shown in Fig. 1 of the present disclosure.

Fig. 4 is a schematic flow chart of a data acquisition method according to another exemplary embodiment of the present disclosure.

Fig. 5 is a schematic structural diagram of a data acquisition device according to an exemplary embodiment of the present disclosure.

Fig. 6 is a schematic structural diagram of a cluster management module in a data acquisition device according to an alternative example of the present disclosure.

Fig. 7 is a schematic diagram of a container structure in a data acquisition device according to an alternative example of the present disclosure.

Fig. 8 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.

Detailed Description

Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.

It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.

It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning or any necessary logical order between them.

It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.

It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.

In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and means that three kinds of relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.

It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.

Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, and servers, which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with such electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.

Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.

Exemplary method

Fig. 1 is a schematic flow chart diagram of a data acquisition method according to an exemplary embodiment of the present disclosure. The embodiment can be applied to a cloud platform, and optionally, a plurality of containers can be arranged on the cloud platform, as shown in fig. 1, including the following steps:

Step 102, determining at least one to-be-processed task that needs to be updated, and storing the at least one to-be-processed task into at least one cache queue.

Optionally, the task to be processed in this embodiment may be any data or data set that needs to be updated. When the time elapsed since the last update of a task reaches the update cycle, the task may be determined to be a to-be-processed task that needs to be updated; alternatively, when an external request requires certain tasks to be updated, those tasks are taken as the to-be-processed tasks that need to be updated.

Step 104, acquiring n to-be-processed tasks from the at least one cache queue, and processing the acquired n to-be-processed tasks to obtain update data.

Wherein n is an integer greater than or equal to 1.

In one embodiment, the n to-be-processed tasks are processed simultaneously by a distributed system, so that data is acquired by multiple threads at the same time and the processing speed is improved.

Step 106, analyzing the obtained update data to obtain structured data, and storing the structured data.

Optionally, after the data update of at least one to-be-processed task is completed, update completion information is fed back to the module that stores all to-be-processed tasks, and the update time of the corresponding to-be-processed task is updated, which ensures that a recently updated task is not read again the next time to-be-processed tasks are acquired.
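The simultaneous processing of n tasks in step 104 can be illustrated with a minimal sketch. It assumes a single process with one thread per task rather than the distributed system of the embodiment; `process_task` and `process_batch` are hypothetical names, and the body of `process_task` is a placeholder for the real data acquisition.

```python
from concurrent.futures import ThreadPoolExecutor


def process_task(task_id: str) -> str:
    """Hypothetical stand-in for acquiring the update data of one task."""
    return f"updated:{task_id}"


def process_batch(task_ids: list[str]) -> dict[str, str]:
    """Process n tasks at the same time, one thread per task (assumption)."""
    if not task_ids:
        return {}
    with ThreadPoolExecutor(max_workers=len(task_ids)) as pool:
        # map() preserves input order, so results line up with task_ids.
        results = list(pool.map(process_task, task_ids))
    return dict(zip(task_ids, results))
```

Calling `process_batch(["a", "b"])` returns one result per task; in a real deployment the thread count would more likely be capped than tied to the batch size.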

Based on the data acquisition method provided by the above embodiment of the present disclosure, at least one to-be-processed task that needs to be updated is determined, and the at least one to-be-processed task is stored in at least one cache queue; n to-be-processed tasks are acquired from the at least one cache queue and processed to obtain update data, where n is an integer greater than or equal to 1; and the obtained update data is analyzed to obtain structured data, which is stored. The to-be-processed tasks are distributed to the at least one cache queue by means of container technology, and the at least one cache queue is processed at the same time, so that n to-be-processed tasks are processed simultaneously, the data acquisition efficiency is improved, and conflicting resource usage requirements among the cache queues are balanced.

Container technology is a technology that effectively divides the resources of a single operating system into isolated groups, so as to better balance conflicting resource usage requirements among the isolated groups.

As shown in fig. 2, based on the embodiment shown in fig. 1, step 102 may include the following steps:

Step 1021, determining, from a task set, at least one task meeting the update condition as a task to be processed.

Wherein the task set comprises a plurality of tasks.

Optionally, the update condition comprises at least one of:

in response to receiving an external update request to update at least one task;

in response to the time difference between the current time and the historical update time of the task reaching a preset time difference.

The external request may be received through an externally exposed application programming interface (API), through which a user can modify any task and any scheduling rule.
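The two update conditions can be sketched as a simple selection function. This is an illustration only: the `Task` fields, the use of seconds for times, and the representation of external requests as a set of task identifiers are all assumptions, not details given in the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    last_update: float    # historical update time, in seconds (assumed unit)
    update_period: float  # preset time difference, in seconds (assumed unit)


def select_pending(tasks, now, external_requests):
    """Return the tasks that satisfy at least one update condition."""
    pending = []
    for task in tasks:
        requested = task.task_id in external_requests           # external update request
        expired = now - task.last_update >= task.update_period  # time difference reached
        if requested or expired:
            pending.append(task)
    return pending
```

With tasks last updated at times 0, 50 and 90, a period of 100, a current time of 120, and an external request for the third task, the first and third tasks are selected.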

Step 1022, allocating the at least one to-be-processed task to the at least one cache queue according to the task type of the at least one to-be-processed task.

Wherein each cache queue corresponds to one task type.

In this embodiment, the obtained at least one to-be-processed task is classified by task type, and the to-be-processed tasks of each task type are allocated to the cache queue (e.g., Redis) corresponding to that task type. Task types may be divided according to the source of the data update; for example, data from a business website belongs to one class and data from a customs website belongs to another. Task types may also be divided according to region of origin; for example, data from the Hong Kong region belongs to one class, data from the mainland region belongs to another, and so on.

This embodiment can save each task, together with its update history and update time, in a preset scheduling table, and obtain the to-be-processed tasks by polling the scheduling table to check whether the time difference between a task's update time and the current time has reached the update period. In addition, to ensure that data acquisition can always obtain to-be-processed tasks, when the number of to-be-processed tasks in a cache queue falls below a certain proportion of the queue's capacity, to-be-processed tasks are added to the cache queue in time.

After each to-be-processed task is processed, processing feedback is received. When the feedback indicates that the update succeeded, the task is deleted from the cache queue, and its update result and update time are updated in the scheduling table. When the feedback indicates that the update failed, the priority of the task is lowered, and the task is added again to the corresponding cache queue or added to another cache queue (when a to-be-processed task does not obtain update data through the data acquisition path of one task type, another data acquisition path can be tried to obtain the update data).
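The allocation by task type and the handling of processing feedback can be sketched as follows. In-memory deques stand in for the cache queues and a plain dictionary stands in for the scheduling table; both are assumptions, as are the function names. Lowering the priority of a failed task is modelled here simply by moving it to the tail of its queue, which is one possible reading rather than the method of the disclosure.

```python
from collections import defaultdict, deque


def allocate_by_type(tasks):
    """Group (task_id, task_type) pairs into one queue per task type."""
    queues = defaultdict(deque)
    for task_id, task_type in tasks:
        queues[task_type].append(task_id)
    return queues


def handle_feedback(queues, schedule, task_id, task_type, success, now):
    """Apply the processing feedback received for one task."""
    queue = queues[task_type]
    queue.remove(task_id)  # raises ValueError if the task is not queued
    if success:
        # Success: the task leaves the queue and the scheduling table is updated.
        schedule[task_id] = {"result": "success", "updated_at": now}
    else:
        # Failure: re-add at the tail (assumed way of lowering the priority).
        queue.append(task_id)
        schedule.setdefault(task_id, {})["result"] = "failure"
```

A failed task could equally be appended to the queue of a different task type, which corresponds to trying another data acquisition path.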

Step 1022 may include:

sorting the at least one to-be-processed task belonging to the same task type according to a preset priority;

and storing the at least one to-be-processed task into the cache queue corresponding to the task type according to the sorted order.

In this embodiment, the at least one to-be-processed task is sorted according to the preset priority of each task, so that tasks with higher priority are processed first and tasks with lower priority are processed later. This improves processing efficiency and ensures that important tasks are processed in a timely manner.
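The priority ordering within each task type can be sketched as below. The tuple layout and the convention that a smaller number means a higher priority are assumptions made for the example; the disclosure only states that tasks are sorted by a preset priority.

```python
from collections import defaultdict


def enqueue_by_priority(tasks):
    """Build one ordered queue per task type.

    tasks: iterable of (task_id, task_type, priority) tuples, where a
    smaller priority number is processed first (assumed convention).
    """
    grouped = defaultdict(list)
    for task_id, task_type, priority in tasks:
        grouped[task_type].append((priority, task_id))
    queues = {}
    for task_type, items in grouped.items():
        # sorted() is stable, so tasks with equal priority keep arrival order.
        ordered = sorted(items, key=lambda item: item[0])
        queues[task_type] = [task_id for _, task_id in ordered]
    return queues
```

A worker that always takes the head of a queue built this way handles the most important task of that type first.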

As shown in fig. 3, based on the embodiment shown in fig. 1, step 104 may include the following steps:

Step 1041, synchronously acquiring n to-be-processed tasks.

Step 1042, acquiring n proxy services from the proxy pool based on the n to-be-processed tasks.

Wherein the proxy pool comprises at least n proxy services.

Step 1043, obtaining the update data corresponding to the n to-be-processed tasks from different paths based on the n proxy services, respectively.

In this embodiment, the update data is acquired through proxy services (which may include IP address proxies and the like), which avoids the risk of being blocked when data is acquired directly. Each proxy service has a set validity period, and before the update data corresponding to the n to-be-processed tasks is acquired from different paths based on the n proxy services, the method further includes:

acquiring the validity period of each proxy service in the n proxy services; determining whether each of the n proxy services is valid based on the validity period;

Optionally, whether a proxy service is valid is determined by checking whether it is still within its validity period: a proxy service within the validity period is valid, and a proxy service beyond the validity period is invalid.

in response to each of the n proxy services being valid, acquiring update data with the n proxy services;

and in response to at least one proxy service of the n proxy services being invalid, acquiring a new proxy service to replace the invalid proxy service, so that the update data is acquired through the updated n proxy services.

In this embodiment, when a proxy service is invalid, a new proxy service needs to be acquired so that n valid proxy services are available, which ensures that n to-be-processed tasks can be processed simultaneously.
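The validity check and replacement can be sketched as follows. The `Proxy` fields and the `acquire_new` callback are hypothetical; the sketch also assumes that a newly acquired proxy is itself within its validity period, which a real implementation would need to verify.

```python
from dataclasses import dataclass


@dataclass
class Proxy:
    address: str
    expires_at: float  # end of the validity period, in seconds (assumed unit)


def ensure_valid_proxies(proxies, now, acquire_new):
    """Return n proxies in which each expired proxy is replaced by a new one."""
    refreshed = []
    for proxy in proxies:
        if now < proxy.expires_at:
            refreshed.append(proxy)          # still within the validity period
        else:
            refreshed.append(acquire_new())  # invalid: acquire a replacement
    return refreshed
```

The returned list always has the same length as the input, so n tasks can still be paired with n proxies after the check.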

Optionally, step 1041 in the above embodiment may include:

establishing n working units in a working pool, and scanning the n working units at set intervals so that the n working units in the working pool work normally;

and simultaneously acquiring n to-be-processed tasks through the n working units in the working pool.

In this embodiment, each working unit (worker) in the working pool (which may be multiple threads in a distributed system) is scanned at set intervals (e.g., every 5 seconds). When a working unit can no longer schedule tasks, it is considered dead; the dead working unit is restarted, or it is deleted and a new working unit is added to the working pool, which ensures that n working units schedule to-be-processed tasks at the same time. Each working unit schedules tasks sequentially (it schedules the next task only after the current one has been processed). When a working unit finds no task to process, it sleeps for a set duration (e.g., 1 second) and then tries again to acquire the next to-be-processed task from the cache queue.
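A working pool with periodic scanning can be sketched with threads in a single process. Several points are assumptions: thread liveness is used as the test for a dead working unit, `queue.Queue` stands in for the cache queue, and `WorkerPool`, `worker_loop` and `scan` are hypothetical names. A supervisor would call `scan()` at the set interval (e.g., every 5 seconds).

```python
import queue
import threading
import time


def worker_loop(tasks, results, stop, idle_sleep=0.01):
    """Process tasks one after another; sleep briefly when the queue is empty."""
    while not stop.is_set():
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            time.sleep(idle_sleep)  # nothing to do: sleep, then poll again
            continue
        results.append(f"done:{task}")  # placeholder for the real processing
        tasks.task_done()


class WorkerPool:
    def __init__(self, n, tasks, results):
        self.tasks, self.results = tasks, results
        self.stop = threading.Event()
        self.workers = [self._spawn() for _ in range(n)]

    def _spawn(self):
        thread = threading.Thread(
            target=worker_loop, args=(self.tasks, self.results, self.stop), daemon=True
        )
        thread.start()
        return thread

    def scan(self):
        """Replace dead workers so that n workers stay available; return the count."""
        replaced = 0
        for i, worker in enumerate(self.workers):
            if not worker.is_alive():
                self.workers[i] = self._spawn()
                replaced += 1
        return replaced

    def shutdown(self):
        self.stop.set()
        for worker in self.workers:
            worker.join()
```

The short `idle_sleep` default keeps the example fast; the embodiment mentions a sleep on the order of 1 second.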

Optionally, step 1043 in the above embodiment may include:

determining at least one keyword based on the n to-be-processed tasks, each to-be-processed task corresponding to at least one keyword;

and each proxy service of the n proxy services respectively acquires the update data based on the at least one keyword corresponding to the task to be processed.

In this embodiment, data capture is performed through the proxy service. Specifically, at least one keyword is obtained from the data of a to-be-processed task, and a search is performed based on each keyword to capture data. When data is captured with a keyword, the to-be-processed task is considered to have been updated successfully; when no data is captured, the task is considered to have failed to update. The result of the update, success or failure, is fed back to the scheduling table.
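The keyword-based capture and its success or failure result can be sketched as below. The `search` callable, which takes a keyword and a proxy and returns captured data or `None`, is a hypothetical stand-in for the real request made through a proxy service.

```python
def update_task(keywords, search, proxy):
    """Search with each keyword through one proxy and report the update result.

    Returns ("success", captured) if any keyword captured data,
    otherwise ("failure", {}).
    """
    captured = {}
    for keyword in keywords:
        data = search(keyword, proxy)  # hypothetical search through the proxy
        if data:
            captured[keyword] = data
    status = "success" if captured else "failure"
    return status, captured
```

The returned status is what would be fed back to the scheduling table for the task.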

Fig. 4 is a schematic flow chart of a data acquisition method according to another exemplary embodiment of the present disclosure. As shown in fig. 4, the method comprises the following steps:

Step 402, determining at least one to-be-processed task that needs to be updated, and storing the at least one to-be-processed task into at least one cache queue.

Step 404, acquiring n to-be-processed tasks from the at least one cache queue, and processing the acquired n to-be-processed tasks to obtain update data.

Wherein n is an integer greater than or equal to 1.

Step 406, storing the update data into the first database.

Step 408, performing structuring processing on the update data in the first database, and storing the obtained structured data into the second database.

Since the data acquired from the network is unstructured data (e.g., web pages), it needs to be parsed to obtain structured data; the parsing may further include, but is not limited to, processing such as format modification, so that the resulting structured data can be counted and displayed. In addition, so that the source can be traced and the operation rolled back promptly when a problem arises in the data or in its processing, the data collected by each update operation can be stored in a separate database (e.g., MongoDB). In this embodiment, the parsed structured data is stored in a second database (e.g., MySQL), and when display or statistics are required, the information can be obtained directly from the second database.
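The two-stage storage can be sketched with plain dictionaries standing in for the first (raw) and second (structured) databases; no real database client is used. The choice to extract the first-level heading of a web page is purely illustrative, and `HeadingParser` and `structure` are hypothetical names.

```python
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collect the text inside <h1> elements of a page."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.heading = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.heading += data


def structure(raw_store, structured_store):
    """Parse every raw page and write one structured record per task."""
    for task_id, html in raw_store.items():
        parser = HeadingParser()
        parser.feed(html)
        structured_store[task_id] = {"heading": parser.heading.strip()}
```

The raw store is left untouched, so the original page remains available for tracing and for re-parsing if the structuring logic changes.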

In some optional embodiments, the method provided in this embodiment further includes:

and obtaining an updating result of at least one task to be processed.

The updating result comprises updating success, updating failure and updating exception.

Alternatively, the update exception may include, for example, a web page being inaccessible, being considered illegitimate, etc.

After the analysis of the update data is completed, the present embodiment counts and stores the update result to provide an adjustment basis for the later analysis and policy making, for example, analyzing the reason of the abnormality, analyzing the reason of the update failure, and adjusting the update process according to the analysis result.

and displaying the statistical result, and sending out alarm information when the abnormal updating times in the statistical result exceed the set proportion in the number of the tasks to be processed.

The embodiment provides visual statistical result display, facilitates developers to visually observe whether various indexes are normal or not, and sends out alarm information to remind the developers to intervene in time when the abnormal updating times exceed the set proportion in the number of the tasks to be processed.
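The counting and alarm rule described above can be sketched as follows. The result labels `success`, `failure` and `exception` and the threshold value are assumptions chosen for illustration; the disclosure only states that an alarm is issued when the share of abnormal updates among the tasks to be processed exceeds a set proportion.

```python
from collections import Counter


def summarize(results, abnormal_ratio_limit):
    """Count update results and decide whether an alarm should be raised."""
    counts = Counter(results)
    total = len(results)
    # Share of abnormal updates among all processed tasks (0 when there are none).
    abnormal_ratio = counts["exception"] / total if total else 0.0
    return {
        "success": counts["success"],
        "failure": counts["failure"],
        "exception": counts["exception"],
        "alarm": abnormal_ratio > abnormal_ratio_limit,
    }


summary = summarize(
    ["success", "success", "exception", "failure", "exception"],
    abnormal_ratio_limit=0.3,
)
```

Here two of five updates are abnormal, so the ratio of 0.4 exceeds the assumed limit of 0.3 and the `alarm` flag is set.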

Any of the data acquisition methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, a server and the like. Alternatively, any of the data acquisition methods provided by the embodiments of the present disclosure may be executed by a processor, for example, the processor may execute any of the data acquisition methods mentioned in the embodiments of the present disclosure by calling a corresponding instruction stored in a memory. This will not be described in detail below.

Exemplary devices

Fig. 5 is a schematic structural diagram of a data acquisition device according to an exemplary embodiment of the present disclosure. As shown in fig. 5, the apparatus provided in this embodiment includes:

the scheduling module 51 is configured to determine at least one to-be-processed task that needs to be updated, and store the at least one to-be-processed task in at least one buffer queue.

The acquisition module 52 is configured to acquire n to-be-processed tasks from at least one buffer queue, and process the acquired n to-be-processed tasks to obtain update data.

Wherein n is an integer of 1 or more.

And the analysis module 53 is configured to perform analysis processing on the obtained updated data to obtain and store structured data.

Based on the data acquisition device provided by the above embodiment of the present disclosure, at least one to-be-processed task that needs to be updated is determined, and the at least one to-be-processed task is stored in at least one cache queue; n to-be-processed tasks are acquired from the at least one cache queue and processed to obtain updated data, wherein n is an integer greater than or equal to 1; and the obtained updated data is analyzed to obtain structured data, which is then stored. The data to be processed is distributed to the at least one cache queue through container technology, and the cache queues are processed at the same time, so that n tasks to be processed are processed simultaneously. This improves the data acquisition efficiency and balances the conflicting resource requirements among the cache queues.

Optionally, the scheduling module, the collecting module and the analyzing module are respectively executed in at least one container;

further comprising:

and the cluster management module is used for managing the containers.

Optionally, in one example, the cluster management module is as shown in fig. 6, where etcd is a highly available key/value storage system mainly used for shared configuration and service discovery; container manager denotes the container manager; api server denotes the API server; scheduler denotes the scheduler; Docker is an open-source application container engine; Fluentd is an open-source general-purpose log collection and distribution system that can collect logs from multiple data sources and, after filtering and processing, distribute them to multiple storage and processing systems; node denotes a node; and Kubernetes (a container orchestration engine) is used for cluster management, provides system maintenance functions such as automatic deployment, automatic scaling out and in, automatic restart and load balancing, and is the basis of the distributed system. The device provided by the embodiment of the application runs on a cloud platform that comprises a plurality of containers, and each container can run one module of the data acquisition device or part of the functions of a module (the modules comprise the scheduling module, the acquisition module and the analysis module, and further comprise a storage module, a statistical module, a monitoring module and a proxy module for maintaining the proxy pool). Running the module functions in containers makes changes to the structure and functions of the device (addition, removal and migration) more convenient; because changes are made with the container as the unit, their cost is greatly reduced, and when a function needs to be added, it is only necessary to copy a container that has the same function.

Each server program is deployed in a Docker container (equivalent to a virtual host in which arbitrary programs can be placed and executed). In an optional example, a schematic structural diagram of the container is shown in fig. 7: the Docker Daemon (container daemon) accepts requests from the Docker Client (container client), the Engine then executes tasks inside Docker in the form of jobs, and each job runs independently.

Kubernetes makes deployment extremely simple. Business code is managed with Git (an open-source distributed version control system); the code only needs to be pulled onto a server, packaged with Docker, and pushed to a private repository (the private repository packages and manages the data in the containers, which makes transfer and viewing of historical versions convenient). The Docker packaging process can be integrated into a development tool, after which release is performed with one click on the management platform.

The cluster management platform also provides a visual interface and abnormality alarms. Container states and server resources are visualized and managed by the platform and presented to developers, so that the operation of the servers can be adjusted to ensure the normal operation of the containers, and the newly added container

Optionally, the scheduling module 51 is specifically configured to determine, from the task set, at least one task that meets the update condition as a task to be processed; wherein the task set comprises a plurality of tasks; allocating at least one task to be processed to at least one buffer queue according to the task type of the at least one task to be processed; wherein each buffer queue corresponds to a task type.

Optionally, the update condition comprises at least one of:

Optionally, when allocating the at least one to-be-processed task to at least one buffer queue according to the task type of the at least one to-be-processed task, the scheduling module 51 is configured to sort the to-be-processed tasks belonging to the same task type according to a preset priority, and to store the to-be-processed tasks into the buffer queue corresponding to the task type in the sorted order.
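The allocation performed by the scheduling module can be sketched as below. The sketch assumes that the tasks passed in have already met the update condition, that each task carries hypothetical `type` and `priority` fields, and that a smaller priority number means the task is served earlier; none of these details is fixed by the disclosure.

```python
from collections import defaultdict, deque


def dispatch(tasks):
    """Group tasks by type, sort each group by priority, fill one queue per type."""
    by_type = defaultdict(list)
    for task in tasks:
        by_type[task["type"]].append(task)

    queues = {}
    for task_type, group in by_type.items():
        # Assumption: lower number = higher priority. The sort is stable, so
        # tasks with equal priority keep their arrival order.
        group.sort(key=lambda t: t["priority"])
        queues[task_type] = deque(group)
    return queues


queues = dispatch([
    {"id": "a", "type": "company", "priority": 2},
    {"id": "b", "type": "lawsuit", "priority": 1},
    {"id": "c", "type": "company", "priority": 1},
    {"id": "d", "type": "company", "priority": 2},
])
company_order = [task["id"] for task in queues["company"]]
```

Keeping one queue per task type means a backlog in one type does not delay tasks of another type, which is consistent with the queues being processed at the same time.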

Optionally, the acquisition module 52 is specifically configured to acquire n to-be-processed tasks synchronously; acquiring n proxy services from a proxy pool based on the n tasks to be processed; the proxy pool at least comprises n proxy services; and respectively acquiring the updating data corresponding to the n tasks to be processed from different paths based on the n proxy services.

Optionally, the acquisition module 52 is configured to establish n working units in the working pool when n tasks to be processed are synchronously acquired, and scan the n working units at a set time interval, so that the n working units in the working pool can work normally; and simultaneously acquiring n tasks to be processed through n working units in the working pool.
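A working pool of n working units with a health scan might look like the following sketch. Threads are used as working units and upper-casing a string stands in for processing a task; both are assumptions, as are the names `WorkPool`, `work_unit` and `scan`. In a real deployment `scan` would be invoked by a timer at the set time interval, whereas here it is called once.

```python
import queue
import threading


def work_unit(tasks, results):
    # Each working unit repeatedly takes a task and processes it.
    while True:
        task = tasks.get()
        try:
            results.append(task.upper())  # placeholder for real processing
        finally:
            tasks.task_done()


class WorkPool:
    def __init__(self, n, tasks, results):
        self.tasks = tasks
        self.results = results
        self.units = [self._spawn() for _ in range(n)]

    def _spawn(self):
        unit = threading.Thread(
            target=work_unit, args=(self.tasks, self.results), daemon=True
        )
        unit.start()
        return unit

    def scan(self):
        """Replace every unit that has stopped; return how many were replaced."""
        replaced = 0
        for index, unit in enumerate(self.units):
            if not unit.is_alive():
                self.units[index] = self._spawn()
                replaced += 1
        return replaced


tasks = queue.Queue()
results = []
pool = WorkPool(3, tasks, results)
for name in ["a", "b", "c", "d"]:
    tasks.put(name)
tasks.join()            # wait until every queued task has been processed
replaced = pool.scan()  # all units are still waiting for work, so none is replaced
```

The scan keeps the number of live working units at n, so the pool can keep taking n tasks at a time even if an individual unit stops unexpectedly.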

Optionally, the acquisition module is configured to determine at least one keyword based on the n to-be-processed tasks when acquiring update data corresponding to the n to-be-processed tasks from different paths based on the n proxy services, respectively; each task to be processed corresponds to at least one keyword; and each agent service in the n agent services respectively acquires the updating data based on at least one keyword corresponding to the task to be processed.

Optionally, the acquisition module is further configured to acquire a validity period of each proxy service in the n proxy services; determining whether each of the n proxy services is valid based on the validity period; acquiring update data with the n proxy services in response to each of the n proxy services being valid; and in response to the fact that at least one proxy service in the n proxy services is invalid, re-acquiring a new proxy service to replace the invalid proxy service, so that the updated n proxy services acquire the update data.
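The validity check on the n proxy services can be sketched as follows. The `Proxy` record, the use of an expiry timestamp as the validity period, and the list used as the proxy pool are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proxy:
    address: str       # hypothetical identifier of the proxy service
    expires_at: float  # end of the validity period, as a timestamp


def ensure_valid(proxies: List[Proxy], pool: List[Proxy], now: float) -> List[Proxy]:
    """Return n proxies that are all valid at `now`, replacing expired ones."""
    checked = []
    for proxy in proxies:
        # Keep drawing from the pool until a proxy that is still valid is found.
        # pool.pop() raises IndexError if the pool runs out of proxies.
        while proxy.expires_at <= now:
            proxy = pool.pop()
        checked.append(proxy)
    return checked


now = 1000.0
in_use = [Proxy("proxy-a", 1500.0), Proxy("proxy-b", 900.0)]
spare = [Proxy("proxy-c", 2000.0), Proxy("proxy-d", 950.0)]
valid = ensure_valid(in_use, spare, now)
valid_addresses = [proxy.address for proxy in valid]
```

In this example `proxy-b` has expired, the first replacement drawn (`proxy-d`) has expired as well, and `proxy-c` is finally used, so the task still runs with n valid proxy services.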

Optionally, the apparatus provided in this embodiment of the present application further includes:

the storage module is used for storing the updated data into a first database and storing the structured data obtained by the analysis module 53 into a second database;

the parsing module 53 is specifically configured to perform a structuring process on the updated data in the first database to obtain structured data.

the statistical module is used for acquiring an updating result of at least one task to be processed; wherein the updating result comprises updating success, updating failure and updating exception; and counting the times of successful updating, the times of failed updating and the times of abnormal updating, and storing the counting result.

and the monitoring module is used for displaying the statistical result and sending out alarm information when the abnormal updating times in the statistical result exceed the set proportion in the number of the tasks to be processed.

Exemplary electronic device

Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 8. The electronic device may be either or both of the first device 100 and the second device 200, or a stand-alone device separate from them that may communicate with the first device and the second device to receive the collected input signals therefrom.

FIG. 8 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.

As shown in fig. 8, the electronic device 80 includes one or more processors 81 and memory 82.

The processor 81 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 80 to perform desired functions.

Memory 82 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 81 to implement the data acquisition methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.

In one example, the electronic device 80 may further include: an input device 83 and an output device 84, which are interconnected by a bus system and/or other form of connection mechanism (not shown).

For example, when the electronic device is the first device 100 or the second device 200, the input device 83 may be a microphone or a microphone array as described above for capturing an input signal of a sound source. When the electronic device is a stand-alone device, the input means 83 may be a communication network connector for receiving the acquired input signals from the first device 100 and the second device 200.

The input device 83 may also include, for example, a keyboard, a mouse, and the like.

The output device 84 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 84 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.

Of course, for simplicity, only some of the components of the electronic device 80 relevant to the present disclosure are shown in fig. 8, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 80 may include any other suitable components depending on the particular application.

Exemplary computer program product and computer-readable storage Medium

In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the data acquisition method according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.

The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a data collection method according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.

The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.

In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The block diagrams of devices, apparatuses, systems referred to in this disclosure are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, systems may be connected, arranged, configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".

The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.

It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.

The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims

1. A method of data acquisition, comprising:

2. The method of claim 1, wherein determining at least one pending task that needs to be updated and storing the at least one pending task in at least one buffer queue comprises:

3. The method of claim 2, wherein the update condition comprises at least one of:

4. The method according to claim 2 or 3, wherein each of the tasks to be processed corresponds to a different preset priority;

5. The method according to any one of claims 1 to 4, wherein the obtaining n pending tasks from the at least one buffer queue and processing the obtained n pending tasks to obtain the update data comprises:

acquiring n tasks to be processed;

6. The method of claim 5, wherein the obtaining n of the tasks to be processed comprises:

7. The method according to claim 5 or 6, wherein the obtaining of the update data corresponding to the n pending tasks from different paths based on the n proxy services respectively comprises:

8. The method according to any one of claims 5 to 7, further comprising, before obtaining the update data corresponding to the n pending tasks from different paths based on the n proxy services, respectively:

acquiring the validity period of each proxy service in the n proxy services;

9. The method according to any one of claims 1 to 8, wherein before parsing the obtained updated data to obtain structured data and storing the structured data, the method further comprises:

storing the updated data into a first database;

10. The method of any of claims 1-9, further comprising:

11. The method of claim 10, further comprising:

12. A data acquisition device, comprising:

13. The apparatus of claim 12, wherein the scheduling module, the acquisition module, and the parsing module are each executed in at least one container;

further comprising:

and the cluster management module is used for managing the containers.

14. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the data acquisition method of any of the preceding claims 1-11.

15. An electronic device, characterized in that the electronic device comprises:

a processor;

a memory for storing the processor-executable instructions;

the processor is configured to read the executable instructions from the memory and execute the instructions to implement the data acquisition method of any one of claims 1 to 11.