WO2019223599A1

WO2019223599A1 - Data acquisition system and method, node device and storage medium

Info

Publication number: WO2019223599A1
Application number: PCT/CN2019/087226
Authority: WO
Inventors: 郭峰
Original assignee: 杭州海康威视数字技术股份有限公司
Priority date: 2018-05-25
Filing date: 2019-05-16
Publication date: 2019-11-28
Also published as: CN110597890B; CN110597890A

Abstract

The present application relates to the technical field of big data, and disclosed thereby are a data acquisition system and method, a node device and a storage medium. A first data processing module in the data collection system provided by the present application is used for instructing a data source to provide the next batch of data when acquiring any batch of data of the data source; any data processing module in the data collection system is used for receiving the next batch of data while executing a corresponding data processing operation on the received any batch of data; and the last data processing module in the data collection system is used for storing the processed data in a first storage source. In the present application, a fully asynchronous system architecture is designed to ensure that the data acquisition system may process multiple batches of data at the same time, thereby avoiding the situation of a data processing module having to wait for another data processing module to complete a data processing operation before starting to process data, thus improving the data acquisition efficiency.

Description

Data acquisition system, method, node equipment and storage medium

This application claims priority to Chinese patent application No. 201810515496.4, filed on May 25, 2018 and entitled "Data Acquisition System, Method, Node Device, and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of big data technology, and in particular, to a data collection system, method, node device, and storage medium.

Background

With the development of big data technology and the rapid growth of massive data in the network, the challenge of data collection has become more prominent. Data collection refers to the process of processing data in a data source through a series of processing operations, and finally storing the processed data to a storage source. Data collection can help people manage, analyze, and mine data, and it has great economic and application value.

At present, data acquisition systems usually adopt a single-machine, multi-threaded architecture and acquire data synchronously. The data acquisition system includes multiple threads, each of which performs one data processing operation. During data acquisition, when the data source provides a batch of data, the first thread pulls the batch from the data source, processes it, and sends the processed data to the second thread. The second thread receives the data from the first thread, processes it, and sends the processed data to the third thread, and so on. The last thread receives the data, processes it, and stores the processed data in the storage source. The last thread then notifies the data source that the batch it provided has been successfully stored. Only after being notified does the data source provide the next batch of data, which the first thread then pulls from the data source, and so on.

In the process of implementing this application, the inventors found that the related technology has at least the following problems:

At any given time, the entire data collection system can process only one batch of data. Each thread must wait for the other threads to finish processing that batch before it can start receiving and processing the next batch, so the efficiency of data collection is extremely low.

Summary of the Invention

The embodiments of the present application provide a data collection system, method, node device, and storage medium, which can solve the problem of extremely low data collection efficiency in related technologies. The technical solution is as follows:

In one aspect, a data acquisition system is provided, and the data acquisition system includes multiple data processing modules;

The first data processing module in the data acquisition system is used to instruct the data source to provide the next batch of data when any batch of data from the data source is obtained;

Any data processing module in the data acquisition system is configured to receive the next batch of data when performing a corresponding data processing operation on any batch of data that has been received;

The last data processing module in the data acquisition system is used to store the processed data in a first storage source.

In a possible design, any data processing module in the data acquisition system has a corresponding memory space, and the memory space is used to buffer the received data;

Any one of the data processing modules in the data acquisition system is further configured to, while performing a corresponding data processing operation on any batch of data that has been received, cache the next batch of received data in the memory space of that data processing module, and to read the next batch of data from the memory space when the processing of the current batch of data is completed.
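By way of illustration only, the buffering behavior of this design can be sketched as follows. The sketch assumes that each data processing module is a single worker thread and that a FIFO queue stands in for the module's memory space; the names `ProcessingModule`, `receive`, and `run_demo` are hypothetical and do not appear in the application.

```python
import queue
import threading


class ProcessingModule:
    """One pipeline stage with its own inbound buffer (hypothetical names)."""

    def __init__(self, operation, downstream=None):
        self.operation = operation      # this stage's data processing operation
        self.downstream = downstream    # next stage, or None for the last stage
        self.buffer = queue.Queue()     # stands in for the module's memory space
        self.results = []               # the last stage collects its output here
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def receive(self, batch):
        # Returns immediately: the batch is cached even while the worker is busy.
        self.buffer.put(batch)

    def _run(self):
        while True:
            batch = self.buffer.get()   # read the next cached batch
            if batch is None:           # sentinel: propagate shutdown and stop
                if self.downstream is not None:
                    self.downstream.receive(None)
                return
            processed = self.operation(batch)
            if self.downstream is not None:
                self.downstream.receive(processed)
            else:
                self.results.append(processed)

    def join(self):
        self._worker.join()


def run_demo():
    last = ProcessingModule(lambda b: [x + 1 for x in b])
    first = ProcessingModule(lambda b: [x * 2 for x in b], downstream=last)
    for batch in ([1, 2], [3, 4], [5, 6]):
        first.receive(batch)            # does not wait for earlier batches
    first.receive(None)
    first.join()
    last.join()
    return last.results
```

Because `receive` only enqueues, the upstream side is never blocked by a stage that is still working on an earlier batch, which is the property this design relies on.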

In a possible design, the data acquisition system further includes at least one shared memory pool, and each shared memory pool is used to provide a memory space for a corresponding plurality of data processing modules.

In a possible design, any one of the data processing modules is further configured to apply for a memory space from a corresponding shared memory pool and, when it has finished using the memory space obtained by the application, to release that memory space back to the shared memory pool.
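By way of illustration only, applying for and releasing memory space can be sketched as follows. The sketch assumes a fixed-capacity pool that simply counts bytes; the class name `SharedMemoryPool`, its methods, and the capacity value are hypothetical.

```python
import threading


class SharedMemoryPool:
    """Fixed-capacity pool shared by several modules (illustrative only)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.in_use = 0
        self._lock = threading.Lock()

    def apply(self, size):
        """Return a buffer of `size` bytes, or None if the pool is exhausted."""
        with self._lock:
            if self.in_use + size > self.capacity:
                return None
            self.in_use += size
        return bytearray(size)

    def release(self, buf):
        """Give the buffer's space back to the pool."""
        with self._lock:
            self.in_use -= len(buf)


pool = SharedMemoryPool(capacity_bytes=1024)
a = pool.apply(600)
b = pool.apply(600)      # does not fit while `a` is outstanding
pool.release(a)
c = pool.apply(600)      # fits again after the release
```

Here the second application is refused until the first buffer has been released, so the modules sharing the pool cannot jointly exceed its capacity.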

In a possible design, the memory space of any data processing module in the data acquisition system includes on-heap memory;

Any one of the data processing modules is further configured to push the processed data to the on-heap memory of a next data processing module.

In a possible design, the memory space of any data processing module in the data acquisition system further includes off-heap memory;

The any data processing module is further configured to push the processed data to the on-heap memory of the next data processing module when the any data processing module and the next data processing module are located on the same node device; or,

The any data processing module is further configured to, when the any data processing module and the next data processing module are located on different node devices, serialize the processed data, store the serialized data in the off-heap memory of the any data processing module, and push the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
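By way of illustration only, the two push paths can be sketched as follows. Python has no on-heap/off-heap distinction, so the sketch assumes that a list stands in for on-heap memory, a `bytearray` stands in for off-heap memory, `pickle` stands in for the serialization step, and a byte copy stands in for the network transfer. The class name `Module` and its attributes are hypothetical.

```python
import pickle


class Module:
    def __init__(self, name, node):
        self.name = name
        self.node = node              # identifier of the hosting node device
        self.on_heap = []             # received objects, ready to be processed
        self.off_heap = bytearray()   # staging area for serialized bytes

    def push(self, data, nxt):
        if self.node == nxt.node:
            # Same node device: hand the object over directly.
            nxt.on_heap.append(data)
            return "direct"
        # Different node devices: serialize, stage, then transmit.
        self.off_heap[:] = pickle.dumps(data)
        wire_bytes = bytes(self.off_heap)   # stands in for the network send
        nxt.on_heap.append(pickle.loads(wire_bytes))
        self.off_heap.clear()
        return "serialized"
```

The direct path avoids any serialization cost, while the cross-node path delivers an independent copy of the data to the next module.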

In a possible design, the data acquisition system further includes a second storage source, where the second storage source is used to store snapshot units of the data currently being processed by the data acquisition system, and each snapshot unit is used to indicate the data processing module that is currently processing the corresponding data;

Any one of the data processing modules is further configured to, after that data processing module goes down and is restarted, resume processing the data it was processing before the outage based on the snapshot unit in the second storage source.

In this design, a snapshot mechanism is combined with the fully asynchronous data acquisition system to support resuming from a breakpoint. When a data processing module goes down and is restarted, it can resume processing the data it was processing before the outage based on the snapshot in the second storage source. This avoids having to reprocess the data from the beginning because of the outage, and improves the robustness and reliability of the data collection system.

In a possible design, the first data processing module in the data acquisition system is further configured to insert fence messages between different data of the data source, use all data between two adjacent fence messages as a snapshot unit, and store the snapshot unit in the second storage source, where a fence message is used to indicate a starting point or an end point of a snapshot unit.

In a possible design, any data processing module in the data acquisition system is further configured to, while processing any data, add the identifier of that data processing module to the snapshot unit to which the data belongs in the second storage source.
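By way of illustration only, fence insertion and snapshot-unit bookkeeping can be sketched as follows. The sketch assumes that a fence is inserted after a fixed number of records and that a plain list stands in for the second storage source; the names `FENCE`, `insert_fences`, `build_snapshot_units`, and `mark_processing`, and the `unit_size` parameter, are hypothetical.

```python
FENCE = object()   # hypothetical marker standing in for a fence message


def insert_fences(records, unit_size):
    """Yield the records with a fence before the first and after every unit."""
    yield FENCE
    count = 0
    for record in records:
        yield record
        count += 1
        if count % unit_size == 0:
            yield FENCE
    if count % unit_size != 0:
        yield FENCE            # close the final, shorter unit


def build_snapshot_units(stream):
    """Group the data between two adjacent fences into one snapshot unit."""
    units, current = [], None
    for item in stream:
        if item is FENCE:
            if current:        # a non-empty unit ends at this fence
                units.append({"data": current, "module_ids": []})
            current = []       # this fence also starts the next unit
        elif current is not None:
            current.append(item)
    return units


def mark_processing(snapshot_store, unit_index, module_id):
    """Record that `module_id` is handling the given snapshot unit."""
    snapshot_store[unit_index]["module_ids"].append(module_id)
```

After a restart, a module could look up the snapshot units that carry its identifier to find the data it was handling before the outage; that recovery step is not shown here.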

In a possible design, the first data processing module is further configured to reduce a speed of obtaining data from the data source when a node device where the first data processing module is located meets a preset condition.

In a possible design, any data processing module other than the first data processing module in the data acquisition system is further configured to instruct the first data processing module to reduce the speed of obtaining data from the data source when the corresponding node device meets a preset condition.

Based on the above design, a backpressure mechanism is combined with the fully asynchronous data acquisition system. When a node device of the data acquisition system meets the preset condition, the first data processing module reduces the speed of obtaining data from the data source. This alleviates memory shortages during traffic peaks and improves the robustness and reliability of the data acquisition system.

In a possible design, the preset condition includes that the memory currently occupied in a shared memory pool of a node device reaches a first preset threshold, where the shared memory pool is used to provide multiple data processing modules of the node device with memory space for buffering received data; and / or,

The preset condition includes that the total number of threads of the data processing modules of the node device that are currently applying for memory reaches a second preset threshold.
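By way of illustration only, checking the preset condition and reducing the acquisition speed can be sketched as follows. The sketch implements the "or" variant of the condition. The halving factor, the minimum rate, and the names `NodeStatus`, `meets_preset_condition`, and `next_pull_rate` are assumptions made for the example, not values given in the application.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    pool_bytes_in_use: int      # memory currently occupied in the shared pool
    threads_requesting: int     # threads currently applying for memory


def meets_preset_condition(status, pool_threshold, thread_threshold):
    """True if either threshold is reached (the 'or' variant of the design)."""
    return (status.pool_bytes_in_use >= pool_threshold
            or status.threads_requesting >= thread_threshold)


def next_pull_rate(current_rate, status, pool_threshold, thread_threshold,
                   slowdown=0.5, min_rate=1.0):
    """Reduce the pull rate under pressure; otherwise leave it unchanged."""
    if meets_preset_condition(status, pool_threshold, thread_threshold):
        return max(min_rate, current_rate * slowdown)
    return current_rate
```

For example, with a first threshold of 800 bytes and a second threshold of 16 threads, a node reporting 900 bytes in use would cause a pull rate of 100.0 batches per unit of time to drop to 50.0.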

In a possible design, any data processing module in the data acquisition system is further configured to perform a corresponding data processing operation on any batch of data based on at least one thread;

Any one of the data processing modules is further configured to increase the amount of concurrency when a traffic peak is detected. The amount of concurrency refers to the number of threads that process data in the data processing module.

In a possible design, the multiple data processing modules are respectively located in multiple node devices.

In another aspect, a data acquisition method is provided. The method is applied to a data acquisition system. The data acquisition system includes multiple data processing modules. The method includes:

When the first data processing module obtains any batch of data from the data source, the first data processing module instructs the data source to provide the next batch of data;

When any data processing module performs a corresponding data processing operation on any batch of data that has been received, the any data processing module receives the next batch of data;

The last data processing module stores the processed data in the first storage source.

When any data processing module performs a corresponding data processing operation on any batch of data that has been received, said any data processing module receiving the next batch of data includes:

When the any data processing module performs a corresponding data processing operation on any batch of data that has been received, the any data processing module caches the next batch of received data in the memory space of the any data processing module;

When the any data processing module finishes processing the any batch of data, the any data processing module reads the next batch of data from the memory space.

In a possible design, when any data processing module performs a corresponding data processing operation on any batch of data that has been received, the any data processing module receives the next batch of data, including:

Any one of the data processing modules applies for a memory space from a corresponding shared memory pool;

When the memory space obtained by the application is used up, any one of the data processing modules releases the memory space obtained by the application back to the shared memory pool.

After the first data processing module instructs the data source to provide the next batch of data, the method further includes:

Any one of the data processing modules pushes the processed data to the on-heap memory of the next data processing module.

Pushing the processed data to the on-heap memory of the next data processing module by any one of the data processing modules includes:

When the any data processing module and the next data processing module are located on the same node device, the any data processing module pushes the processed data to the on-heap memory of the next data processing module; or,

When the any data processing module and the next data processing module are located on different node devices, the any data processing module serializes the processed data, stores the serialized data in the off-heap memory of the any data processing module, and pushes the serialized data from the off-heap memory to the on-heap memory of the next data processing module.

The method further includes:

When any one of the data processing modules is down and restarted, the any one of the data processing modules resumes data processing on the data processed before the downtime based on the snapshot unit in the second storage source.

In a possible design, before any one of the data processing modules resumes data processing on the data processed before the outage based on the snapshot unit in the second storage source, the method further includes:

A first data processing module inserts a fence message between different data of the data source;

The first data processing module uses all data between two adjacent fence messages as a snapshot unit;

The first data processing module stores the snapshot unit to the second storage source, wherein the fence message is used to indicate a start point or an end point of the snapshot unit.

In the process of processing any data by any data processing module, the any data processing module adds the identity of the data processing module to a snapshot unit to which the data belongs in the second storage source.

In a possible design, the method further includes:

When the node device where the first data processing module is located meets a preset condition, the first data processing module reduces the speed of obtaining data from the data source.

In a possible design, the method further includes:

When the node device corresponding to any data processing module other than the first data processing module meets a preset condition, that data processing module instructs the first data processing module to reduce the speed of obtaining data from the data source.

The method further includes:

In another aspect, a node device is provided. The node device includes a processor and a memory. The memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data acquisition method.

In another aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data collection method.

In another aspect, a computer program product containing instructions is provided, which, when run on a node device, enables the node device to implement the data acquisition method described above.

In another aspect, a chip is provided. The chip includes a processor and / or program instructions. When the chip is running, the data acquisition method described above is implemented.

The beneficial effects brought by the technical solutions provided in the embodiments of the present application are:

The system, method, device, and computer-readable storage medium provided in the embodiments of the present application adopt a fully asynchronous system architecture. The first data processing module of the data acquisition system instructs the data source to provide the next batch of data, and each data processing module can receive data while processing data, so that different data processing modules process different data asynchronously. This ensures that the data acquisition system can process multiple batches of data at the same time and avoids the situation in which a data processing module must wait for other data processing modules to complete their data processing operations before it can process data, thereby improving the efficiency of data collection and saving data collection time.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description illustrate only some embodiments of the application. A person of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a distributed data acquisition system according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a data acquisition system provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a data acquisition system provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application;

FIG. 9 is a schematic diagram of inserting a fence message by a first data processing module according to an embodiment of the present application;

FIG. 10 is a flowchart of a data collection method according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a node device according to an embodiment of the present application.

Detailed description

In the following, the technical solutions in the embodiments of the present application will be clearly and completely described with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

Data acquisition (DAQ), also known as data collection, refers to the process of automatically collecting data from a data source, processing it, and storing it in a database. As big data technology continues to develop, data sources have become abundant and data types diverse, and the amount of data that a data collection system must store, analyze, and mine is huge. At the same time, the performance requirements for data collection systems keep rising, and high efficiency and high reliability are often required. Therefore, how to build a data acquisition system capable of efficiently collecting massive data has become a key concern in the industry.

The embodiments of the present application provide a new data acquisition system architecture in which the data acquisition system operates asynchronously. The data acquisition system has the following main features. First, high efficiency: different data processing modules in the data acquisition system can process different batches of data asynchronously, and each data processing module can automatically process the next batch of data after processing the current batch, without waiting for other data processing modules to finish. Second, distributed: the data acquisition system supports distributed acquisition, in which different data processing modules are deployed on different node devices and different node devices perform different data processing operations. Third, scalable: the specific processing logic of the data processing modules, the number of data processing modules, the node devices on which they are located, and the amount of concurrency of each data processing module can all be deployed according to actual business needs, which provides high flexibility. Fourth, high reliability: a snapshot mechanism is designed into the data acquisition system to support resuming from a breakpoint, so that after any data processing module goes down and is restarted, it can automatically resume processing the data it was processing before the outage. Fifth, robustness: a backpressure mechanism is designed into the data collection system, so that when the system faces a traffic peak, it can automatically reduce the speed of obtaining data from the data source, which avoids overloading the node devices and degrading their performance.

In the following, the architecture of the data collection system and the method for data collection based on such a system architecture are specifically explained.

FIG. 1 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application. The data acquisition system includes multiple data processing modules. Multiple data processing operations can be allocated to different data processing modules according to actual business requirements. Each data processing module sequentially executes the corresponding data processing operation to complete the function of performing various data processing operations on the data.

System architecture of the data acquisition system: The data acquisition system can be regarded as an acquisition pipeline. The data processing modules in the data acquisition system are connected in cascade, and the data output by one data processing module is the input data of the next data processing module. The first data processing module is connected to the data source, and the last data processing module is connected to the first storage source. Exemplarily, referring to FIG. 1, assume that the data acquisition system includes M data processing modules (M is a positive integer greater than 1). The first data processing module is connected to the data source and the second data processing module. The i-th data processing module (i is a positive integer greater than 1 and less than M) is connected to the (i-1)-th data processing module and the (i+1)-th data processing module. The M-th data processing module (that is, the last data processing module) is connected to the (M-1)-th data processing module and the first storage source.
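By way of illustration only, the cascade connections for M data processing modules can be written out as follows. The function name `build_pipeline` and the string labels are hypothetical.

```python
def build_pipeline(m):
    """Return, for each of M modules, its upstream and downstream neighbor."""
    if m < 2:
        raise ValueError("M must be a positive integer greater than 1")
    links = {}
    for i in range(1, m + 1):
        # The first module reads from the data source; the others from module i-1.
        upstream = "data_source" if i == 1 else f"module_{i - 1}"
        # The last module writes to the first storage source; the others to module i+1.
        downstream = "first_storage_source" if i == m else f"module_{i + 1}"
        links[f"module_{i}"] = (upstream, downstream)
    return links
```

For M = 3, `module_2` is linked to `module_1` upstream and `module_3` downstream, matching the description of the i-th module.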

The data source refers to the source of the data and provides the raw data to be processed. The physical form of the data source can be determined according to the actual business scenario. For example, the data source can be a message queue, in which case the data can be messages in the message queue. As another example, the data source can be a socket port, in which case the data can be socket data. As another example, the data source can be a client, in which case the data can be page data, interactive data, form data, session data, and so on. As another example, the data source can be a camera, in which case the data can be video captured by the camera. As another example, the data source can be a sensor, in which case the data can be data collected by the sensor. The first storage source refers to a storage source for storing data that has undergone the various data processing operations, and may also be referred to as low-level storage. The first storage source is connected to the last data processing module, and can receive and store data processed by the last data processing module. The physical form of the first storage source may be determined according to the actual business scenario, and may be, for example, a hard disk, a database, or an FTP (File Transfer Protocol) server.

In this embodiment, a data processing operation performed by a data processing module is actually performed by the node device on which the data processing module is located. Any node device may include one or more data processing modules to perform the corresponding one or more data processing operations. A data processing module may be implemented in software. For example, the data processing module may be a process, a thread, an object, a method, a function, a code block, a script file, or the like. Further, when the data processing module is a process, it may include multiple threads internally, and the multiple threads perform data processing operations concurrently.

Optionally, the data acquisition system may support a distributed acquisition and / or stand-alone acquisition architecture, which specifically includes the following two designs:

Design one (distributed acquisition): The multiple data processing modules of the data acquisition system are located on multiple node devices. Referring to FIG. 2, the data acquisition system can then be regarded as a collection of multiple node devices, each of which includes one or more data processing modules and performs the data processing operations corresponding to those modules. Different node devices can be connected through a network, can be distributed in different locations, can have different functions, and can have different physical forms, none of which is limited in the embodiments of the present application. Further, the data collection system can adopt a cluster design, that is, the same data processing module can also be deployed on multiple node devices, and the multiple node devices perform the same data processing operation.

The node device on which each data processing module in the data acquisition system is deployed, and the number of data processing modules deployed on each node device, can be determined according to actual business requirements. Exemplarily, referring to FIG. 3, assume that the data acquisition system includes four data processing modules. The four data processing modules can be deployed on three node devices in the cloud, and each node device can host one or more data processing modules.

In the distributed acquisition architecture, data is transmitted between adjacent data processing modules as follows. For any two adjacent data processing modules in the data acquisition system, if the two modules are located on two different node devices, the two node devices can establish a network connection in advance, and the two data processing modules can transmit data through the network connection. Optionally, any data processing module may store the network address of the node device on which the next data processing module is located, so as to send data to that node device based on the network address. In addition, if the two adjacent data processing modules are located on the same node device, data can be transmitted using the device's internal communication mechanism.

Optionally, in order to control each node device during distributed acquisition, the data acquisition system may further include a resource manager. The resource manager can be regarded as a control system that controls the data acquisition system, and can be deployed on a node device outside the data acquisition system or on a node device in the data acquisition system. Developers can configure process description information of the data acquisition system in the resource manager. The process description information includes the connection relationships among the data processing modules and the identifier of each data processing module. The resource manager can send the configured process description information to each node device. Each node device can receive the process description information and query it based on each data processing module on the local machine to determine the next data processing module of each local data processing module, so that each local data processing module sends data to its next data processing module during data collection. The resource manager can be, for example, Yarn or Mesos.
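By way of illustration only, querying the process description information for the next data processing module can be sketched as follows. The application does not specify a format for the process description information, so the dictionary layout, the module identifiers, and the node addresses below are all hypothetical.

```python
# Hypothetical process description: module identifiers plus their connections.
PROCESS_DESCRIPTION = {
    "modules": {
        "parse":  {"node": "10.0.0.1"},
        "filter": {"node": "10.0.0.2"},
        "store":  {"node": "10.0.0.2"},
    },
    "connections": [("parse", "filter"), ("filter", "store")],
}


def next_module(description, module_id):
    """Return (next module id, its node address), or None for the last module."""
    for src, dst in description["connections"]:
        if src == module_id:
            return dst, description["modules"][dst]["node"]
    return None
```

A node device hosting `parse` would look up `filter` and its address and send processed data there, while `store`, having no outgoing connection, is treated as the last module.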

Design two (stand-alone acquisition): Referring to FIG. 4, the multiple data processing modules of the data acquisition system are located on the same node device, and this one node device sequentially performs multiple data processing operations based on the multiple data processing modules. Each data processing module may be a process, a thread, or a method in the node device.

In summary, the data acquisition system provided by the embodiments of the present application supports a distributed acquisition and / or stand-alone acquisition architecture, can be deployed on one or more node devices, and allows data processing modules to be added freely according to actual business needs, which provides high flexibility and scalability.

Functions of the data acquisition system: Based on the data of the data source, the multiple data processing modules of the data acquisition system sequentially perform a variety of data processing operations, and the last data processing module stores the processed data in the first storage source. Each data processing module performs one type of data processing operation, so different data processing operations can be performed on the data by different data processing modules.

Exemplarily, the various data processing operations performed by the data acquisition system may include cleaning, transforming, filtering, deforming, computing statistics on, and detecting the data. In an actual business scenario, take collecting the number of reposts of a blogger's latest Weibo post on a website as an example. The data source can be the website. The first data processing module can pull the Weibo page from the website, convert the page into structured data, and send the structured data to the second data processing module. The second data processing module selects all the data of the latest Weibo post from the structured data and sends it to the third data processing module. The third data processing module extracts the number of reposts from the data of the latest Weibo post and stores the number of reposts in the first storage source.
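By way of illustration only, the three modules of this example can be sketched as plain functions. The page format (one post per line, fields separated by `|`), the sample posts, and the function names are invented for the example; a real Weibo page would require actual page parsing.

```python
def to_structured(page_text):
    """Module 1: turn the raw page into a list of post records."""
    posts = []
    for line in page_text.strip().splitlines():
        posted_at, text, reposts = line.split("|")
        posts.append({"posted_at": posted_at, "text": text,
                      "reposts": int(reposts)})
    return posts


def latest_post(posts):
    """Module 2: keep only the most recent post (ISO timestamps sort as text)."""
    return max(posts, key=lambda p: p["posted_at"])


def repost_count(post):
    """Module 3: extract the repost count to be stored."""
    return post["reposts"]


page = """
2019-05-14T09:00|first post|12
2019-05-16T18:30|latest post|47
2019-05-15T12:00|middle post|30
"""
storage = []   # stands in for the first storage source
storage.append(repost_count(latest_post(to_structured(page))))
```

Each function consumes exactly what the previous one produces, which is what allows the three operations to be assigned to separate data processing modules.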

For the data processing flow within the data acquisition system, the first data processing module obtains data from the data source, processes it, and sends the processed data to the second data processing module. After receiving the data processed by the first data processing module, the second data processing module processes the data and sends the processed data to the third data processing module, and so on. After receiving the data processed by the penultimate data processing module, the last data processing module processes the data and stores the processed data in the first storage source. In this way, data that has undergone the various data processing operations is stored in the first storage source.

In this embodiment, the data collection system implements a fully asynchronous data collection process through the following two points, thereby greatly improving the efficiency of data collection:

First, the first data processing module in the data acquisition system is configured to instruct the data source to provide the next batch of data as soon as it obtains any batch of data from the data source.

In the related technology, the data acquisition system processes the next batch of data only after the current batch has been successfully stored. Under this processing logic, the last thread in the data acquisition system is responsible for feedback to the data source. After the data source provides a batch of data to the first thread, the batch must be processed by each thread in turn; only when the last thread finishes processing the data and stores the processed data in the storage source does the last thread notify the data source to provide the next batch of data. Between the time the first thread obtains the data and the time the last thread provides feedback, the data source does not provide the next batch of data. Even if a thread is idle, it cannot obtain data to process and can only wait for the other threads to finish, which results in low efficiency and wasted processing resources.

In the data acquisition system provided by the embodiments of the present application, both the processing logic and the party that gives feedback to the data source are improved. The processing logic is: as soon as any batch of data has entered the data acquisition system, the data source is notified to provide the next batch of data, without waiting for the batch to pass through each stage of processing within the data acquisition system. Under this processing logic, the first data processing module of the data acquisition system is responsible for feedback to the data source. When the data source provides a batch of data to the first data processing module, the first data processing module instructs the data source to provide the next batch as soon as it has obtained the current batch, and the data source can then continue to provide the next batch to the first data processing module. The data acquisition system can thus process multiple batches of data asynchronously, which avoids one data processing module having to wait for another.

As for the manner in which the first data processing module instructs the data source to provide the next batch of data, the first data processing module may send a confirmation message to the data source. The confirmation message is used to instruct the data source to provide the next batch of data; after receiving the confirmation message, the data source provides the next batch. The format of the confirmation message can be determined according to the communication protocol between the data source and the first data processing module. In addition, the amount of data in any batch provided by the data source can be determined according to the actual business scenario and is not limited in this embodiment.
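The confirm-on-receipt behavior described above can be illustrated with a minimal Python sketch. The class names `DataSource` and `FirstModule`, the `confirm` method, and the sample batches are hypothetical and are not taken from the application; the real confirmation message format depends on the protocol between the data source and the first data processing module.

```python
from collections import deque


class DataSource:
    """Hypothetical data source that releases one batch per confirmation."""

    def __init__(self, batches):
        self._pending = deque(batches)
        self._awaiting_confirmation = False
        self.confirmations = 0

    def next_batch(self):
        # No new batch is released until the previous one is confirmed.
        if self._awaiting_confirmation or not self._pending:
            return None
        self._awaiting_confirmation = True
        return self._pending.popleft()

    def confirm(self):
        # Stand-in for the confirmation message sent by the first module.
        self._awaiting_confirmation = False
        self.confirmations += 1


class FirstModule:
    """Hypothetical first module: confirms on receipt, not after storage."""

    def __init__(self, source):
        self.source = source
        self.inbox = deque()  # batches waiting for downstream processing

    def acquire_all(self):
        while True:
            batch = self.source.next_batch()
            if batch is None:
                break
            self.inbox.append(batch)
            # Confirm immediately so the source can provide the next batch
            # without waiting for this one to reach the storage source.
            self.source.confirm()


source = DataSource([["m1", "m2"], ["m3", "m4"], ["m5"]])
first = FirstModule(source)
first.acquire_all()
```

In this sketch all three batches are inside the system before any of them has been processed downstream, which is the property the paragraph above relies on.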

Exemplarily, the specific process of obtaining data of the data source by the first data processing module may include the following two designs:

Design one (active pull): The first data processing module can pull data from the data source. Specifically, the first data processing module may store the identifier of the data source in advance and access the data source based on that identifier, thereby pulling data from the data source. The identifier of the data source is used to uniquely determine the corresponding data source and may be a network address, a name, an index number, etc. of the data source.

Taking the data source as the message queue as an example, the first data processing module can store the name and network address of the message queue, and pull data from the message queue based on the name and network address of the message queue. The message queue refers to a virtual container that stores messages during the transmission of the messages. The message queues can be Kafka, ActiveMQ, RabbitMQ, ZeroMQ, MetaMQ, and so on.

Design 2 (passive reception): The data source can actively send data to the first data processing module, and the first data processing module can receive data from the data source, thereby obtaining data from the data source.

Taking the data source as the socket port as an example, the data source can call the socket port to send socket data to the first data processing module, and the first data processing module can listen to the socket port to receive socket data.

Second, any data processing module in the data acquisition system is further configured to receive the next batch of data while performing the corresponding data processing operation on any batch of data that has already been received.

In the related technology, each thread in the data acquisition system processes data synchronously: a thread first receives a batch of data and then processes the received data. During this period, the thread cannot receive further data; it must wait until the batch has been processed and sent to the next thread before it can receive the next batch for processing. That is, the thread alternates between receiving data and processing data and can perform only one of these operations at any given time.

In the data acquisition system provided by the embodiments of the present application, any data processing module processes data asynchronously: the data processing module can receive data and process data at the same time. While the data processing module performs a data processing operation on any batch of data, it can receive the next batch without waiting for the current batch to finish. With asynchronous processing, once a batch has been processed, the data processing module can immediately process the next batch that has already been received. This ensures that the data currently being processed does not block the inflow of the next batch, and therefore keeps the data processing module working as continuously as possible. At the same time, different data processing modules in the data acquisition system can process different batches of data. For example, while the third data processing module is still processing the first batch of data, the first data processing module may already have started processing the second batch. In this way, the overall acquisition efficiency of the data acquisition system is improved.

Optionally, in order to buffer received data, any data processing module in the data acquisition system may have a corresponding memory space that is used to buffer the received data. Any data processing module in the data acquisition system is further configured to, while performing the corresponding data processing operation on any batch of data that has been received, buffer the next batch of data it receives into its own memory space. After processing the current batch, the data processing module can then read the next batch directly from the memory space, which improves the efficiency of data processing.

Exemplarily, referring to FIG. 5, in the architecture of the data acquisition system, a memory space may be placed in front of each data processing module. Any batch of data first enters the memory space of the data processing module, is cached there, and then enters the data processing module from the memory space. When data enters the memory space of a data processing module, if the data processing module is currently idle, it can read the data from the memory space and start processing immediately. If the data processing module is still processing earlier data, it can read the data from the memory space and start processing once the earlier data has been processed.

It should be noted that the above description uses only one batch of data and the next batch as an example. In implementation, while a data processing module performs the corresponding data processing operation on any batch of data that has been received, it can receive multiple subsequent batches, and after the current batch has been processed it can perform data processing operations on the received batches in sequence. For example, while the data processing module is processing the first batch of data, if it receives the second batch, it caches the second batch in the memory space; if it then receives the third batch, it caches the third batch in the memory space as well. When the first batch has been processed, the second and third batches are processed in turn, and so on.
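As a rough illustration of a module that keeps receiving while it processes, the following Python sketch gives each module a buffer and a worker thread. The `Stage` class, its `push` method, the `None` shutdown sentinel, and the two sample operations are assumptions made for this sketch; the application does not prescribe threads or a particular queue type.

```python
import queue
import threading


class Stage(threading.Thread):
    """Hypothetical data processing module with its own input buffer."""

    def __init__(self, operation, downstream=None):
        super().__init__(daemon=True)
        self.buffer = queue.Queue()   # memory space placed before the module
        self.operation = operation
        self.downstream = downstream  # next Stage, or None for the last one
        self.results = []             # stands in for the first storage source

    def push(self, batch):
        # Called by the upstream side; returns at once, so receiving
        # never waits for the batch currently being processed.
        self.buffer.put(batch)

    def run(self):
        while True:
            batch = self.buffer.get()  # blocks only while the module is idle
            if batch is None:          # sentinel: shut down and propagate
                if self.downstream:
                    self.downstream.push(None)
                break
            out = self.operation(batch)
            if self.downstream:
                self.downstream.push(out)
            else:
                self.results.append(out)


last = Stage(lambda b: sum(b))
first = Stage(lambda b: [x * 2 for x in b], downstream=last)
first.start()
last.start()
for batch in ([1, 2], [3, 4], [5, 6]):
    first.push(batch)  # later batches may be buffered while an earlier one runs
first.push(None)
first.join()
last.join()
```

Because each buffer is first-in, first-out and each module has a single worker, batches are processed in arrival order, matching the "second batch, then third batch" sequence described above.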

Combined with the design of the memory space, after any data processing module in the data acquisition system performs a data processing operation, it can push the processed data to the memory space of the next data processing module. Exemplarily, a data processing module can store the port number of the communication port of the next data processing module and send the processed data to that communication port based on the port number. The communication port is bound to the corresponding memory space, so the data processed by the data processing module is automatically buffered into the memory space corresponding to the communication port. It should be noted that the memory space described in this paragraph can refer to on-heap memory. In a JVM (Java Virtual Machine) environment, on-heap memory refers to the memory managed by the virtual machine of the node device and has advantages such as ease of implementation. Any data processing module can be used to push the processed data to the on-heap memory of the next data processing module.

Optionally, in combination with the above-mentioned distributed acquisition design, the memory space of a data processing module may further include off-heap memory. Off-heap memory refers to the memory managed by the operating system of the node device and is usually used to store data that the local machine will send to a remote machine.

Based on the principles of on-heap memory and off-heap memory, a data processing module can cache received data in on-heap memory and cache data that is to be sent to data processing modules on other node devices in off-heap memory.

Specifically, after a data processing module has processed data, if it and the next data processing module are located on the same node device, it can push the processed data directly to the on-heap memory of the next data processing module. If it and the next data processing module are located on different node devices, it can serialize the processed data, store the serialized data in its own off-heap memory, and then push the serialized data from the off-heap memory to the on-heap memory of the next data processing module. Serialization refers to converting the format of the data so that the converted data can be transmitted over the network. Optionally, the data processing module may use zero-copy technology to push the serialized data from the off-heap memory to the on-heap memory of the next data processing module. In addition, when a data processing module directly uses memory space of the operating system to receive data, the memory space does not need to be divided into on-heap memory and off-heap memory.

For example, referring to FIG. 6, let a data processing module be represented by p. After data processing module 1 (p1) finishes processing data, if p1 is to send the processed data to data processing module 2 (p2), which is located on the same node device as p1, then p1 can push the processed data directly to the on-heap memory of p2. If p1 is to send the processed data to data processing module 3 (p3), which is located on another node device, then p1 first serializes the processed data, stores the serialized data in the off-heap memory of p1, and then pushes the serialized data from the off-heap memory of p1 to the on-heap memory of p3.
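The routing decision in the p1, p2, p3 example can be sketched in Python as follows. This is only a simulation: Python lists stand in for on-heap and off-heap memory, `pickle` stands in for whatever serialization format an implementation would use, and the `Module` class, `push_to`, `receive_remote`, and the node names are hypothetical. No real network transfer or zero-copy is performed.

```python
import pickle


class Module:
    """Hypothetical data processing module located on a named node device."""

    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.on_heap = []   # buffer for data this module will process
        self.off_heap = []  # staging area for serialized outbound data

    def push_to(self, target, data):
        if target.node == self.node:
            # Same node device: hand the object over directly.
            target.on_heap.append(data)
            return "local"
        # Different node device: serialize, stage in off-heap memory,
        # then transfer to the target.
        payload = pickle.dumps(data)
        self.off_heap.append(payload)
        target.receive_remote(self.off_heap.pop(0))
        return "remote"

    def receive_remote(self, payload):
        # Simulates arrival over the network on the receiving node device.
        self.on_heap.append(pickle.loads(payload))


p1 = Module("p1", node="node-A")
p2 = Module("p2", node="node-A")
p3 = Module("p3", node="node-B")
route_p2 = p1.push_to(p2, {"id": 1})
route_p3 = p1.push_to(p3, {"id": 2})
```

The local path skips serialization entirely, which is the saving the same-node case is meant to provide.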

Optionally, in order to allocate memory space to each data processing module, as shown in FIG. 7, the data acquisition system may further include at least one shared memory pool. Each shared memory pool includes a large amount of memory space and is used to provide memory space for its corresponding data processing modules. For example, each node device in the data acquisition system can deploy one shared memory pool that provides memory space for all data processing modules in that node device. Assuming that the data acquisition system is deployed on M node devices (M is a positive integer not less than 1), the data acquisition system includes M groups of data processing modules and, accordingly, M shared memory pools. The memory space in a shared memory pool can include on-heap memory and off-heap memory. A shared memory pool can include multiple memory pages, each memory page includes multiple memory segments, and one or more memory segments in a memory page can serve as the memory space allocated to a data processing module. In implementation, when the data acquisition system is started, a contiguous memory space of a certain size can be requested through the operating system of each node device and used to create the shared memory pool.

As for how a group of data processing modules shares memory space: before data acquisition begins, a node device can take a certain amount of memory space from the shared memory pool in advance and allocate it evenly among its data processing modules. During data acquisition, each data processing module can store received data in its pre-allocated memory space. Further, as data continuously flows into the memory space of a data processing module, if that memory space becomes insufficient, the data processing module can apply to the shared memory pool for additional memory space and store received data in the newly granted space, thereby expanding its own memory space.

Optionally, in combination with the shared memory pool, this embodiment can support memory reuse. After a data processing module applies for memory space from the shared memory pool, it can cache received data in the granted memory space and later read the data back from it. Once the data processing module has finished using the granted memory space, it can release that memory space back to the shared memory pool. The shared memory pool can then allocate the memory space to other data processing modules, or allocate it to the same data processing module again the next time that module's memory space is insufficient. This realizes memory reuse, improves the utilization of memory space, and saves memory resources. When the memory space in the shared memory pool is organized in units of memory pages, each data processing module can fill one page as completely as possible before using the next page, so that whole contiguous memory pages can be released when memory is released, which avoids memory fragmentation and wasted memory space.
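The pre-allocation, expansion, release, and reuse cycle can be sketched in Python as below. The `SharedMemoryPool` class, its method names, and the use of integer segment ids in place of real memory are assumptions for illustration only; pages are not modeled.

```python
class SharedMemoryPool:
    """Hypothetical pool that hands out fixed-size memory segments."""

    def __init__(self, total_segments):
        self.free = list(range(total_segments))  # ids of unallocated segments
        self.owned = {}                          # module name -> segment ids

    def allocate(self, module, count):
        # Grant up to `count` segments; fewer if the pool is nearly empty.
        granted = [self.free.pop(0) for _ in range(min(count, len(self.free)))]
        self.owned.setdefault(module, []).extend(granted)
        return granted

    def release(self, module, segments):
        # Return segments so they can be reallocated to any module.
        for seg in segments:
            self.owned[module].remove(seg)
            self.free.append(seg)
        self.free.sort()  # keep lower-numbered segments first

    def used(self):
        return sum(len(segs) for segs in self.owned.values())


pool = SharedMemoryPool(total_segments=8)
# Even pre-allocation before data acquisition starts.
for name in ("p1", "p2"):
    pool.allocate(name, 2)
extra = pool.allocate("p1", 3)   # p1 runs short and applies for more
used_at_peak = pool.used()
pool.release("p1", extra)        # p1 returns the extra segments after use
reused = pool.allocate("p2", 3)  # the released segments now serve p2
```

The `used()` figure is the kind of quantity a back pressure check could compare against a threshold.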

In summary, the embodiments of the present application provide a data acquisition system with a fully asynchronous design. The first data processing module of the data acquisition system instructs the data source to provide the next batch of data as soon as it obtains a batch, and each data processing module can receive and process data at the same time. Different data processing modules in the data acquisition system therefore process different data asynchronously, which ensures that the data acquisition system can process multiple batches of data at the same time and avoids the situation in which a data processing module can start processing data only after another data processing module has completed its data processing operation. This improves the efficiency of data acquisition and shortens the time it takes.

Optionally, referring to FIG. 8, the data acquisition system further includes a second storage source. The second storage source is connected to each data processing module in the data acquisition system. The physical form of the second storage source may be similar to that of the first storage source; specifically, the second storage source may be a distributed storage source. It should be noted that the second storage source and the first storage source can be different storage sources, for example deployed on different devices. The second storage source and the first storage source can also be the same storage source. This embodiment of the present application does not limit this.

Function of the second storage source: The second storage source is used to store snapshot units of the data currently being processed by the data acquisition system. A snapshot unit is an image of the currently processed data and can be regarded as a backup and archive of the data while it is in transit through the data acquisition system. A snapshot unit is used to indicate the data processing module that is currently processing the corresponding data. For example, a snapshot unit can carry the identifier of the data processing module that is currently processing the corresponding data, and that data processing module is indicated by the identifier. The identifier of a data processing module is used to uniquely determine the corresponding data processing module and may be the name, number, address, etc. of the data processing module.

In this embodiment, the function of storing snapshot units of the currently processed data in the second storage source can be implemented through the following (1) to (3):

(1) The first data processing module is also used to insert fence messages between different data from the data source, treat all the data between two adjacent fence messages as a snapshot unit, and store the snapshot unit in the second storage source.

The fence message is used to indicate the starting point or end point of the snapshot unit. The first data processing module can divide the data of the data source into different snapshot units by inserting the fence message. The fence message will not affect the normal processing operation of the data by the data processing module. The data processing module can automatically skip when it encounters the fence message during the data processing process.

Regarding the specific process of inserting a fence message, after receiving any batch of data from the data source, the first data processing module can insert a fence message into the data of the data source once every preset number of data items. Alternatively, it can start a timer each time a fence message is inserted and insert another fence message when the elapsed time exceeds a preset duration. The specific values of the preset number and/or the preset duration may be determined according to actual business requirements and may be set in advance by a developer.

For the specific process of storing a snapshot unit in the second storage source: after the first data processing module inserts the fence messages, all data between two adjacent fence messages can be treated as one snapshot unit. That is, for two adjacent fence messages, the earlier fence message serves as the starting point of the snapshot unit and the later fence message serves as its end point; the earlier fence message, all the data between the two fence messages, and the later fence message together constitute one snapshot unit. The first data processing module may then add its own identifier to the snapshot unit and store the snapshot unit, with the identifier added, in the second storage source.

Exemplarily, referring to FIG. 9, assume that the data provided by the data source is represented by mi and a fence message is represented by bi. If the data provided by the data source is m1, m2, m3 ... m100 and the first data processing module inserts a fence message bi every 4 pieces of data, then after the fence messages are inserted the data of the data source becomes b1, m1, m2, m3, m4, b2, m5, m6, m7, m8, b3 ... m100. The first snapshot unit is b1, m1, m2, m3, m4, b2, the second snapshot unit is b2, m5, m6, m7, m8, b3, and so on.
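The count-based fence insertion can be sketched in Python as follows, under the reading that a snapshot unit contains both of its bounding fence messages. The function names `insert_fences` and `snapshot_units`, the string labels, and the convention that a label starting with "b" is a fence are assumptions for this sketch; a closing fence after the final item is also an assumption.

```python
def insert_fences(items, every):
    """Insert a fence before every group of `every` items and at the end."""
    stream, fence_no = [], 0
    for i, item in enumerate(items):
        if i % every == 0:
            fence_no += 1
            stream.append(f"b{fence_no}")
        stream.append(item)
    stream.append(f"b{fence_no + 1}")  # closing fence for the last unit
    return stream


def snapshot_units(stream):
    """Split a fenced stream into units that run from one fence to the next."""
    units, current = [], []
    for element in stream:
        current.append(element)
        # In this sketch, labels starting with "b" are fence messages.
        if element.startswith("b") and len(current) > 1:
            units.append(current)
            current = [element]  # the closing fence opens the next unit
    return units


data = [f"m{i}" for i in range(1, 9)]
stream = insert_fences(data, every=4)
units = snapshot_units(stream)
```

With eight items and a fence every four, the stream contains three fences and yields two snapshot units that share the middle fence.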

(2) Any one of the data processing modules in the data collection system is further configured to add an identification of the data processing module to a snapshot unit to which the data belongs in the second storage source during the processing of any data.

As a snapshot unit flows between the data processing modules of the data acquisition system, the data processing modules process the snapshot unit in sequence. While a data processing module is processing any data, it can determine the snapshot unit to which the data belongs and add its own identifier to that snapshot unit in the second storage source, so that the identifier marks the snapshot unit as currently being processed in that data processing module.

Specifically, a data processing module can dynamically modify the snapshot unit to add its own identifier. Before the data processing module processes the data in a snapshot unit, the snapshot unit carries the identifier of the previous data processing module. When the data processing module processes the data in the snapshot unit, it changes the identifier of the previous data processing module in the snapshot unit to its own identifier. In addition, if the data processing module modifies some of the data in a snapshot unit during processing, it can also replace the data in the corresponding snapshot unit in the second storage source with the modified data, thereby refreshing the snapshot unit. If the data processing module does not modify the received data, the data in the snapshot unit does not need to be modified.

(3) After the last data processing module in the data acquisition system stores any data to the first storage source, it deletes the snapshot unit to which the data belongs from the second storage source.

For any batch of data from the data source, once the batch has been processed by the multiple data processing modules and stored in the first storage source by the last data processing module, the batch can be considered to have fulfilled the purpose of processing and storage. Even if the data acquisition system goes down at this point, the processed batch can be obtained from the first storage source at any time after a restart, so the second storage source no longer needs to store the snapshot unit of this batch. Therefore, after the last data processing module stores any batch of processed data in the first storage source, it can determine the snapshot unit in the second storage source to which the processed data belongs and delete that snapshot unit from the second storage source, thereby saving storage space in the second storage source.

Through (1) to (3) above, this embodiment implements a snapshot mechanism. By marking the data processing module in which data flowing through the data acquisition system is currently located, the system can resume from a breakpoint. While the data acquisition system is running, any data processing module may go down because of factors such as network anomalies or equipment failures. When a data processing module restarts after going down, it can, based on the snapshot units in the second storage source, resume processing the data it was processing before it went down, without having to start over from the first batch of data.

Regarding the process of resuming the processing of data that was being processed before the outage, a data processing module can query the second storage source based on its own identifier and obtain from the second storage source at least one snapshot unit to which its identifier has been added. The data in the at least one snapshot unit is the data that the data processing module was processing before it went down. The data processing module can then continue processing the data of the at least one snapshot unit.
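The life cycle of a snapshot unit (stored by the first module, re-marked by each later module, deleted after final storage, and looked up on restart) can be sketched in Python with an in-memory dictionary. The `SnapshotStore` class, its method names, and the unit and module ids are hypothetical; a real second storage source would be a persistent, possibly distributed store.

```python
class SnapshotStore:
    """Hypothetical in-memory stand-in for the second storage source."""

    def __init__(self):
        self.units = {}  # unit id -> {"module": module id, "data": list}

    def save(self, unit_id, module_id, data):
        self.units[unit_id] = {"module": module_id, "data": list(data)}

    def mark(self, unit_id, module_id, data=None):
        # Replace the previous module's identifier; refresh data if changed.
        unit = self.units[unit_id]
        unit["module"] = module_id
        if data is not None:
            unit["data"] = list(data)

    def delete(self, unit_id):
        # Called once the unit's data is safely in the first storage source.
        self.units.pop(unit_id, None)

    def find_by_module(self, module_id):
        # Units a restarted module was working on before it went down.
        return {uid: u["data"] for uid, u in self.units.items()
                if u["module"] == module_id}


store = SnapshotStore()
store.save("u1", "p1", ["m1", "m2"])
store.save("u2", "p1", ["m3", "m4"])
store.mark("u1", "p2", ["M1", "M2"])  # p2 picks up u1 and modifies its data
store.mark("u1", "p3")                # p3 picks up u1 without changing it
store.delete("u1")                    # p3 stored u1's data; snapshot dropped
store.mark("u2", "p2")
to_resume = store.find_by_module("p2")  # p2 restarts and looks up its units
```

After a simulated restart, p2 finds only the unit it had marked and not yet passed on, so it resumes there rather than from the first batch.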

Optionally, in implementation, in various scenarios such as peak business seasons, peak passenger flows, and promotional activities, the amount of data provided by the data source will increase rapidly, reaching a maximum point, that is, a peak in traffic. In order to cope with traffic peaks, this embodiment provides the following two designs, which can ensure that the data acquisition system maintains the stability of the system when facing a traffic peak.

Design one (back pressure mechanism): When the data acquisition system faces a traffic peak, if memory space is insufficient, the system enters a back pressure state, that is, it reduces the speed at which data is obtained from the data source, so as to prevent the amount of data provided by the data source from exceeding the load capacity of the system.

Optionally, the back pressure mechanism may be specifically implemented by the following (1) and (2):

(1) When the node device where the first data processing module is located meets the preset conditions, the first data processing module reduces the speed of obtaining data from the data source.

During a traffic peak, data from the data source flows continuously into the memory space of the first data processing module, so the first data processing module occupies more and more memory space. During data acquisition, the first data processing module can detect in real time whether the node device on which it is located meets the preset condition. When the node device meets the preset condition, the memory space of the node device is known to be insufficient, and the first data processing module reduces the speed at which it obtains data from the data source. This avoids excessive consumption of the memory space of the node device and thus avoids degrading the performance of the node device.

The preset condition may include any combination of the following condition 1 and condition 2:

Condition 1. The memory currently occupied by the shared memory pool of the node device reaches a first preset threshold.

The shared memory pool of a node device can be regarded as the total memory space that the node device prepares for all of its data processing modules. When too much of the shared memory pool is occupied, the amount of data currently cached by the data processing modules in the node device is too large and exceeds their data processing capacity.

Therefore, the first data processing module can detect in real time whether the memory currently occupied in the corresponding shared memory pool reaches the first preset threshold. When it does, the first data processing module determines that the node device meets the preset condition and reduces the speed of obtaining data from the data source. Then, as each data processing module finishes processing its data, memory space is gradually released back to the shared memory pool, the available memory in the shared memory pool is gradually replenished, and the pool returns to a state in which the occupied memory is less than the first preset threshold. The shared memory pool can thus maintain sufficient memory to ensure the normal operation of each data processing module in the node device. The first preset threshold may be determined according to business requirements and may be set by a developer.

Condition 2: The total number of threads currently applying for memory of the data processing module of the node device reaches a second preset threshold.

Multiple threads can run in any data processing module, and each thread can apply to the shared memory pool for memory to store received data. If too many threads are applying for memory, the amount of data currently cached by the data processing modules in the node device is too large and exceeds their data processing capacity.

Therefore, the first data processing module can detect in real time whether the total number of threads currently applying for memory across the data processing modules in the node device reaches the second preset threshold. If it does, the first data processing module determines that the node device meets the preset condition and reduces the speed of obtaining data from the data source. Exemplarily, assuming that the node device on which the first data processing module p1 is located includes three data processing modules p1, p2, and p3, p1 can detect whether the total number of threads applying for memory in p1, p2, and p3 reaches the second preset threshold. The second preset threshold may be determined according to business requirements and may be set by a developer.
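The check against condition 1 and condition 2 can be sketched as a small Python function. The function name `meets_preset_condition`, its parameters, and all numeric values are hypothetical. The sketch treats either condition as sufficient, which is only one of the combinations the text allows.

```python
def meets_preset_condition(pool_used, pool_threshold,
                           requesting_threads, thread_threshold):
    """Return True when the node device should enter the back pressure state.

    This sketch treats either condition as sufficient; other combinations
    of the two conditions are equally possible.
    """
    pool_pressure = pool_used >= pool_threshold               # condition 1
    thread_pressure = requesting_threads >= thread_threshold  # condition 2
    return pool_pressure or thread_pressure


# Threads currently applying for memory in p1, p2, and p3 on one node device.
threads_per_module = {"p1": 3, "p2": 4, "p3": 2}
total_threads = sum(threads_per_module.values())

calm = meets_preset_condition(pool_used=40, pool_threshold=80,
                              requesting_threads=total_threads,
                              thread_threshold=16)
pressured = meets_preset_condition(pool_used=40, pool_threshold=80,
                                   requesting_threads=total_threads,
                                   thread_threshold=8)
```

Requiring both conditions at once would only change the final `or` to `and`.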

For the specific process of reducing the speed of obtaining data from the data source: when the first data processing module obtains data from the data source by active pull, it can divide the original pull speed by a preset multiple to obtain a reduced pull speed and then pull data at the reduced pull speed. For example, if the first data processing module originally pulls data once every 1 s and the preset multiple is 5, it pulls data once every 5 s after the reduction. When the first data processing module obtains data from the data source by passive reception, it can send a notification message to the data source to notify the data source to reduce the speed of sending data. After being notified, the data source reduces the speed of sending data, which in turn reduces the speed at which the first data processing module obtains data.
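For the active-pull case, the slowdown can be sketched in Python as a change to the interval between pulls. The `Puller` class, its method names, and the return to normal speed when back pressure ends are assumptions made for this sketch; the text above describes only the reduction.

```python
class Puller:
    """Hypothetical active-pull pacing for the first data processing module."""

    def __init__(self, base_interval_s, multiple):
        self.base_interval_s = base_interval_s  # normal time between pulls
        self.multiple = multiple                # preset slowdown multiple
        self.back_pressure = False

    def on_notification(self, slow_down):
        # Set by the module's own check or by a notification message
        # from another data processing module.
        self.back_pressure = slow_down

    def interval(self):
        # Dividing the pull speed by `multiple` is the same as multiplying
        # the time between pulls by `multiple`.
        if self.back_pressure:
            return self.base_interval_s * self.multiple
        return self.base_interval_s


puller = Puller(base_interval_s=1.0, multiple=5)
normal = puller.interval()
puller.on_notification(slow_down=True)
slowed = puller.interval()
puller.on_notification(slow_down=False)
recovered = puller.interval()
```

A pull loop would sleep for `puller.interval()` seconds between requests, so a change in back pressure takes effect on the next pull.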

(2) Any data processing module other than the first data processing module in the data acquisition system is also used to instruct the first data processing module to reduce the speed of obtaining data from the data source when the corresponding node device meets the preset conditions.

Combined with the data flow in the data acquisition system, the first data processing module can be thought of as a faucet that controls the data flow rate in the entire system. The first data processing module can be considered as the source of the data of each subsequent data processing module. The speed at which the first data processing module obtains data directly affects the speed at which each subsequent data processing module obtains data.

Therefore, any data processing module other than the first data processing module can detect whether its corresponding node device meets the preset condition and, when it does, instruct the first data processing module to reduce the speed of obtaining data from the data source. After being instructed, the first data processing module reduces the speed of obtaining data from the data source, and, through a chain reaction, the speed at which each subsequent data processing module obtains data also decreases. To give the instruction, the data processing module may send a notification message to the first data processing module, the notification message being used to notify the first data processing module to reduce the speed of obtaining data from the data source. On receiving the notification message, the first data processing module determines that it should reduce the speed of obtaining data from the data source. In addition, the specific process by which any data processing module other than the first data processing module detects that its node device meets the preset condition is the same as the process described above, and details are not repeated here.
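The notification path can be illustrated with an in-process message queue standing in for whatever transport the modules actually use. The class names, the `SLOW_DOWN` message, and the choice to ignore repeated notifications while already throttled are assumptions of this sketch, not details given in the text.

```python
import queue

SLOW_DOWN = "slow_down"  # hypothetical notification message type


class FirstModule:
    """Stand-in for the first data processing module (the 'faucet')."""

    def __init__(self, pull_interval_s=1.0, preset_multiple=5.0):
        self.pull_interval_s = pull_interval_s
        self.preset_multiple = preset_multiple
        self.inbox = queue.Queue()
        self.throttled = False

    def handle_notifications(self):
        """Drain the inbox and slow down once if any notification asks for it."""
        while True:
            try:
                message = self.inbox.get_nowait()
            except queue.Empty:
                return
            if message == SLOW_DOWN and not self.throttled:
                self.pull_interval_s *= self.preset_multiple
                self.throttled = True


class DownstreamModule:
    """Stand-in for any module other than the first one."""

    def __init__(self, first_module):
        self.first_module = first_module

    def check_node(self, node_meets_preset_condition):
        """Notify the first module when this module's node meets the preset condition."""
        if node_meets_preset_condition:
            self.first_module.inbox.put(SLOW_DOWN)


p1 = FirstModule()
p3 = DownstreamModule(p1)
p3.check_node(node_meets_preset_condition=True)
p3.check_node(node_meets_preset_condition=True)  # a repeated notification
p1.handle_notifications()
```

Because the first module throttles only once, the two notifications leave the pull interval at 5s rather than 25s.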

It should be noted that, on the basis of the backpressure mechanism, this embodiment also provides a mechanism for the data collection system to recover from the backpressure state to the original data collection speed. Corresponding to (1) and (2) of the backpressure mechanism above, the recovery mechanism may specifically include the following (a) and (b).

(a) After the first data processing module reduces the speed of obtaining data from the data source, when the number of times that the node device where the first data processing module is located does not meet the preset condition reaches the preset number of times, the first data processing module restores the speed of obtaining data from the data source.

This design corresponds to (1) of the backpressure mechanism. After the first data processing module reduces the speed of obtaining data from the data source, data flows from the data source into the first data processing module more slowly, and the memory space of the node device where the first data processing module is located gradually becomes sufficient again. During this process, the first data processing module can again detect whether the node device meets the preset condition and, each time the node device does not meet the preset condition, increment a count of such detections. While the count has not reached the preset number of times, the memory space of the node device is still considered insufficient. Restoring the previous speed of obtaining data at this point could quickly push the node device back into the backpressure state and cause its state to fluctuate continuously, so the original speed is not restored for the time being. When the count reaches the preset number of times, the memory space of the node device is considered sufficient, and the speed of obtaining data from the data source before entering the backpressure state is restored. The preset number of times may be determined according to actual service requirements; for example, it may be two.
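The recovery rule amounts to a counter that delays restoring the original speed until enough checks have come back clear. The sketch below assumes a plain accumulated count, as the text describes; the `RecoveryTracker` class and its method names are hypothetical.

```python
class RecoveryTracker:
    """Counts checks in which the node no longer meets the preset condition."""

    def __init__(self, preset_times=2):
        self.preset_times = preset_times
        self.clear_checks = 0
        self.in_backpressure = False

    def enter_backpressure(self):
        """Call when the pull speed has just been reduced."""
        self.in_backpressure = True
        self.clear_checks = 0

    def observe(self, node_meets_preset_condition):
        """Record one check; return True when the original speed should be restored."""
        if not self.in_backpressure or node_meets_preset_condition:
            return False
        self.clear_checks += 1
        if self.clear_checks >= self.preset_times:
            self.in_backpressure = False
            return True
        return False


tracker = RecoveryTracker(preset_times=2)
tracker.enter_backpressure()
results = [tracker.observe(c) for c in (True, False, False, False)]
```

Here the third check is the second clear one, so only that check signals recovery. A variant that resets the count whenever the condition is met again would damp fluctuation further, but the text does not specify that.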

(b) After any data processing module other than the first data processing module in the data acquisition system instructs the first data processing module to reduce the speed of obtaining data from the data source, when the number of times that the node device where that data processing module is located does not meet the preset condition reaches the preset number of times, that data processing module instructs the first data processing module to restore the speed of obtaining data from the data source. To do so, the data processing module may send a recovery message to the first data processing module, the recovery message being used to instruct the first data processing module to restore the speed of obtaining data from the data source.

This design corresponds to (2) of the backpressure mechanism. After the first data processing module reduces the speed of obtaining data from the data source, data flows from the data source into the first data processing module more slowly. Because any other data processing module is affected by the first data processing module, data also flows into that module more slowly, and the memory space of the node device where that module is located gradually becomes sufficient again. During this process, that data processing module can similarly detect whether the number of times its node device does not meet the preset condition reaches the preset number of times. When it does, the data processing module sends a recovery message to the first data processing module. After receiving the recovery message, the first data processing module restores the speed of obtaining data that it used before entering the backpressure state. As the first data processing module obtains data faster, the chain reaction also increases the speed at which that data processing module obtains data.

Design 2 (increasing the amount of concurrency). In implementation, any data processing module in the data acquisition system can create at least one thread, and the processing logic of the data processing operation can be written into each thread. When the data processing module needs to process any batch of data, different data in the batch can be allocated to different threads as needed, and the threads can be controlled to process the data concurrently, thereby improving processing efficiency. That is, any data processing module may perform the corresponding data processing operation on any batch of data based on at least one thread. In general, the more threads the data processing module has, the faster it can process data.
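Distributing one batch across several threads might look like the following sketch, which uses Python's standard `ThreadPoolExecutor`. The `process_record` placeholder and the `concurrency` parameter name are assumptions for illustration; the text does not prescribe a particular thread API.

```python
from concurrent.futures import ThreadPoolExecutor


def process_record(record):
    """Placeholder for the module's data processing operation."""
    return record * 2


def process_batch(batch, concurrency):
    """Process the records of one batch on `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() hands records to the workers and returns results in input order.
        return list(pool.map(process_record, batch))


processed = process_batch([1, 2, 3, 4], concurrency=2)
```

In CPython, threads mainly help with I/O-bound processing, so the gain from raising `concurrency` depends on what the processing operation does.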

Therefore, when any data processing module in the data acquisition system detects a traffic peak, it can increase the amount of concurrency. The amount of concurrency refers to the number of threads that process data in the data processing module. The larger the amount of concurrency, the faster the data processing module processes data, which relieves the pressure of processing large amounts of data.

As for the specific process of detecting a traffic peak, any data processing module can detect the current speed of receiving data and determine whether it reaches a preset speed threshold. When it does, data is flowing into the data processing module quickly, and the module determines that a traffic peak has occurred. Alternatively, any data processing module can detect the current speed of receiving data and calculate the difference between it and the historical speed of receiving data. When the difference reaches a preset difference threshold, the amount of data has grown rapidly compared with the historical collection process, and the module determines that a traffic peak has occurred. Of course, the data processing module can also detect traffic peaks in other ways, which is not limited here.
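The two detection rules above reduce to simple comparisons. In this sketch the function names and the unit (records per second) are assumptions; how the historical speed is measured is left open, as in the text.

```python
def peak_by_threshold(current_speed, preset_speed_threshold):
    """Rule 1: the current receive speed reaches a preset speed threshold."""
    return current_speed >= preset_speed_threshold


def peak_by_difference(current_speed, historical_speed, preset_difference_threshold):
    """Rule 2: the increase over the historical receive speed reaches a preset difference."""
    return (current_speed - historical_speed) >= preset_difference_threshold
```

For example, with a current speed of 1200 records per second and a historical speed of 400, a difference threshold of 500 reports a peak while a speed threshold of 1500 does not.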

In addition, after any one of the data processing modules increases the amount of concurrency, if the traffic peak is no longer detected, the amount of concurrency can be reduced, thereby restoring the amount of concurrency to the previous amount of concurrency.

It should be noted that, during implementation, the data processing module may choose to execute either Design 1 or Design 2. For example, after detecting a traffic peak, the data processing module can determine whether the node device meets the preset condition, perform Design 2 when the node device does not meet the preset condition, and perform Design 1 when the node device meets the preset condition. Of course, the data processing module can also execute Design 1 and Design 2 at the same time.
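One way to read the example selection policy is as a small decision function. The function name and the returned labels are hypothetical, and the sketch covers only the either/or case, not running both designs at once.

```python
def choose_design(traffic_peak, node_meets_preset_condition):
    """Pick a response to a traffic peak under the example policy.

    Design 1 (backpressure) applies when the node already meets the preset
    condition; Design 2 (more concurrency) applies when it does not.
    """
    if not traffic_peak:
        return "none"
    if node_meets_preset_condition:
        return "design_1_backpressure"
    return "design_2_increase_concurrency"
```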

Based on the system architecture of the data acquisition system and the functions of the data processing module inside the data acquisition system described in the embodiment of FIG. 1 described above, the data acquisition method provided by the embodiment of the present application is described below.

FIG. 10 is a flowchart of a data collection method according to an embodiment of the present application. The method is applied to a data collection system and may be implemented through interaction among the data processing modules in the data collection system. Referring to FIG. 10, the method includes the following steps:

1001. When the first data processing module in the data acquisition system obtains any batch of data from the data source, the first data processing module instructs the data source to provide the next batch of data.

1002. When any one of the data processing modules in the data acquisition system performs a corresponding data processing operation on any batch of data that has been received, the any data processing module receives the next batch of data.

1003. The last data processing module in the data acquisition system stores the processed data in the first storage source.

This embodiment provides a fully asynchronous data collection method. The first data processing module of the data collection system instructs the data source to provide the next batch of data as soon as it obtains any batch, and each data processing module can receive and process data at the same time. This ensures that different data processing modules in the data acquisition system can process different data asynchronously and avoids the situation in which a data processing module must wait for other data processing modules to finish processing before it can start, which improves the efficiency of data collection and saves data collection time.
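Steps 1001 to 1003 can be sketched as three stages connected by queues, each running on its own thread so that a stage can accept the next batch while the following stage is still working on the previous one. Everything here is an illustrative assumption: the in-memory queues, the `STOP` sentinel, the "add one" processing step, and the list used as the first storage source.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the stream (an assumption of this sketch)


def first_module(source, out_q):
    """Step 1001: hand each batch on and immediately ask the source for the next one."""
    for batch in source:
        out_q.put(batch)
    out_q.put(STOP)


def middle_module(in_q, out_q):
    """Step 1002: process one batch while later batches queue up behind it."""
    while True:
        batch = in_q.get()
        if batch is STOP:
            out_q.put(STOP)
            return
        out_q.put([x + 1 for x in batch])  # placeholder processing operation


def last_module(in_q, storage):
    """Step 1003: store processed batches in the first storage source."""
    while True:
        batch = in_q.get()
        if batch is STOP:
            return
        storage.append(batch)


def run_pipeline(batches):
    q1, q2 = queue.Queue(), queue.Queue()
    storage = []
    threads = [
        threading.Thread(target=first_module, args=(iter(batches), q1)),
        threading.Thread(target=middle_module, args=(q1, q2)),
        threading.Thread(target=last_module, args=(q2, storage)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return storage


stored = run_pipeline([[1, 2], [3, 4], [5]])
```

Because each queue has a single producer and a single consumer, batches reach storage in the order the source provided them.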

When any data processing module performs a corresponding data processing operation on any batch of data that has been received, the any data processing module receives the next batch of data, including:

When any one data processing module performs a corresponding data processing operation on any batch of data that has been received, the any data processing module buffers the next batch of received data into the memory space of any one data processing module;

After any data processing module finishes processing any batch of data, the any data processing module reads the next batch of data from the memory space.

Any one of the data processing modules applies for memory space from the corresponding shared memory pool;

When the memory space obtained by the application is used up, any data processing module releases the memory space obtained by the application back to the shared memory pool.
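Applying for and releasing memory space from a shared pool can be sketched as simple bookkeeping guarded by a lock. The `SharedMemoryPool` class, its byte-count accounting, and returning `None` when the pool cannot satisfy a request are assumptions of this example rather than details from the text.

```python
import threading


class SharedMemoryPool:
    """Hypothetical pool shared by the data processing modules of one node device."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._lock = threading.Lock()

    def apply(self, size):
        """Reserve `size` bytes and return a buffer, or None if the pool is too full."""
        with self._lock:
            if self.used_bytes + size > self.capacity_bytes:
                return None
            self.used_bytes += size
        return bytearray(size)

    def release(self, block):
        """Return a previously obtained buffer to the pool."""
        with self._lock:
            self.used_bytes -= len(block)


pool = SharedMemoryPool(capacity_bytes=1024)
block = pool.apply(256)
used_while_held = pool.used_bytes
pool.release(block)
```

The `used_bytes` figure is also the kind of value a module could compare against the first preset threshold when checking the backpressure condition.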

Any one of the data processing modules pushes the processed data to the on-heap memory of the next data processing module, including:

When the data processing module and the next data processing module are located on the same node device, the data processing module pushes the processed data to the on-heap memory of the next data processing module; or,

When the data processing module and the next data processing module are located on different node devices, the data processing module serializes the processed data, stores the serialized data in its own off-heap memory, and pushes the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
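The branch between the same-node and cross-node cases can be sketched as below. Python has no on-heap/off-heap distinction, so the two lists are only stand-ins for those memory areas; the `Module` class, the use of `pickle` as the serializer, and the list `pop` standing in for the network transfer are all assumptions.

```python
import pickle


class Module:
    """Hypothetical module with stand-ins for its two memory areas."""

    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.on_heap = []   # stand-in for on-heap memory
        self.off_heap = []  # stand-in for off-heap memory (serialized bytes)


def push(sender, receiver, data):
    """Push processed data to the next module, serializing only across nodes."""
    if sender.node == receiver.node:
        receiver.on_heap.append(data)  # same node: hand over the object directly
        return
    sender.off_heap.append(pickle.dumps(data))  # stage serialized bytes off-heap
    payload = sender.off_heap.pop()             # stands in for the network transfer
    receiver.on_heap.append(payload)


p1 = Module("p1", node="node-a")
p2 = Module("p2", node="node-a")
p3 = Module("p3", node="node-b")
batch = {"id": 1, "rows": [1, 2, 3]}
push(p1, p2, batch)  # same node device
push(p2, p3, batch)  # different node devices
```

The same-node path avoids serialization entirely; how and when the receiving module deserializes the bytes is not covered here.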

In a possible design, the data acquisition system further includes a second storage source, where the second storage source is used to store snapshot units of the data currently processed by the data acquisition system, and each snapshot unit is used to indicate the data processing module that is currently processing the corresponding data.

The method also includes:

When any data processing module is down and restarted, the any data processing module resumes data processing on the data processed before the downtime based on the snapshot unit in the second storage source.

In a possible design, before any one of the data processing modules resumes, based on the snapshot unit in the second storage source, data processing on the data processed before the downtime, the method further includes:

The first data processing module inserts a fence message between different data of the data source;

The first data processing module stores the snapshot unit in the second storage source, wherein the fence message is used to indicate a starting point or an end point of the snapshot unit.

During the processing of any data by any data processing module, the any data processing module adds the identity of the data processing module to the snapshot unit to which the data belongs in the second storage source.
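The fence and snapshot-unit bookkeeping can be illustrated as follows. This is a sketch under several assumptions: fences are inserted after a fixed number of records, the second storage source is a plain dictionary, and a restarted module resumes the units that carry its identifier. None of these specifics come from the text.

```python
FENCE = "FENCE"  # hypothetical fence message


def build_snapshot_units(records, unit_size):
    """Insert a fence after every `unit_size` records and group the data between fences."""
    stream = []
    for i, record in enumerate(records, start=1):
        stream.append(record)
        if i % unit_size == 0:
            stream.append(FENCE)
    units, current = {}, []
    for item in stream:
        if item == FENCE:
            units[len(units)] = {"data": current, "modules": set()}
            current = []
        else:
            current.append(item)
    if current:  # trailing data not yet closed by a fence
        units[len(units)] = {"data": current, "modules": set()}
    return stream, units


def mark_processing(second_storage, unit_id, module_name):
    """Add the module's identifier to the snapshot unit it is processing."""
    second_storage[unit_id]["modules"].add(module_name)


def units_to_resume(second_storage, module_name):
    """After a restart, find the snapshot units this module was processing."""
    return [uid for uid, unit in second_storage.items() if module_name in unit["modules"]]


stream, second_storage = build_snapshot_units(["a", "b", "c", "d", "e"], unit_size=2)
mark_processing(second_storage, 1, "p2")
```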

In a possible design, the method further includes:

When the node device corresponding to any data processing module other than the first data processing module meets a preset condition, that data processing module instructs the first data processing module to reduce the speed at which data is obtained from the data source.

In a possible design, the preset condition includes that a memory currently occupied by a node device's shared memory pool reaches a first preset threshold, and the shared memory pool is used to provide memory space for multiple data processing modules of the node device, This memory space is used to buffer the data that has been received; and / or,

The method also includes:

Any one of the data processing modules is also used to increase the amount of concurrency when a traffic peak is detected, and the amount of concurrency refers to the number of threads that process data in the data processing module.

It should be noted that the data collection method provided by this embodiment belongs to the same concept as the embodiment of the data collection system provided by the embodiment shown in FIG. 1 above, and the specific process thereof is described in detail in the embodiment shown in FIG.

FIG. 11 is a schematic structural diagram of a node device according to an embodiment of the present application. The node device 1100 may vary greatly depending on configuration or performance, and may include one or more processors (central processing units, CPU) 1101 and one or more memories 1102, where at least one instruction is stored in the memory 1102 and is loaded and executed by the processor 1101 to implement the methods provided by the foregoing method embodiments. Of course, the node device may also have components such as a wired or wireless network interface and an input-output interface for input and output, and the node device may further include other components for implementing device functions, and details are not described herein.

In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions, and the instructions may be executed by a processor in a node device to complete the data collection method in the foregoing embodiment. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, or an optical data storage device.

A person of ordinary skill in the art may understand that all or part of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, or an optical disk.

The above is only a preferred embodiment of the present application and is not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the protection scope of the present application.

In an exemplary embodiment, the present application further provides a computer program product containing instructions, which when executed on a node device, enables the node device to implement the data collection method in the foregoing embodiment.

In an exemplary embodiment, the present application further provides a chip that includes a processor and / or program instructions. When the chip runs, the data acquisition method in the foregoing embodiment is implemented.


Claims

A data acquisition system, characterized in that the data acquisition system includes a plurality of data processing modules;

The first data processing module in the data acquisition system is used to instruct the data source to provide the next batch of data when any batch of data from the data source is obtained;

Any data processing module in the data acquisition system is configured to receive the next batch of data when performing a corresponding data processing operation on any batch of data that has been received;

The last data processing module in the data acquisition system is configured to store the processed data in a first storage source.
The data acquisition system according to claim 1, wherein any one of the data processing modules in the data acquisition system has a corresponding memory space, and the memory space is used to buffer received data;

Any one of the data processing modules in the data acquisition system is further configured to, when performing a corresponding data processing operation on any batch of data that has been received, cache the received next batch of data in the memory space of the any one of the data processing modules, and, when the processing of the any batch of data is completed, read the next batch of data from the memory space.
The data acquisition system according to claim 2, wherein the data acquisition system further comprises at least one shared memory pool, and each shared memory pool is used to provide a memory space for a corresponding plurality of data processing modules.
The data acquisition system according to claim 3, wherein any one of the data processing modules is further configured to apply for a memory space from a corresponding shared memory pool and, when the memory space obtained by the application is used up, release the obtained memory space back into the shared memory pool.
The data acquisition system according to any one of claims 2 to 4, wherein a memory space of any data processing module in the data acquisition system includes on-heap memory;

Any one of the data processing modules is further configured to push the processed data to the on-heap memory of a next data processing module.
The data acquisition system according to claim 5, wherein a memory space of any one of the data processing modules in the data acquisition system further comprises off-heap memory;

The any data processing module is further configured to push the processed data to the on-heap memory of the next data processing module when the any data processing module and the next data processing module are located on the same node device; or,

The any data processing module is further configured to, when the any data processing module and the next data processing module are located on different node devices, serialize the processed data, store the serialized data in the off-heap memory of the any data processing module, and push the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
The data acquisition system according to claim 1, wherein the data acquisition system further comprises a second storage source, the second storage source is used to store a snapshot unit of data currently processed by the data acquisition system, and the snapshot unit is used to indicate the data processing module that currently processes the corresponding data;

Any one of the data processing modules is further configured to resume data processing on the data processed before the outage based on the snapshot unit in the second storage source after the any one of the data processing modules is down and restarted.
The data collection system according to claim 7, wherein the first data processing module in the data collection system is further configured to insert a fence message between different data of the data source, use all data between two adjacent fence messages as a snapshot unit, and store the snapshot unit in the second storage source, where the fence message is used to indicate a start point or an end point of the snapshot unit.
The data collection system according to claim 7 or 8, wherein any one of the data processing modules in the data collection system is further configured to, in the process of processing any data, add the identifier of the data processing module to the snapshot unit to which the data belongs in the second storage source.
The data acquisition system according to claim 1 or 2, wherein the first data processing module is further configured to reduce the speed of obtaining data from the data source when the node device where the first data processing module is located meets a preset condition.
The data acquisition system according to claim 10, wherein any data processing module other than the first data processing module in the data acquisition system is further configured to, when a corresponding node device meets a preset condition, Instruct the first data processing module to reduce the speed of obtaining data from the data source.
The data acquisition system according to claim 10 or 11, wherein the preset condition includes that the memory currently occupied in the shared memory pool of the node device reaches a first preset threshold, the shared memory pool being configured to provide memory space for a plurality of data processing modules of the node device, and the memory space being used to buffer received data; and/or,

The preset condition includes that the total number of threads currently applying for memory of the data processing module of the node device reaches a second preset threshold.
The data acquisition system according to claim 1, wherein any one of the data processing modules in the data acquisition system is further configured to perform a corresponding data processing operation on any batch of data based on at least one thread;

Any one of the data processing modules is further configured to increase the amount of concurrency when a traffic peak is detected. The amount of concurrency refers to the number of threads that process data in the data processing module.
The data acquisition system according to claim 1, wherein the plurality of data processing modules are respectively located in a plurality of node devices.
A data acquisition method, characterized in that the method is applied to a data acquisition system, the data acquisition system includes a plurality of data processing modules, and the method includes:

When the first data processing module obtains any batch of data from the data source, the first data processing module instructs the data source to provide the next batch of data;

When any data processing module performs a corresponding data processing operation on any batch of data that has been received, the any data processing module receives the next batch of data;

The last data processing module stores the processed data in the first storage source.
The method according to claim 15, wherein any one of the data processing modules in the data acquisition system has a corresponding memory space, and the memory space is used to buffer the received data;

When any data processing module performs a corresponding data processing operation on any batch of data that has been received, said any data processing module receiving the next batch of data includes:

When the any data processing module performs a corresponding data processing operation on any batch of data that has been received, the any data processing module buffers the received next batch of data in the memory space of the any data processing module;

When any one of the data processing modules finishes processing any of the batches of data, the any one of the data processing modules reads the next batch of data from the memory space.
The method according to claim 15, wherein the data acquisition system further comprises at least one shared memory pool, and each shared memory pool is used to provide a memory space for a corresponding plurality of data processing modules.
The method according to claim 15, characterized in that, when any one of the data processing modules performs a corresponding data processing operation on any batch of data that has been received, the any data processing module receiving the next batch of data comprises:

Any one of the data processing modules applies for a memory space from a corresponding shared memory pool;

When the memory space obtained by the application is used up, any one of the data processing modules releases the memory space obtained by the application back to the shared memory pool.
The method according to any one of claims 16 to 18, wherein a memory space of any one of the data processing modules in the data acquisition system includes on-heap memory;

After the first data processing module instructs the data source to provide the next batch of data, the method further includes:

Any one of the data processing modules pushes the processed data to the on-heap memory of the next data processing module.
The method according to claim 19, wherein the memory space of any data processing module in the data acquisition system further comprises off-heap memory;

Pushing the processed data to the on-heap memory of the next data processing module by any one of the data processing modules includes:

When the any data processing module and the next data processing module are located on the same node device, the any data processing module pushes the processed data to the on-heap memory of the next data processing module; or,

When the any data processing module and the next data processing module are located on different node devices, the any data processing module serializes the processed data, stores the serialized data in the off-heap memory of the any data processing module, and pushes the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
The method according to claim 15, wherein the data acquisition system further comprises a second storage source, the second storage source being used to store a snapshot unit of data currently processed by the data acquisition system, and the snapshot unit being used to indicate the data processing module that currently processes the corresponding data;

The method further includes:

When any one of the data processing modules is down and restarted, the any one of the data processing modules resumes data processing on the data processed before the downtime based on the snapshot unit in the second storage source.
The method according to claim 21, wherein before any one of the data processing modules resumes, based on a snapshot unit in the second storage source, data processing on the data processed before the downtime, the method further comprises:

A first data processing module inserts a fence message between different data of the data source;

The first data processing module uses all data between two adjacent fence messages as a snapshot unit;

The first data processing module stores the snapshot unit to the second storage source, wherein the fence message is used to indicate a start point or an end point of the snapshot unit.
The method according to claim 21 or 22, wherein before any one of the data processing modules resumes, based on a snapshot unit in the second storage source, data processing on the data processed before the downtime, the method further comprises:

In the process of processing any data by any data processing module, the any data processing module adds the identity of the data processing module to a snapshot unit to which the data belongs in the second storage source.
The method according to claim 15 or 16, further comprising:

When the node device where the first data processing module is located meets a preset condition, the first data processing module reduces the speed of obtaining data from the data source.
The method according to claim 24, further comprising:

When the node device corresponding to any data processing module other than the first data processing module meets a preset condition, the any data processing module other than the first data processing module instructs the first data processing module to reduce the speed of obtaining data from the data source.
The method according to claim 24 or 25, wherein the preset condition comprises that the memory currently occupied in a shared memory pool of a node device reaches a first preset threshold, the shared memory pool being used to provide memory space for a plurality of data processing modules of the node device, and the memory space being used to buffer received data; and/or,

The preset condition includes that the total number of threads currently applying for memory of the data processing module of the node device reaches a second preset threshold.
The method according to claim 15, wherein any one of the data processing modules in the data acquisition system is further configured to perform a corresponding data processing operation on any batch of data based on at least one thread;

The method further includes:

Any one of the data processing modules is further configured to increase the amount of concurrency when a traffic peak is detected. The amount of concurrency refers to the number of threads that process data in the data processing module.
The method according to claim 15, wherein the plurality of data processing modules are respectively located in a plurality of node devices.
A node device, wherein the node device includes a processor and a memory, the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data collection method according to any one of claims 15 to 28.
A computer-readable storage medium, characterized in that at least one instruction is stored in the storage medium, and the instruction is loaded and executed by a processor to implement the operations performed by the data collection method according to any one of claims 15 to 28.