WO2019223599A1 - Data acquisition system and method, node device and storage medium - Google Patents

Info

Publication number
WO2019223599A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data processing
processing module
acquisition system
memory
Prior art date
Application number
PCT/CN2019/087226
Other languages
French (fr)
Chinese (zh)
Inventor
郭峰
Original Assignee
杭州海康威视数字技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州海康威视数字技术股份有限公司 filed Critical 杭州海康威视数字技术股份有限公司
Publication of WO2019223599A1 publication Critical patent/WO2019223599A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/40 Data acquisition and logging

Definitions

  • the present application relates to the field of big data technology, and in particular, to a data collection system, method, node device, and storage medium.
  • Data collection refers to the process of processing data in a data source through a series of processing operations, and finally storing the processed data to a storage source. Data collection can help people manage, analyze, and mine data, and it has great economic and application value.
  • In the related art, a data acquisition system includes multiple threads, and each thread is used to perform one data processing operation.
  • The first thread pulls a batch of data from the data source, processes the data, and sends the processed data to the second thread.
  • The second thread receives the data from the first thread, processes it, and sends the processed data to a third thread, and so on.
  • When the last thread receives and processes the data, the processed data is stored in the storage source. The last thread then notifies the data source that the batch of data it provided has been successfully stored. After being notified, the data source provides the next batch of data, the first thread pulls that batch from the data source, and so on.
  • As a result, the entire data collection system can process only one batch of data at a time. Each thread must wait for the other threads to finish processing the current batch before it can start receiving and processing the next batch, so the efficiency of data collection is extremely low.
  • the embodiments of the present application provide a data collection system, method, node device, and storage medium, which can solve the problem of extremely low data collection efficiency in related technologies.
  • the technical solution is as follows:
  • a data acquisition system includes multiple data processing modules;
  • the first data processing module in the data acquisition system is used to instruct the data source to provide the next batch of data when any batch of data from the data source is obtained;
  • Any data processing module in the data acquisition system is configured to receive the next batch of data when performing a corresponding data processing operation on any batch of data that has been received;
  • the last data processing module in the data acquisition system is used to store the processed data in a first storage source.
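  • The arrangement described above can be sketched as a chain of threads linked by queues. This is only a minimal illustration under stated assumptions: the names `run_pipeline`, `SENTINEL`, `acked` and `stored` are hypothetical, the iterable `batches` stands in for the data source, and the list `stored` stands in for the first storage source. For brevity the first module here only acknowledges and forwards each batch, while the processing operations run in the modules that follow it.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker, not part of the source text


def run_pipeline(batches, operations):
    """Run one thread per processing operation, linked by FIFO queues."""
    acked = []   # batches the first module has acknowledged to the data source
    stored = []  # stands in for the first storage source
    links = [queue.Queue() for _ in range(len(operations) + 1)]

    def first_module():
        for batch in batches:        # the iterable stands in for the data source
            acked.append(batch)      # acknowledge on receipt, not after storage
            links[0].put(batch)
        links[0].put(SENTINEL)

    def module(index, operation):
        inbox, outbox = links[index], links[index + 1]
        while True:
            batch = inbox.get()      # the next batch may already be waiting here
            if batch is SENTINEL:
                outbox.put(SENTINEL)
                return
            outbox.put(operation(batch))

    def last_module():
        while True:
            batch = links[-1].get()
            if batch is SENTINEL:
                return
            stored.append(batch)

    threads = [threading.Thread(target=first_module),
               threading.Thread(target=last_module)]
    threads += [threading.Thread(target=module, args=(i, op))
                for i, op in enumerate(operations)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return acked, stored
```

  • Because each queue can hold batches while the downstream thread is busy, a module can accept the next batch before it has finished the current one, which is the behaviour the text attributes to the asynchronous design.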
  • any data processing module in the data acquisition system has a corresponding memory space, and the memory space is used to buffer the received data;
  • Any data processing module in the data acquisition system is further configured to, while performing a corresponding data processing operation on any batch of data that has been received, cache the next batch of received data in the memory space of that data processing module, and, when processing of that batch is completed, read the next batch of data from the memory space.
  • the data acquisition system further includes at least one shared memory pool, and each shared memory pool is used to provide a memory space for a corresponding plurality of data processing modules.
  • any one of the data processing modules is further configured to apply for a memory space from a corresponding shared memory pool.
  • When use of the memory space obtained by the application is finished, the memory space is released back to the shared memory pool.
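  • The apply-and-release cycle against a shared memory pool might look like the following sketch. The class name `SharedMemoryPool`, the byte-count accounting and the choice to return `None` when the pool is exhausted are all assumptions made for illustration; the source does not specify how the pool is implemented.

```python
import threading


class SharedMemoryPool:
    """Fixed-capacity pool shared by several modules (illustrative only)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self._lock = threading.Lock()

    def apply(self, size):
        """Return a buffer of `size` bytes, or None if the pool is exhausted."""
        with self._lock:
            if self.used + size > self.capacity:
                return None
            self.used += size
            return bytearray(size)

    def release(self, buffer):
        """Give the buffer's space back to the pool once it is no longer needed."""
        with self._lock:
            self.used -= len(buffer)
```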
  • the memory space of any data processing module in the data acquisition system includes on-heap memory;
  • Any one of the data processing modules is further configured to push the processed data to the on-heap memory of the next data processing module.
  • the memory space of any data processing module in the data acquisition system further includes off-heap memory;
  • that data processing module is further configured to push the processed data to the on-heap memory of the next data processing module when that data processing module and the next data processing module are located on the same node device; or,
  • that data processing module is further configured to, when it and the next data processing module are located on different node devices, serialize the processed data, store the serialized data in its own off-heap memory, and push the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
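  • The two push paths can be illustrated roughly as follows. Python has no on-heap/off-heap distinction, so a `bytearray` stands in for the off-heap staging area and a plain list stands in for on-heap memory; the class `Module`, the use of `pickle` for serialization, and deserialization on the receiving side are assumptions, not details taken from the source.

```python
import pickle


class Module:
    """A data processing module with stand-ins for its two memory areas."""

    def __init__(self, name, node):
        self.name = name
        self.node = node             # identifier of the hosting node device
        self.on_heap = []            # received objects, ready to be processed
        self.off_heap = bytearray()  # stand-in for off-heap staging memory

    def push(self, data, next_module):
        """Hand processed data to the next module; report which path was used."""
        if self.node == next_module.node:
            # Same node device: hand the object over directly.
            next_module.on_heap.append(data)
            return "direct"
        # Different node devices: serialize into the staging area first.
        self.off_heap[:] = pickle.dumps(data)
        payload = bytes(self.off_heap)  # stands in for the network transfer
        next_module.on_heap.append(pickle.loads(payload))
        return "serialized"
```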
  • the data acquisition system further includes a second storage source, where the second storage source is used to store snapshot units of the data currently processed by the data acquisition system, and each snapshot unit indicates the data processing module that is currently processing the corresponding data;
  • Any one of the data processing modules is further configured to, after that data processing module goes down and restarts, resume processing of the data it was processing before the outage based on the snapshot unit in the second storage source.
  • Combining the snapshot mechanism with a fully asynchronous data acquisition system makes it possible to resume from a breakpoint.
  • After going down and restarting, a data processing module can use the snapshot in the second storage source to resume processing of the data it was processing before the outage. This avoids having to start over from the beginning because of the outage, and improves the robustness and reliability of the data collection system.
  • the first data processing module in the data acquisition system is further configured to insert fence messages between different data of the data source, use all data between two adjacent fence messages as a snapshot unit, and store the snapshot unit in the second storage source, where a fence message indicates a starting point or an end point of a snapshot unit.
  • any data processing module in the data acquisition system is further configured to, while processing any data, add the identifier of that data processing module to the snapshot unit to which the data belongs in the second storage source.
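  • A rough sketch of the fence and snapshot bookkeeping is given below. The fixed `unit_size`, the dictionary used as the second storage source, and all function names are hypothetical; the source does not say how often fences are inserted or how snapshot units are keyed.

```python
FENCE = object()  # hypothetical fence marker


def insert_fences(items, unit_size):
    """Return the stream with a fence before, between and after snapshot units."""
    stream = [FENCE]
    for count, item in enumerate(items, start=1):
        stream.append(item)
        if count % unit_size == 0:
            stream.append(FENCE)
    if stream[-1] is not FENCE:
        stream.append(FENCE)
    return stream


def build_snapshot_units(stream):
    """Group the data between adjacent fences into numbered snapshot units."""
    units, current = {}, None
    for message in stream:
        if message is FENCE:
            if current:
                units[len(units)] = {"data": current, "module": None}
            current = []
        else:
            current.append(message)
    return units


def mark_processing(units, unit_id, module_id):
    """Record which module is currently processing a snapshot unit."""
    units[unit_id]["module"] = module_id


def units_to_resume(units, module_id):
    """After a restart, list the units this module was processing."""
    return [uid for uid, unit in units.items() if unit["module"] == module_id]
```

  • After a restart, a module would call something like `units_to_resume` with its own identifier and reprocess only those units rather than the whole stream.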
  • the first data processing module is further configured to reduce a speed of obtaining data from the data source when a node device where the first data processing module is located meets a preset condition.
  • any data processing module other than the first data processing module in the data acquisition system is further configured to instruct the first data processing module to reduce the speed of obtaining data from the data source when the corresponding node device meets a preset condition.
  • The backpressure mechanism is combined with the fully asynchronous data acquisition system.
  • When a node device meets the preset condition, the first data processing module reduces the speed of obtaining data from the data source. This alleviates memory shortage during traffic peaks, and improves the robustness and reliability of the data acquisition system.
  • the preset condition includes that the memory currently occupied in a shared memory pool of a node device reaches a first preset threshold, where the shared memory pool is used to provide the multiple data processing modules of the node device with memory space for buffering received data; and / or,
  • the preset condition includes that the total number of threads of the data processing modules of the node device that are currently applying for memory reaches a second preset threshold.
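  • The preset condition and the resulting slowdown can be expressed as a small check. In this sketch the two conditions are combined with a logical OR, which is one reading of "and / or"; the halving factor, the minimum rate and the function names are assumptions chosen only for illustration.

```python
def meets_preset_condition(pool_used, pool_threshold,
                           waiting_threads, thread_threshold):
    """True when pool occupancy or the count of threads applying for memory
    reaches its threshold."""
    return pool_used >= pool_threshold or waiting_threads >= thread_threshold


def next_pull_rate(current_rate, condition_met, factor=0.5, min_rate=1):
    """Reduce the rate at which the first module pulls from the data source."""
    if condition_met:
        return max(min_rate, int(current_rate * factor))
    return current_rate
```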
  • any data processing module in the data acquisition system is further configured to perform a corresponding data processing operation on any batch of data based on at least one thread;
  • Any one of the data processing modules is further configured to increase the amount of concurrency when a traffic peak is detected.
  • the amount of concurrency refers to the number of threads that process data in the data processing module.
  • the multiple data processing modules are respectively located in multiple node devices.
  • a data acquisition method is provided. The method is applied to a data acquisition system.
  • the data acquisition system includes multiple data processing modules.
  • the method includes:
  • When the first data processing module obtains any batch of data from the data source, the first data processing module instructs the data source to provide the next batch of data;
  • When any data processing module performs a corresponding data processing operation on any batch of data that has been received, that data processing module receives the next batch of data;
  • the last data processing module stores the processed data in the first storage source.
  • any data processing module in the data acquisition system has a corresponding memory space, and the memory space is used to buffer the received data;
  • When any data processing module performs a corresponding data processing operation on any batch of data that has been received, that data processing module receiving the next batch of data includes:
  • When that data processing module performs a corresponding data processing operation on any batch of data that has been received, it buffers the next batch of received data in its memory space;
  • When that data processing module finishes processing the batch of data, it reads the next batch of data from the memory space.
  • the data acquisition system further includes at least one shared memory pool, and each shared memory pool is used to provide a memory space for a corresponding plurality of data processing modules.
  • any data processing module when any data processing module performs a corresponding data processing operation on any batch of data that has been received, the any data processing module receives the next batch of data, including:
  • Any one of the data processing modules applies for a memory space from a corresponding shared memory pool
  • any one of the data processing modules releases the memory space obtained by the application back to the shared memory pool.
  • the memory space of any data processing module in the data acquisition system includes on-heap memory
  • the method further includes:
  • Any one of the data processing modules pushes the processed data to the on-heap memory of the next data processing module.
  • the memory space of any data processing module in the data acquisition system further includes off-heap memory
  • Pushing the processed data to the on-heap memory of the next data processing module by any one of the data processing modules includes:
  • when that data processing module and the next data processing module are located on the same node device, that data processing module pushes the processed data to the on-heap memory of the next data processing module; or,
  • when that data processing module and the next data processing module are located on different node devices, that data processing module serializes the processed data, stores the serialized data in its off-heap memory, and pushes the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
  • the data acquisition system further includes a second storage source, where the second storage source is used to store a snapshot unit of data currently processed by the data acquisition system, and the snapshot unit is used to indicate the current Data processing module for processing corresponding data;
  • the method further includes:
  • When any one of the data processing modules goes down and restarts, that data processing module resumes processing of the data it was processing before the outage based on the snapshot unit in the second storage source.
  • Before any one of the data processing modules resumes processing of the data it was processing before the outage based on the snapshot unit in the second storage source, the method further includes:
  • a first data processing module inserts a fence message between different data of the data source
  • the first data processing module uses all data between two adjacent fence messages as a snapshot unit
  • the first data processing module stores the snapshot unit to the second storage source, wherein the fence message is used to indicate a start point or an end point of the snapshot unit.
  • Before any one of the data processing modules resumes processing of the data it was processing before the outage based on the snapshot unit in the second storage source, the method further includes:
  • that data processing module adds its identifier to the snapshot unit to which the data belongs in the second storage source.
  • the method further includes:
  • When the node device where the first data processing module is located meets a preset condition, the first data processing module reduces the speed of obtaining data from the data source.
  • the method further includes:
  • When a corresponding node device meets a preset condition, any data processing module other than the first data processing module instructs the first data processing module to reduce the speed of obtaining data from the data source.
  • the preset condition includes that the memory currently occupied in a shared memory pool of a node device reaches a first preset threshold, where the shared memory pool is used to provide the multiple data processing modules of the node device with memory space for buffering received data; and / or,
  • the preset condition includes that the total number of threads of the data processing modules of the node device that are currently applying for memory reaches a second preset threshold.
  • any data processing module in the data acquisition system is further configured to perform a corresponding data processing operation on any batch of data based on at least one thread;
  • the method further includes:
  • Any one of the data processing modules is further configured to increase the amount of concurrency when a traffic peak is detected.
  • the amount of concurrency refers to the number of threads that process data in the data processing module.
  • the multiple data processing modules are respectively located in multiple node devices.
  • In another aspect, a node device is provided. The node device includes a processor and a memory.
  • the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data acquisition method.
  • a computer-readable storage medium stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data collection method.
  • a computer program product containing instructions which when run on a node device, enables the node device to implement the event processing method described above.
  • In another aspect, a chip is provided. The chip includes a processor and / or program instructions. When the chip runs, the event processing method is implemented.
  • the system, method, device and computer-readable storage medium provided in the embodiments of the present application implement a fully asynchronous system architecture.
  • When the first data processing module obtains any batch of data, it instructs the data source to provide the next batch of data, and each data processing module can receive and process data at the same time. Different data processing modules thus process different data asynchronously, so the data acquisition system can process multiple batches of data at the same time. This avoids the situation in which a data processing module must wait for other data processing modules to finish their data processing operations before it can process data, which improves the efficiency of data collection and saves data collection time.
  • FIG. 1 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a distributed data acquisition system according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a data acquisition system provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a data acquisition system provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of inserting a fence message by a first data processing module according to an embodiment of the present application.
  • FIG. 10 is a flowchart of a data collection method according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a node device according to an embodiment of the present application.
  • Data collection, also known as data acquisition (DAQ), refers to the process of automatically collecting data from a data source, processing it, and storing it in a database.
  • the data sources are very rich and the data types are diverse.
  • the amount of data that needs to be stored, analyzed, and mined by the data collection system is huge.
  • the performance requirements on data collection systems are also increasing, and such systems are often required to be highly efficient and highly reliable. Therefore, how to build a data acquisition system capable of efficiently collecting massive data has become a key concern in the industry.
  • the embodiment of the present application sets up a new data acquisition system architecture.
  • the data acquisition system operates asynchronously.
  • the data acquisition system mainly has the following features. First, high efficiency: different data processing modules in the data acquisition system can process different batches of data asynchronously; each data processing module can automatically process the next batch of data after processing the current batch, without waiting for other data processing modules to finish. Second, distributed: the data acquisition system supports distributed acquisition; different data processing modules can be deployed on different node devices, and different node devices perform different data processing operations. Third, scalable: the specific processing logic of each data processing module, the number of data processing modules, the node device on which each data processing module is located, and the concurrency of each data processing module can all be set according to actual business needs, which gives high flexibility.
  • A snapshot mechanism is designed into the data acquisition system to support resuming from a breakpoint. After any data processing module goes down and restarts, it can automatically resume processing the data it was processing before the outage, which makes the system reliable and robust.
  • FIG. 1 is a schematic structural diagram of a data acquisition system according to an embodiment of the present application.
  • the data acquisition system includes multiple data processing modules. Multiple data processing operations can be allocated to different data processing modules according to actual business requirements. Each data processing module sequentially executes the corresponding data processing operation to complete the function of performing various data processing operations on the data.
  • the data acquisition system can be regarded as an acquisition pipeline. Each data processing module in the data acquisition system is connected in cascade. The data output by the previous data processing module is used as the input data of the next data processing module.
  • the first data processing module is connected to the data source, and the last data processing module is connected to the first storage source. Exemplarily, referring to FIG. 1, assume that the data acquisition system includes M data processing modules (M is a positive integer greater than 1). The first data processing module in the data acquisition system is connected to the data source and to the second data processing module.
  • The i-th data processing module in the data acquisition system (i is a positive integer greater than 1 and less than M) is connected to the (i-1)-th data processing module and to the (i+1)-th data processing module.
  • The M-th data processing module (that is, the last data processing module) in the data acquisition system is connected to the (M-1)-th data processing module and to the first storage source.
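  • The cascade connections described above reduce to a simple rule on the module index. The helper below is a hypothetical illustration using 1-based indices as in the text; the returned strings are placeholders for whatever handles a real system would use.

```python
def neighbours(i, m):
    """Return (upstream, downstream) for the i-th of m modules, 1-based."""
    if m < 2 or not 1 <= i <= m:
        raise ValueError("need m > 1 and 1 <= i <= m")
    upstream = "data source" if i == 1 else f"module {i - 1}"
    downstream = "first storage source" if i == m else f"module {i + 1}"
    return upstream, downstream
```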
  • the data source refers to the source of the data.
  • the data source is used to provide the raw data to be processed.
  • the physical form of the data source can be determined according to the actual business scenario.
  • the data source can be a message queue and the data can be messages in the message queue.
  • the data source can be a socket port, and the data can be socket data.
  • the data source can be a client, and the data can be page data, interactive data, form data, session data, etc.
  • the data source can be a camera, and the data can be video captured by the camera.
  • the data source can be a sensor, and the data can be data collected by the sensor.
  • the first storage source refers to a storage source for storing data that has undergone various data processing operations, and may also be referred to as low-level storage or storage.
  • the first storage source is connected to the last data processing module, and can receive and store data processed by the last data processing module.
  • the physical form of the first storage source may be determined according to an actual service scenario, and may be, for example, a hard disk, a database, or an FTP (File Transfer Protocol) server.
  • When a data processing module performs a data processing operation, the operation is actually performed by the node device on which the data processing module is located.
  • Any node device may include one or more data processing modules, so as to perform the corresponding one or more types of data processing operations.
  • the data processing module may be specifically implemented by software.
  • the data processing module may be a process, a thread, an object, a method, a function, a code block, a script file, or the like.
  • the data processing module may include multiple threads internally, and the multiple threads concurrently perform data processing operations.
  • the data acquisition system may support a distributed acquisition and / or stand-alone acquisition architecture, which specifically includes the following two designs:
  • Design 1 (distributed acquisition): the node device on which each data processing module in the data acquisition system is deployed, and the number of data processing modules deployed on each node device, can be determined according to actual business requirements.
  • For example, the data acquisition system includes four data processing modules, and the four data processing modules can be deployed on three node devices in the cloud, with each node device hosting one or more data processing modules.
  • As for how adjacent data processing modules transmit data: for any two adjacent data processing modules in the data acquisition system, if the two modules are located on two different node devices, the two node devices can establish a network connection in advance, and the two data processing modules can transmit data over that network connection.
  • any data processing module may store the network address of the node device corresponding to the next data processing module, so as to send data to the next node device based on the network address.
  • If two adjacent data processing modules are located on the same node device, data transmission can be performed using the internal communication method of the device.
  • the data acquisition system may further include a resource manager
  • the resource manager may be regarded as a control system that controls the data acquisition system
  • the resource manager may be deployed on the data
  • developers can configure the process description information of the data acquisition system in the resource manager.
  • the process description information includes various data processing modules.
  • the resource manager can send the configured process description information to each node device. Each node device receives the process description information and queries it for each data processing module on the local machine, so as to determine the next data processing module of each local data processing module. In this way, each local data processing module sends data to its next data processing module during data collection.
  • the resource manager can be Yarn, Mesos, and so on.
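  • How a node device might query the process description information for the next module can be sketched as below. The source does not give the format of the process description information, so the ordered list of `module`/`node` entries, the example module names and addresses, and both helper functions are purely hypothetical.

```python
# Hypothetical process description: modules in pipeline order with their node address.
PROCESS_DESCRIPTION = [
    {"module": "parse", "node": "10.0.0.1"},
    {"module": "filter", "node": "10.0.0.1"},
    {"module": "extract", "node": "10.0.0.2"},
]


def next_module(description, module_name):
    """Return the entry that follows `module_name`, or None for the last module."""
    names = [entry["module"] for entry in description]
    index = names.index(module_name)  # raises ValueError for unknown modules
    if index + 1 == len(description):
        return None                   # the last module writes to the first storage source
    return description[index + 1]


def is_remote(description, module_name):
    """True when the next module lives on a different node device."""
    names = [entry["module"] for entry in description]
    current = description[names.index(module_name)]
    following = next_module(description, module_name)
    return following is not None and following["node"] != current["node"]
```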
  • Design 2 (single-machine acquisition type): Referring to FIG. 4, multiple data processing modules of the data acquisition system are located in the same node device, and this one node device sequentially performs multiple data processing operations based on multiple data processing modules.
  • Each data processing module may be a process, a thread, or a method in a node device.
  • the data acquisition system provided by the embodiments of the present application supports a distributed acquisition and / or stand-alone acquisition architecture, can be deployed on one or more node devices, and its data processing modules can be expanded arbitrarily according to actual business needs, which provides high flexibility and scalability.
  • Starting from the data of the data source, the multiple data processing modules of the data acquisition system sequentially perform a variety of data processing operations, and the last data processing module stores the processed data in the first storage source. Each data processing module is used to perform one type of data processing operation, and different data processing operations can be performed on the data through different data processing modules.
  • the various data processing operations performed by the data acquisition system may include operations such as cleaning, transforming, filtering, deforming, statistics, and detecting the data.
  • For example, the data source can be a website.
  • the first data processing module can pull a Weibo page from the website, convert the page into structured data, and send the structured data to the second data processing module.
  • the second data processing module filters all the data of the latest Weibo post from the structured data and sends it to the third data processing module.
  • the third data processing module extracts the number of forwards from all the data of the latest Weibo post and stores the number of forwards in the first storage source.
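  • The three-stage Weibo example can be mimicked with three small functions. The page format used here (one post per line as `id|timestamp|forwards`) is invented for the sketch, as are the function names; a real page would need an actual parser.

```python
def to_structured(page):
    """Stage 1: convert a raw page (hypothetical 'id|timestamp|forwards' lines)."""
    posts = []
    for line in page.strip().splitlines():
        post_id, posted_at, forwards = line.split("|")
        posts.append({"id": post_id, "posted_at": posted_at,
                      "forwards": int(forwards)})
    return posts


def latest_post(posts):
    """Stage 2: keep only the most recent post.

    ISO 8601 timestamps in the same format sort chronologically as strings.
    """
    return max(posts, key=lambda post: post["posted_at"])


def forward_count(post):
    """Stage 3: extract the number of forwards to be stored."""
    return post["forwards"]
```

  • Chaining the stages as `forward_count(latest_post(to_structured(page)))` gives the value that the last module would write to the first storage source.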
  • the first data processing module obtains data from the data source, processes it, and sends the processed data to the second data processing module. The second data processing module processes the data and sends the processed data to the third data processing module, and so on. The last data processing module receives the data from the penultimate data processing module, processes it, and stores the processed data in the first storage source. In this way, data that has undergone the various data processing operations is stored in the first storage source.
  • the data collection system implements a fully asynchronous data collection process through the following two points, thereby greatly improving the efficiency of data collection:
  • the first data processing module in the data acquisition system is used to instruct the data source to provide the next batch of data when it obtains any batch of data from the data source.
  • In the related art, the processing logic of the data collection system is: the next batch of data is processed only after the current batch has been successfully stored. Under this logic, the last thread in the data acquisition system is responsible for giving feedback to the data source. After the data source provides a batch of data to the first thread, the batch must be processed by each thread in turn; only when the last thread finishes processing the data and stores the processed data in the storage source does it notify the data source to provide the next batch. Between the first thread obtaining the data and the last thread giving feedback, the data source provides no further data. Even if a thread is idle, it cannot obtain data to process and can only wait for the other threads to finish, which results in inefficiency and wasted processing resources.
  • In the embodiments of the present application, both the processing logic of the data acquisition system and the party that gives feedback to the data source are improved.
  • The processing logic is: as soon as any batch of data has entered the data acquisition system, the data source is notified to provide the next batch of data, without waiting for the batch to be processed in turn within the data acquisition system.
  • The first data processing module of the data acquisition system is responsible for giving feedback to the data source.
  • After the first data processing module obtains a batch of data, it instructs the data source to provide the next batch, and the data source can continue to provide the next batch to the first data processing module. The data collection system can therefore process multiple batches of data asynchronously, which avoids data processing modules waiting for other data processing modules.
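  • The efficiency difference between the two feedback schemes can be estimated with a simple model. Assuming, purely for illustration, that every module takes the same time `step` per batch and that buffering is never a bottleneck, acknowledging only after storage costs `batches * modules * step`, while acknowledging on receipt lets the modules overlap and costs `(modules + batches - 1) * step`. The function names are hypothetical.

```python
def serial_time(batches, modules, step):
    """Total time when the next batch starts only after the previous one is stored."""
    return batches * modules * step


def pipelined_time(batches, modules, step):
    """Total time when each module starts its next batch as soon as it is free.

    The first batch needs `modules` steps; every later batch finishes one step
    after the batch before it.
    """
    return (modules + batches - 1) * step
```

  • For 10 batches, 4 modules and a step of 1 time unit, the model gives 40 units for the serial scheme and 13 units for the pipelined one.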
  • The first data processing module may send a confirmation message to the data source. The confirmation message instructs the data source to provide the next batch of data, and the data source provides the next batch after receiving it.
  • The format of the confirmation message can be determined according to the communication protocol between the data source and the first data processing module.
  • The amount of data in any batch provided by the data source can be determined according to the actual business scenario and is not limited here.
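  • The early confirmation described above can be illustrated with a minimal Python sketch. It is not the implementation of this application: `DataSource`, `FirstModule`, `confirm`, `next_batch`, and `acquire` are hypothetical names, and the data source is assumed to release exactly one batch per confirmation message.

```python
from collections import deque


class DataSource:
    """Hypothetical data source that releases one batch per confirmation."""

    def __init__(self, batches):
        self._pending = deque(batches)
        self._released = deque()
        self.confirm()  # make the first batch available

    def confirm(self):
        """Handle a confirmation message: release the next batch, if any."""
        if self._pending:
            self._released.append(self._pending.popleft())

    def next_batch(self):
        return self._released.popleft() if self._released else None


class FirstModule:
    """Hypothetical first data processing module."""

    def __init__(self, source):
        self.source = source
        self.inbox = deque()  # batches obtained but not yet processed

    def acquire(self):
        batch = self.source.next_batch()
        if batch is None:
            return False
        self.inbox.append(batch)
        # The confirmation is sent as soon as the batch has been obtained,
        # before any processing or storage of that batch takes place.
        self.source.confirm()
        return True


source = DataSource([["m1", "m2"], ["m3", "m4"], ["m5"]])
first = FirstModule(source)
while first.acquire():
    pass
```

  • In this sketch all three batches reach the first module before any of them has been processed, which is the behavior the feedback change is meant to allow.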
  • The first data processing module may obtain data from the data source in either of the following two designs:
  • In Design 1, the first data processing module pulls data from the data source. Specifically, the first data processing module may store the identifier of the data source in advance and access the data source based on that identifier, thereby pulling data from it.
  • The identifier of the data source uniquely determines the corresponding data source and may be a network address, a name, an index number, or the like.
  • For example, when the data source is a message queue, the first data processing module can store the name and network address of the message queue and pull data from the message queue based on them.
  • A message queue is a virtual container that stores messages while they are in transit.
  • The message queue can be Kafka, ActiveMQ, RabbitMQ, ZeroMQ, MetaMQ, and so on.
  • Design 2 (passive reception):
  • The data source actively sends data to the first data processing module, and the first data processing module receives it, thereby obtaining data from the data source.
  • For example, the data source can send socket data to the first data processing module through a socket port, and the first data processing module can listen on that socket port to receive the socket data.
  • Any data processing module in the data collection system is further configured to receive the next batch of data while performing the corresponding data processing operation on a batch that has already been received.
  • In the related art, each thread in the data acquisition system processes data synchronously: a thread first receives a batch of data and then processes it. While processing, the thread cannot receive further data; only after the batch has been processed and sent to the next thread can the thread receive the next batch. In other words, the thread alternates between receiving data and processing data and can perform only one of the two operations at any time.
  • In this embodiment, any data processing module processes data asynchronously: it can receive data and process data at the same time. While the module is processing one batch, it can receive the next batch without waiting for the current batch to finish.
  • After finishing the current batch, the data processing module can automatically go on to the next batch that has already been received. The batch being processed therefore does not block the inflow of the next batch, which keeps the data processing module working as continuously as possible.
  • Different data processing modules in the data acquisition system can process different batches at the same time. For example, while the third data processing module is still processing the first batch of data, the first data processing module may already have started processing the second batch, which improves the overall collection efficiency of the data collection system.
  • Any data processing module in the data acquisition system may have a corresponding memory space, which is used to buffer received data. While performing the corresponding data processing operation on a batch that has already been received, the data processing module buffers the next batch it receives into its memory space. After finishing the current batch, the data processing module can read the next batch directly from the memory space, which improves the efficiency of processing data.
  • Logically, a memory space may be placed in front of each data processing module: any batch of data first enters the memory space of the data processing module, is cached there, and then enters the data processing module from the memory space.
  • If the data processing module is idle, it can read the data from the memory space and start processing immediately. If it is still processing earlier data, it can read the data from the memory space and start processing once the earlier data has been processed.
  • While performing the corresponding data processing operation on a batch that has already been received, a data processing module can receive multiple subsequent batches, and after the current batch is processed it performs the data processing operation on the buffered batches in sequence. For example, if the data processing module receives the second batch while processing the first batch, it caches the second batch in the memory space; if it then receives the third batch, it caches the third batch in the memory space as well. When the first batch has been processed, the second and third batches are processed in turn, and so on.
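  • The buffering behavior described above can be sketched in Python with a queue standing in for the memory space of a module. This is only an illustration under stated assumptions: `BufferedModule`, `receive`, and `close` are hypothetical names, and a `None` sentinel is assumed as the shutdown signal.

```python
import queue
import threading


class BufferedModule:
    """Hypothetical data processing module with its own memory space."""

    def __init__(self, operation):
        self.operation = operation
        self.memory = queue.Queue()  # stand-in for the module's memory space
        self.results = []
        self._worker = threading.Thread(target=self._run)
        self._worker.start()

    def receive(self, batch):
        """Buffer a batch; never waits for the batch currently being processed."""
        self.memory.put(batch)

    def _run(self):
        while True:
            batch = self.memory.get()  # read the next buffered batch in order
            if batch is None:          # assumed shutdown sentinel
                return
            self.results.append([self.operation(item) for item in batch])

    def close(self):
        self.memory.put(None)
        self._worker.join()


module = BufferedModule(str.upper)
for batch in (["a", "b"], ["c"], ["d"]):
    module.receive(batch)  # the second and third batches arrive while the first may still be in progress
module.close()
```

  • Because the queue is first-in, first-out and a single worker drains it, the buffered batches are processed in the order in which they were received.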
  • After any data processing module in the data acquisition system performs its data processing operation, it can push the processed data to the memory space of the next data processing module.
  • Any data processing module can store the port number of the communication port of the next data processing module and send the processed data to that communication port based on the port number.
  • The memory space of the next data processing module is bound to this communication port, so the data sent by the preceding data processing module is automatically buffered into the memory space corresponding to the communication port.
  • The memory space described above can be on-heap memory.
  • On-heap memory is memory managed by the virtual machine of the node device and has advantages such as ease of implementation. Any data processing module can push the processed data to the on-heap memory of the next data processing module.
  • The memory space of a data processing module may further include off-heap memory.
  • Off-heap memory is memory managed by the operating system of the node device. It is usually used to store data that the machine is about to send to a remote end.
  • The data processing module can cache received data in on-heap memory and cache in off-heap memory the data to be sent to data processing modules on other node devices.
  • When a data processing module and the next data processing module are located on the same node device, the data processing module can push the processed data directly to the on-heap memory of the next data processing module. When the two modules are located on different node devices, the data processing module can serialize the processed data, store the serialized data in its own off-heap memory, and push the serialized data from that off-heap memory to the on-heap memory of the next data processing module.
  • Serialization is the process of converting data into a format that can be transmitted over the network.
  • Any data processing module may use a zero-copy technique when pushing the serialized data from its off-heap memory to the on-heap memory of the next data processing module.
  • If the data processing module directly uses memory managed by the operating system to receive data, the memory space does not need to be divided into on-heap memory and off-heap memory.
  • Take the example in which a data processing module is denoted by p.
  • After data processing module 1 (p1) finishes processing data, if p1 is to send the processed data to data processing module 2 (p2) located on the same node device, p1 can push the processed data directly to the on-heap memory of p2.
  • If p1 is to send the processed data to data processing module 3 (p3) located on another node device, p1 first serializes the processed data and stores the serialized data in its own off-heap memory, and then pushes the serialized data from its off-heap memory to the on-heap memory of p3.
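  • The routing decision in the p1, p2, p3 example can be sketched as follows. This is a simplified Python illustration, not the described system: plain lists stand in for on-heap and off-heap memory, `pickle` stands in for whatever serialization format is actually used, the network transfer is simulated by a local call, and no zero-copy transfer is attempted. The names `Module` and `push` are hypothetical.

```python
import pickle


class Module:
    """Hypothetical data processing module located on a named node device."""

    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.on_heap = []   # stand-in for on-heap memory (live objects)
        self.off_heap = []  # stand-in for off-heap memory (bytes awaiting sending)


def push(sender, receiver, data):
    """Push processed data from sender to receiver and report the path taken."""
    if sender.node == receiver.node:
        # Same node device: hand the object over directly.
        receiver.on_heap.append(data)
        return "direct"
    # Different node devices: serialize, stage in the sender's off-heap memory,
    # then deliver to the receiver's on-heap memory.
    sender.off_heap.append(pickle.dumps(data))
    staged = sender.off_heap.pop(0)
    receiver.on_heap.append(pickle.loads(staged))  # simulated network transfer
    return "serialized"


p1 = Module("p1", node="node-A")
p2 = Module("p2", node="node-A")
p3 = Module("p3", node="node-B")
processed = {"batch": 1, "items": ["m1", "m2"]}
path_to_p2 = push(p1, p2, processed)
path_to_p3 = push(p1, p3, processed)
```

  • In the sketch p2 receives the very same object, while p3 receives an equal copy rebuilt from the serialized bytes.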
  • The data acquisition system may further include at least one shared memory pool. Each shared memory pool contains a large amount of memory space and provides memory space for its corresponding data processing modules.
  • For example, each node device in the data acquisition system can deploy one shared memory pool that provides memory space for all data processing modules on that node device. If the data acquisition system is deployed on M node devices (M being a positive integer not less than 1), the data acquisition system includes M groups of data processing modules and, accordingly, M shared memory pools.
  • The memory space in the shared memory pool can include on-heap memory and off-heap memory. The shared memory pool can include multiple memory pages, each memory page includes multiple memory segments, and one or more memory segments of a memory page can serve as the memory space allocated to a data processing module.
  • On each node device, a contiguous memory space of a certain size can be requested from the operating system and used as the shared memory pool.
  • Any node device can take a certain amount of memory space from the shared memory pool in advance and allocate it evenly to its data processing modules.
  • When a data processing module receives data, it can store the data in this pre-allocated memory space.
  • When the pre-allocated memory space is insufficient, the data processing module can apply to the shared memory pool for additional memory space and store the received data there, thereby expanding its own memory space.
  • This embodiment supports memory reuse. After a data processing module obtains memory space from the shared memory pool, it caches received data in that space and later reads the data back out. Once the data has been read and the space is no longer needed, the data processing module releases the additional memory space back to the shared memory pool. The shared memory pool can then allocate the space to another data processing module, or to the same data processing module the next time its memory space is insufficient. Memory is thus reused, which improves the utilization of memory space and saves memory resources.
  • Each data processing module can fill the current memory page as far as possible before using the next page, so that whole memory pages can be released contiguously when memory is freed, avoiding memory fragmentation and wasted memory space.
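  • A shared memory pool with pages, segments, and memory reuse can be sketched as below. This is a toy Python model under stated assumptions: segments are identified by hypothetical `(page, segment)` index pairs rather than real memory, and `SharedMemoryPool`, `allocate`, and `release` are illustrative names.

```python
class SharedMemoryPool:
    """Hypothetical pool made of memory pages, each split into segments."""

    def __init__(self, pages, segments_per_page):
        # Each segment is identified by a (page index, segment index) pair.
        self.free = [(p, s) for p in range(pages) for s in range(segments_per_page)]
        self.owner = {}  # segment -> name of the module currently holding it

    def allocate(self, module, count):
        """Grant `count` segments, filling lower-numbered pages first."""
        if count > len(self.free):
            raise MemoryError("shared memory pool exhausted")
        self.free.sort()  # prefer the earliest page to limit fragmentation
        granted, self.free = self.free[:count], self.free[count:]
        for segment in granted:
            self.owner[segment] = module
        return granted

    def release(self, module, segments):
        """Return segments to the pool so that they can be reallocated."""
        for segment in segments:
            if self.owner.get(segment) != module:
                raise ValueError(f"{segment} is not held by {module}")
            del self.owner[segment]
            self.free.append(segment)


pool = SharedMemoryPool(pages=2, segments_per_page=2)
first_grant = pool.allocate("p1", 2)   # p1 fills page 0
second_grant = pool.allocate("p2", 1)  # p2 starts page 1
pool.release("p1", first_grant)        # p1 has consumed its buffered data
reused_grant = pool.allocate("p2", 2)  # p2 needs more space and reuses page 0
```

  • The last allocation shows reuse: the segments released by p1 are granted again to p2 instead of remaining idle.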
  • The embodiments of the present application provide a fully asynchronous data acquisition system. Once the first data processing module obtains a batch of data, it instructs the data source to provide the next batch, and every data processing module can receive and process data at the same time. Different data processing modules therefore process different data asynchronously, and the data collection system can process multiple batches at once. This avoids the situation in which a data processing module can process data only after other data processing modules have completed their data processing operations, improving the efficiency of data collection and saving collection time.
  • The data acquisition system further includes a second storage source, which is connected to each data processing module in the data acquisition system.
  • The physical form of the second storage source may be similar to that of the first storage source; for example, the second storage source may be a distributed storage source. The second storage source and the first storage source can be different storage sources, for example deployed on different devices, or they can be the same storage source. This embodiment of the present application does not limit this.
  • The second storage source stores snapshot units of the data currently being processed by the data acquisition system.
  • A snapshot unit is an image of the data currently being processed and can be regarded as the data in the data acquisition system.
  • A snapshot unit indicates the data processing module that is currently processing the corresponding data. It can carry the identifier (ID) of that data processing module, and the module is indicated by this identifier.
  • The identifier of a data processing module uniquely determines the corresponding data processing module and may be the name, number, address, or the like of the data processing module.
  • The first data processing module is also used to insert fence messages between different pieces of data from the data source, take all the data between two adjacent fence messages as one snapshot unit, and store the snapshot unit to the second storage source.
  • A fence message indicates the start point or end point of a snapshot unit. By inserting fence messages, the first data processing module divides the data of the data source into different snapshot units.
  • Fence messages do not affect the normal processing of data: a data processing module automatically skips any fence message it encounters while processing data.
  • After receiving any batch of data from the data source, the first data processing module can insert one fence message every preset number of pieces of data. Alternatively, it can start a timer each time a fence message is inserted and insert the next fence message when the elapsed time exceeds a preset duration.
  • The specific values of the preset number and/or the preset duration may be determined according to actual business requirements and may be set in advance by a developer.
  • The first data processing module may add its own identifier to the snapshot unit and then store the snapshot unit, with the identifier added, to the second storage source.
  • For example, let the data provided by the data source be denoted by mi and a fence message by bi. If the data provided by the data source is m1, m2, m3 ... m100 and the first data processing module inserts a fence message every 4 pieces of data, then after insertion the data becomes b1, m1, m2, m3, m4, b2, m5, m6, m7, m8, b3 ... m100. The first snapshot unit is b1, m1, m2, m3, m4, the second snapshot unit is b2, m5, m6, m7, m8, and so on.
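  • The count-based fence insertion in this example can be reproduced with a short Python sketch. The function name `insert_fences` and the choice to list each snapshot unit together with its opening fence message are assumptions made for illustration; the timer-based variant is not shown.

```python
def insert_fences(data, every):
    """Insert a fence message before each group of `every` items.

    Returns the fenced stream and the list of snapshot units, where each
    unit starts with its opening fence message followed by its data.
    """
    stream, units = [], []
    for index, item in enumerate(data):
        if index % every == 0:
            fence = f"b{index // every + 1}"
            stream.append(fence)
            units.append([fence])  # a new snapshot unit begins at this fence
        stream.append(item)
        units[-1].append(item)
    return stream, units


data = [f"m{i}" for i in range(1, 101)]  # m1 ... m100
stream, units = insert_fences(data, every=4)
```

  • With 100 pieces of data and one fence message every 4 pieces, the sketch yields 25 snapshot units and a stream of 125 entries that begins b1, m1, m2, m3, m4, b2.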
  • Any data processing module in the data collection system is further configured, while processing any data, to add its identifier to the snapshot unit in the second storage source to which the data belongs.
  • Each data processing module of the data acquisition system processes the snapshot unit in turn.
  • While processing any data, a data processing module may determine the snapshot unit to which the data belongs and add its own identifier to that snapshot unit in the second storage source, thereby marking, through the identifier, that the snapshot unit is being processed by that data processing module.
  • Any data processing module can dynamically modify the snapshot unit when adding its identifier: before the data processing module processes the data in a snapshot unit, the snapshot unit carries the ID of the previous data processing module; when the data processing module processes the data in the snapshot unit, it changes that ID to its own ID.
  • ID when the data processing module processes the data in the snapshot unit, it will modify the ID of the previous data module in the snapshot unit to the ID of the data processing module.
  • If the data processing module modifies the received data, the data in the corresponding snapshot unit in the second storage source can also be replaced with the modified data, so as to refresh the snapshot unit. If the data processing module does not modify the received data, the data in the snapshot unit does not need to be modified.
  • For any batch of data from the data source, once the batch has been processed by the data processing modules and stored by the last data processing module to the first storage source, the batch can be considered to have been fully processed and stored. Even if the data acquisition system goes down at this point, the processed batch can be obtained from the first storage source at any time after a restart, so the second storage source no longer needs to keep the snapshot unit of this batch. Therefore, after the last data processing module stores any batch of processed data in the first storage source, it can determine the snapshot unit in the second storage source to which the processed data belongs and delete that snapshot unit, saving storage space in the second storage source.
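  • The life cycle of a snapshot unit in the second storage source (saved by the first module, relabeled by each later module, deleted after final storage) can be sketched as follows. The dictionary-backed `SnapshotStore` and its method names are hypothetical; a real second storage source would typically be a distributed store.

```python
class SnapshotStore:
    """Hypothetical second storage source, kept in a dict for illustration."""

    def __init__(self):
        self.units = {}  # unit id -> {"module": module id, "data": list}

    def save(self, unit_id, module_id, data):
        """Store a new snapshot unit labeled with the first module's ID."""
        self.units[unit_id] = {"module": module_id, "data": list(data)}

    def mark_processing(self, unit_id, module_id, data=None):
        """Replace the previous module ID; refresh the data only if it changed."""
        unit = self.units[unit_id]
        unit["module"] = module_id
        if data is not None:
            unit["data"] = list(data)

    def delete(self, unit_id):
        """Drop the unit once its data is safely in the first storage source."""
        self.units.pop(unit_id, None)


store = SnapshotStore()
store.save("b1", "p1", ["m1", "m2"])             # first module stores the unit
store.mark_processing("b1", "p2", ["M1", "M2"])  # p2 modifies the data
after_p2 = dict(store.units["b1"])
store.mark_processing("b1", "p3")                # p3 leaves the data unchanged
after_p3 = dict(store.units["b1"])
store.delete("b1")                               # last module has stored the result
```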
  • By implementing this snapshot mechanism, this embodiment supports resuming from a breakpoint. While the data acquisition system is running, any data processing module may go down because of factors such as network abnormalities or equipment failures. After restarting, the data processing module can use the snapshot units in the second storage source to resume processing the data it was processing before the downtime, without having to start again from the first batch of data.
  • Specifically, after restarting, a data processing module can search the second storage source based on its own identifier and obtain at least one snapshot unit to which that identifier has been added. The data in the at least one snapshot unit is the data the data processing module was processing before it went down, so the data processing module can obtain the at least one snapshot unit from the second storage source and continue processing its data.
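  • Resuming from a breakpoint can be sketched as a lookup by module identifier followed by reprocessing. The record layout (`id`, `module`, `data`) and the function name `resume` are assumptions made for this Python illustration.

```python
def resume(snapshot_units, module_id, operation):
    """Reprocess the snapshot units that carry this module's identifier.

    `snapshot_units` is an iterable of dicts with hypothetical keys
    "id", "module", and "data"; units labeled with other modules are skipped.
    """
    resumed = []
    for unit in snapshot_units:
        if unit["module"] != module_id:
            continue
        resumed.append((unit["id"], [operation(item) for item in unit["data"]]))
    return resumed


second_storage = [
    {"id": "b1", "module": "p3", "data": ["m1", "m2"]},
    {"id": "b2", "module": "p2", "data": ["m5", "m6"]},
    {"id": "b3", "module": "p2", "data": ["m9"]},
]
# p2 restarts after a crash and picks up only the units it was working on.
recovered = resume(second_storage, "p2", str.upper)
```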
  • This embodiment provides the following two designs, which keep the data acquisition system stable when it faces a traffic peak.
  • The back pressure mechanism may be implemented by the following (1) and (2):
  • (1) When the node device where the first data processing module is located meets a preset condition, the first data processing module reduces the speed of obtaining data from the data source.
  • During data collection, the first data processing module can detect in real time whether the node device meets the preset condition. When the node device meets the preset condition, the memory space of the node device is known to be insufficient, and the first data processing module reduces the speed of obtaining data. This avoids excessive consumption of the memory space of the node device and thus avoids degrading its performance.
  • The preset condition may include either or both of the following Condition 1 and Condition 2:
  • Condition 1: The memory currently occupied in the shared memory pool of the node device reaches a first preset threshold.
  • The shared memory pool of a node device can be regarded as the total memory space that the node device prepares for all of its data processing modules. If too much of the shared memory pool is currently occupied, the amount of data currently cached by the data processing modules of the node device is too large and exceeds their data processing capacity.
  • The first data processing module can detect in real time whether the memory currently occupied in the corresponding shared memory pool reaches the first preset threshold. When it does, the first data processing module determines that the node device meets the preset condition and reduces the speed of obtaining data from the data source. Then, as the data processing modules finish processing data, memory space is gradually released back to the shared memory pool, and the pool returns to a state in which the currently occupied memory is below the first preset threshold. The shared memory pool thus keeps enough memory to ensure the normal operation of the data processing modules of the node device.
  • The first preset threshold may be determined according to business requirements and may be set by a developer.
  • Condition 2: The total number of threads of the data processing modules of the node device that are currently applying for memory reaches a second preset threshold.
  • Each thread can request memory from the shared memory pool to store received data. If too many threads are requesting memory, the amount of data currently cached by the data processing modules of the node device is too large and exceeds their data processing capacity.
  • The first data processing module can detect in real time whether the total number of threads of the data processing modules of the node device that are currently applying for memory reaches the second preset threshold. When it does, the first data processing module determines that the node device meets the preset condition and reduces the speed of obtaining data from the data source.
  • For example, if the node device where the first data processing module p1 is located includes three data processing modules p1, p2, and p3, p1 can detect whether the total number of threads applying for memory across p1, p2, and p3 reaches the second preset threshold.
  • The second preset threshold may be determined according to business requirements and may be set by a developer.
  • When the first data processing module pulls data from the data source, it can divide the original pull speed by a preset multiple to obtain a reduced pull speed and pull data at the reduced speed. For example, if the first data processing module originally pulls data every 1 s, it can pull data every 5 s after reducing the pull speed.
  • When the data source sends data to the first data processing module, the first data processing module can notify the data source to reduce the speed of sending data. After being notified, the data source reduces its sending speed, which reduces the speed at which the first data processing module obtains data.
  • For example, the first data processing module can send a notification message to the data source, and the notification message tells the data source to reduce the speed of sending data.
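  • The preset-condition check and the reduced pull speed can be sketched in Python as below. The names `NodeStatus`, `meets_preset_condition`, and `reduced_pull_interval` are hypothetical, the threshold values are arbitrary, and treating the preset condition as "Condition 1 or Condition 2" is only one of the combinations the text allows.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    pool_used: int           # memory currently occupied in the shared memory pool
    threads_requesting: int  # threads currently applying for memory


def meets_preset_condition(status, pool_threshold, thread_threshold):
    """Assumed combination: either condition is enough to trigger back pressure."""
    return (status.pool_used >= pool_threshold
            or status.threads_requesting >= thread_threshold)


def reduced_pull_interval(interval_s, multiple):
    """Dividing the pull speed by `multiple` multiplies the pull interval by it."""
    return interval_s * multiple


def next_pull_interval(status, base_interval_s, multiple, pool_threshold, thread_threshold):
    if meets_preset_condition(status, pool_threshold, thread_threshold):
        return reduced_pull_interval(base_interval_s, multiple)
    return base_interval_s


calm = NodeStatus(pool_used=200, threads_requesting=3)
busy = NodeStatus(pool_used=900, threads_requesting=3)
calm_interval = next_pull_interval(calm, 1.0, 5, pool_threshold=800, thread_threshold=16)
busy_interval = next_pull_interval(busy, 1.0, 5, pool_threshold=800, thread_threshold=16)
```

  • With a preset multiple of 5, a module that pulled every 1 s pulls every 5 s while the condition holds, matching the example in the text.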
  • (2) Any data processing module other than the first data processing module in the data acquisition system is also used to instruct the first data processing module to reduce the speed of obtaining data from the data source when the corresponding node device meets the preset condition.
  • The first data processing module can be thought of as the faucet that controls the data flow rate of the entire system. It is the source of the data of every subsequent data processing module, so the speed at which it obtains data directly affects the speed at which each subsequent data processing module obtains data.
  • Any data processing module other than the first data processing module can detect whether its node device meets the preset condition and, when it does, instruct the first data processing module to reduce the speed of obtaining data from the data source. After being instructed, the first data processing module reduces that speed, and through a chain reaction the speed at which each subsequent data processing module obtains data decreases as well.
  • For example, the data processing module may send a notification message to the first data processing module. The notification message tells the first data processing module to reduce the speed of obtaining data from the data source, and on receiving it the first data processing module determines that the speed should be reduced.
  • The process by which a data processing module other than the first data processing module detects whether its node device meets the preset condition is the same as in (1) above, and details are not repeated here.
  • This embodiment also provides a mechanism by which the data collection system recovers from the back pressure state to the original speed of data collection.
  • The mechanism for recovering from the back pressure state may include the following (a) and (b).
  • (a) After the first data processing module reduces the speed of obtaining data from the data source, when the number of times that the node device where the first data processing module is located does not meet the preset condition reaches a preset number of times, the first data processing module restores the speed of obtaining data from the data source.
  • This design corresponds to (1) of the back pressure mechanism.
  • After the first data processing module reduces the speed of obtaining data from the data source, data flows from the data source into the first data processing module more slowly, and the memory space of the node device where the first data processing module is located gradually becomes sufficient again.
  • The first data processing module can detect again whether the node device meets the preset condition and, each time the node device does not meet the preset condition, increment the count of such detections. While the count has not reached the preset number of times, the memory space of the node device is still considered insufficient; restoring the earlier speed of obtaining data from the data source at this point might quickly push the node device back into the back pressure state.
  • The preset number of times may be determined according to actual service requirements and may be, for example, two.
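  • Recovery from the back pressure state can be sketched as a small state machine. This Python illustration makes one assumption that the text does not settle: the count of "condition not met" detections is reset whenever the condition is met again, so only consecutive clear detections count toward the preset number. `BackPressureController` and `observe` are hypothetical names.

```python
class BackPressureController:
    """Hypothetical controller kept by the first data processing module."""

    def __init__(self, base_interval_s, multiple, preset_times=2):
        self.base_interval_s = base_interval_s
        self.multiple = multiple
        self.preset_times = preset_times
        self.throttled = False
        self.clear_checks = 0  # detections in which the condition was not met

    def observe(self, condition_met):
        """Record one detection result and return the pull interval to use."""
        if condition_met:
            self.throttled = True
            self.clear_checks = 0  # assumption: the count restarts
        elif self.throttled:
            self.clear_checks += 1
            if self.clear_checks >= self.preset_times:
                self.throttled = False  # restore the original speed
                self.clear_checks = 0
        return self.interval()

    def interval(self):
        if self.throttled:
            return self.base_interval_s * self.multiple
        return self.base_interval_s


controller = BackPressureController(base_interval_s=1.0, multiple=5, preset_times=2)
detections = [True, False, True, False, False]
intervals = [controller.observe(met) for met in detections]
```

  • In this run the original interval is restored only after two clear detections in a row; the single clear detection in the second position is not enough.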
  • (b) After any data processing module other than the first data processing module in the data acquisition system instructs the first data processing module to reduce the speed of obtaining data from the data source, when the number of times that the node device where that data processing module is located does not meet the preset condition reaches the preset number of times, that data processing module instructs the first data processing module to restore the speed of obtaining data from the data source.
  • The data processing module may send a recovery message to the first data processing module; the recovery message instructs the first data processing module to restore the speed of obtaining data from the data source.
  • This design corresponds to (2) of the back pressure mechanism.
  • After the first data processing module reduces the speed of obtaining data from the data source, data flows from the data source into the first data processing module more slowly. The other data processing modules are affected in turn: data flows into each of them more slowly, and the memory space of the node device where such a data processing module is located gradually becomes sufficient again.
  • The data processing module can similarly detect whether the number of times that its node device does not meet the preset condition has reached the preset number of times.
  • When the preset number of times is reached, the data processing module sends a recovery message to the first data processing module. After receiving the recovery message, the first data processing module restores the speed at which it obtained data before entering the back pressure state. As the first data processing module obtains data faster, the data processing module, through the same chain reaction, obtains data faster as well.
  • Any data processing module in the data acquisition system can create at least one thread, and the processing logic of the data processing operation can be written into each thread.
  • When the data processing module needs to process a batch of data, different data in the batch can be allocated to different threads as needed, and the threads can be controlled to process the data concurrently, improving the efficiency of processing data.
  • Any data processing module may thus perform the corresponding data processing operation on any batch of data based on at least one thread. In general, the more threads the data processing module has, the faster it can process data.
  • The amount of concurrency is the number of threads that process data in the data processing module. The larger the amount of concurrency, the faster the data processing module processes data, so increasing the amount of concurrency when a traffic peak is detected relieves the pressure of processing a large amount of data.
  • To detect a traffic peak, a data processing module can measure the current speed of receiving data and determine whether it reaches a preset speed threshold. When the speed of receiving data reaches the preset speed threshold, data is flowing into the data processing module rapidly, and a traffic peak is determined to have occurred.
  • Alternatively, a data processing module can measure the current speed of receiving data and calculate the difference between it and the historical speed of receiving data. When the difference reaches a preset difference threshold, the amount of data has grown rapidly compared with the historical collection process, and a traffic peak is determined to have occurred.
  • The data processing module can also detect traffic peaks in other ways, which is not limited here.
  • After a data processing module increases the amount of concurrency, if a traffic peak is no longer detected, it can reduce the amount of concurrency, restoring it to its previous value.
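  • Traffic peak detection and the adjustment of the amount of concurrency can be sketched with Python's standard thread pool. The function names, the threshold values, and the decision to treat the two detection methods as alternatives joined by "or" are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def is_traffic_peak(current_rate, historical_rate, rate_threshold, diff_threshold):
    """Peak if the receiving rate, or its rise over the historical rate, is large."""
    return (current_rate >= rate_threshold
            or current_rate - historical_rate >= diff_threshold)


def choose_concurrency(base, boosted, peak):
    """Use the boosted thread count during a peak and the base count otherwise."""
    return boosted if peak else base


def process_batch(batch, operation, concurrency):
    """Spread the items of one batch over `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(operation, batch))  # map keeps the input order


peak = is_traffic_peak(current_rate=1200, historical_rate=300,
                       rate_threshold=1000, diff_threshold=500)
workers = choose_concurrency(base=2, boosted=8, peak=peak)
doubled = process_batch([1, 2, 3, 4], lambda x: x * 2, workers)
workers_after_peak = choose_concurrency(base=2, boosted=8, peak=False)
```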
  • The data processing module may execute either Design 1 or Design 2. For example, after detecting a traffic peak, it can determine whether the node device meets the preset condition, perform Design 2 when the node device does not meet the preset condition, and perform Design 1 when the node device meets the preset condition. The data processing module can also execute Design 1 and Design 2 at the same time.
  • FIG. 10 is a flowchart of a data collection method according to an embodiment of the present application. The method is applied to a data collection system and may be implemented through interaction among the data processing modules of the data collection system. Referring to FIG. 10, the method includes the following steps:
  • When the first data processing module in the data acquisition system obtains any batch of data from the data source, the first data processing module instructs the data source to provide the next batch of data.
  • While any data processing module in the data acquisition system performs the corresponding data processing operation on any batch of data that has been received, that data processing module receives the next batch of data.
  • The last data processing module in the data acquisition system stores the processed data in the first storage source.
  • this embodiment provides a fully asynchronous data collection method.
  • the first data processing module of the data collection system instructs the data source to provide the next batch of data, and each data processing module can receive and process data simultaneously.
  • any data processing module in the data acquisition system has a corresponding memory space, and the memory space is used to buffer the received data;
  • the step in which any data processing module receives the next batch of data while performing the corresponding data processing operation on any batch of data that it has already received includes:
  • while the data processing module performs the corresponding data processing operation on any batch of data that it has already received, the data processing module buffers the next batch of received data in its memory space;
  • after the data processing module finishes processing the current batch of data, the data processing module reads the next batch of data from the memory space.
  • the data acquisition system further includes at least one shared memory pool, and each shared memory pool is used to provide a memory space for a corresponding plurality of data processing modules.
  • the step in which any data processing module receives the next batch of data while performing the corresponding data processing operation on any batch of data that it has already received includes:
  • once it has finished using the memory space it requested, the data processing module releases that memory space back to the shared memory pool.
  • the memory space of any data processing module in the data acquisition system includes on-heap memory
  • the method further includes:
  • any data processing module pushes the processed data to the on-heap memory of the next data processing module.
  • the memory space of any data processing module in the data acquisition system further includes off-heap memory
  • the step in which any data processing module pushes the processed data to the on-heap memory of the next data processing module includes:
  • when the data processing module and the next data processing module are located on the same node device, the data processing module pushes the processed data to the on-heap memory of the next data processing module;
  • when the data processing module and the next data processing module are located on different node devices, the data processing module serializes the processed data, stores the serialized data in its own off-heap memory, and pushes the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
  • the data acquisition system further includes a second storage source, where the second storage source is used to store snapshot units of the data currently being processed by the data acquisition system, and a snapshot unit indicates the data processing module that is currently processing the corresponding data.
  • the method also includes:
  • after any data processing module goes down and restarts, that data processing module resumes processing the data it was processing before the downtime, based on the snapshot unit in the second storage source.
  • a snapshot mechanism is combined with the fully asynchronous data acquisition system to support resuming from a breakpoint.
  • after a data processing module goes down and restarts, it can resume processing the data it was processing before the downtime based on the snapshot in the second storage source. This avoids having to start processing the data over again because of the downtime and improves the robustness and reliability of the data collection system.
  • before any data processing module resumes processing the data it was processing before the downtime based on the snapshot unit in the second storage source, the method further includes:
  • the first data processing module inserts fence messages between different data of the data source;
  • the first data processing module treats all data between two adjacent fence messages as one snapshot unit;
  • the first data processing module stores the snapshot unit in the second storage source, where a fence message indicates the start point or end point of a snapshot unit.
  • before any data processing module resumes processing the data it was processing before the downtime based on the snapshot unit in the second storage source, the method further includes:
  • while processing any data, the data processing module adds its identifier to the snapshot unit in the second storage source to which the data belongs.
  • the method further includes:
  • when the node device where the first data processing module is located meets a preset condition, the first data processing module reduces the speed at which it obtains data from the data source.
  • the method further includes:
  • when the node device corresponding to any data processing module other than the first data processing module meets a preset condition, that data processing module instructs the first data processing module to reduce the speed at which it obtains data from the data source.
  • a backpressure mechanism is combined with the fully asynchronous data acquisition system.
  • when a node device of the data acquisition system meets the preset condition, the first data processing module reduces the speed at which it obtains data from the data source. This relieves memory shortage during traffic peaks and improves the robustness and reliability of the data acquisition system.
  • the preset condition includes that the memory currently occupied in the shared memory pool of a node device reaches a first preset threshold, where the shared memory pool is used to provide memory space for multiple data processing modules of the node device and the memory space is used to buffer received data; and/or,
  • the preset condition includes that the total number of threads currently applying for memory of the data processing module of the node device reaches a second preset threshold.
  • any data processing module in the data acquisition system is further configured to perform a corresponding data processing operation on any batch of data based on at least one thread;
  • the method also includes:
  • Any one of the data processing modules is also used to increase the amount of concurrency when a traffic peak is detected, and the amount of concurrency refers to the number of threads that process data in the data processing module.
  • the multiple data processing modules are respectively located in multiple node devices.
  • the data collection method provided by this embodiment belongs to the same concept as the embodiment of the data collection system provided by the embodiment shown in FIG. 1 above, and the specific process thereof is described in detail in the embodiment shown in FIG.
  • FIG. 11 is a schematic structural diagram of a node device according to an embodiment of the present application.
  • the node device 1100 may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 1101 and one or more memories 1102, where at least one instruction is stored in the memory 1102 and is loaded and executed by the processor 1101 to implement the methods provided by the foregoing method embodiments.
  • the node device may also have components such as a wired or wireless network interface and an input-output interface for input and output, and the node device may further include other components for implementing device functions, and details are not described herein.
  • a computer-readable storage medium is also provided, such as a memory including instructions, where the instructions may be executed by a processor in a node device to complete the data collection method in the foregoing embodiment.
  • the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, or an optical data storage device.
  • the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned may be a read-only memory, a magnetic disk or an optical disk.
  • the present application further provides a computer program product containing instructions, which when executed on a node device, enables the node device to implement the data collection method in the foregoing embodiment.
  • the present application further provides a chip that includes a processor and/or program instructions.
  • when the chip runs, the data acquisition method in the foregoing embodiment is implemented.

Abstract

The present application relates to the technical field of big data, and disclosed thereby are a data acquisition system and method, a node device and a storage medium. A first data processing module in the data collection system provided by the present application is used for instructing a data source to provide the next batch of data when acquiring any batch of data of the data source; any data processing module in the data collection system is used for receiving the next batch of data while executing a corresponding data processing operation on the received any batch of data; and the last data processing module in the data collection system is used for storing the processed data in a first storage source. In the present application, a fully asynchronous system architecture is designed to ensure that the data acquisition system may process multiple batches of data at the same time, thereby avoiding the situation of a data processing module having to wait for another data processing module to complete a data processing operation before starting to process data, thus improving the data acquisition efficiency.

Description

Data acquisition system, method, node device and storage medium
This application claims priority to Chinese patent application No. 201810515496.4, filed on May 25, 2018 and entitled "Data Acquisition System, Method, Node Device, and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of big data technology, and in particular to a data acquisition system, method, node device, and storage medium.
Background
With the development of big data technology and the rapid growth of massive data in networks, the challenges of data acquisition have become especially prominent. Data acquisition refers to the process of applying a series of processing operations to the data in a data source and finally storing the processed data in a storage source. Data acquisition helps people manage, analyze, and mine data, and has great economic and application value.
At present, data acquisition systems usually adopt a single-machine, multi-threaded architecture and acquire data synchronously. The data acquisition system includes multiple threads, each of which performs one kind of data processing operation. During data acquisition, when the data source provides a batch of data, the first thread pulls the batch from the data source, processes it, and sends the processed data to the second thread. The second thread receives the data from the first thread, processes it, and sends the processed data to the third thread, and so on. After the last thread receives and processes the data, it stores the processed data in the storage source. The last thread then notifies the data source that the batch has been stored successfully. Once notified, the data source provides the next batch, the first thread pulls that batch from the data source, and so on.
In the process of implementing the present application, the inventor found that the related art has at least the following problem:
At any moment, the entire data acquisition system can process only one batch of data. Each thread must wait until the other threads have finished processing the current batch before it can start receiving and processing the next batch, so data acquisition is extremely inefficient.
Summary of the invention
The embodiments of the present application provide a data acquisition system, method, node device, and storage medium that can solve the problem of extremely low data acquisition efficiency in the related art. The technical solutions are as follows:
In one aspect, a data acquisition system is provided, the data acquisition system including multiple data processing modules;
the first data processing module in the data acquisition system is configured to instruct the data source to provide the next batch of data when it obtains any batch of data from the data source;
any data processing module in the data acquisition system is configured to receive the next batch of data while performing the corresponding data processing operation on any batch of data that it has already received;
the last data processing module in the data acquisition system is configured to store the processed data in a first storage source.
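As a rough illustration of the fully asynchronous arrangement described above, the following Python sketch links each data processing module to the next through a queue, so that a module can already have its next batch waiting while it works on the current one. All names (`run_pipeline`, `SENTINEL`, the use of one thread per module) are assumptions made for this sketch and are not taken from the application; the main loop stands in for the first module's fetch step and a list stands in for the first storage source.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker, not part of the source text


def run_pipeline(batches, operations):
    """Run each operation in its own thread; adjacent stages are linked by queues."""
    queues = [queue.Queue() for _ in range(len(operations) + 1)]
    storage = []  # stands in for the "first storage source"

    def stage(op, inbox, outbox):
        while True:
            batch = inbox.get()  # the next batch may already be waiting here
            if batch is SENTINEL:
                outbox.put(SENTINEL)
                return
            outbox.put(op(batch))

    def sink(inbox):
        while True:
            batch = inbox.get()
            if batch is SENTINEL:
                return
            storage.append(batch)

    threads = [
        threading.Thread(target=stage, args=(op, queues[i], queues[i + 1]))
        for i, op in enumerate(operations)
    ]
    threads.append(threading.Thread(target=sink, args=(queues[-1],)))
    for t in threads:
        t.start()

    # Each batch is handed downstream and the next one is requested right away,
    # without waiting for the previous batch to reach the storage source.
    for batch in batches:
        queues[0].put(batch)
    queues[0].put(SENTINEL)

    for t in threads:
        t.join()
    return storage
```

For example, `run_pipeline([[1, 2], [3]], [double, increment])` would let the second batch enter the first stage while the first batch is still in the second stage, which is the behavior the synchronous design in the background section cannot offer.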
In a possible design, any data processing module in the data acquisition system has a corresponding memory space, and the memory space is used to buffer received data;
any data processing module in the data acquisition system is further configured to, while performing the corresponding data processing operation on any batch of data that it has already received, buffer the next batch of received data in its memory space, and to read the next batch of data from the memory space after it has finished processing the current batch.
In a possible design, the data acquisition system further includes at least one shared memory pool, and each shared memory pool is used to provide memory space for its corresponding data processing modules.
In a possible design, any data processing module is further configured to request memory space from the corresponding shared memory pool and, once it has finished using the requested memory space, to release that memory space back to the shared memory pool.
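The request-and-release cycle for the shared memory pool can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `SharedMemoryPool`, the fixed block size, and the choice to return `None` when the pool is exhausted are all hypothetical and are not specified in the application.

```python
import threading


class SharedMemoryPool:
    """Hands out fixed-size bytearrays and takes them back for reuse."""

    def __init__(self, block_size, block_count):
        self._free = [bytearray(block_size) for _ in range(block_count)]
        self._lock = threading.Lock()
        self.block_count = block_count

    def acquire(self):
        """Return a free block, or None if the pool is exhausted."""
        with self._lock:
            return self._free.pop() if self._free else None

    def release(self, block):
        """Give a block back so that another module can use it."""
        with self._lock:
            self._free.append(block)

    def in_use(self):
        """Number of blocks currently held by data processing modules."""
        with self._lock:
            return self.block_count - len(self._free)
```

Because released blocks go back onto the free list, several modules on the same node can share a bounded amount of buffer memory instead of each reserving its own worst-case allocation.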
In a possible design, the memory space of any data processing module in the data acquisition system includes on-heap memory;
any data processing module is further configured to push the processed data to the on-heap memory of the next data processing module.
In a possible design, the memory space of any data processing module in the data acquisition system further includes off-heap memory;
any data processing module is further configured to push the processed data to the on-heap memory of the next data processing module when it and the next data processing module are located on the same node device; or,
any data processing module is further configured to, when it and the next data processing module are located on different node devices, serialize the processed data, store the serialized data in its own off-heap memory, and push the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
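The two push paths can be sketched as below. Python has no on-heap/off-heap distinction, so this is only an analogy: a `deque` stands in for the receiving module's on-heap memory and a `bytearray` stands in for the sender's off-heap staging area. The `Module` class, the `push` function, and the use of `pickle` for serialization are assumptions made for illustration, and no real network transfer takes place.

```python
import pickle
from collections import deque


class Module:
    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.inbox = deque()        # stands in for the module's on-heap memory
        self.staging = bytearray()  # stands in for the module's off-heap memory


def push(sender, receiver, data):
    """Push processed data to the next module and report which path was taken."""
    if sender.node == receiver.node:
        # Same node device: hand over the object itself, no serialization.
        receiver.inbox.append(data)
        return "direct"
    # Different node devices: serialize, stage the bytes, then deliver them.
    sender.staging[:] = pickle.dumps(data)
    payload = bytes(sender.staging)  # in a real system this would cross the network
    receiver.inbox.append(pickle.loads(payload))
    return "serialized"
```

The point of the split is that serialization cost is paid only when data actually has to leave the node device.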
In a possible design, the data acquisition system further includes a second storage source, the second storage source is used to store snapshot units of the data currently being processed by the data acquisition system, and a snapshot unit indicates the data processing module that is currently processing the corresponding data;
any data processing module is further configured to, after it goes down and restarts, resume processing the data it was processing before the downtime based on the snapshot unit in the second storage source.
In this design, a snapshot mechanism is combined with the fully asynchronous data acquisition system to support resuming from a breakpoint. After a data processing module goes down and restarts, it can resume processing the data it was processing before the downtime based on the snapshot in the second storage source. This avoids having to start processing the data over again because of the downtime and improves the robustness and reliability of the data acquisition system.
In a possible design, the first data processing module in the data acquisition system is further configured to insert fence messages between different data of the data source, treat all data between two adjacent fence messages as one snapshot unit, and store the snapshot unit in the second storage source, where a fence message indicates the start point or end point of a snapshot unit.
In a possible design, any data processing module in the data acquisition system is further configured to, while processing any data, add its identifier to the snapshot unit in the second storage source to which the data belongs.
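The fence and snapshot-unit bookkeeping can be sketched as follows. The representation is an assumption for illustration only: a snapshot unit is modelled as a dictionary holding its data and the identifiers of the modules that have touched it, the second storage source is a plain list, and the fixed `unit_size` is a hypothetical way of deciding where fences go, which the application does not specify.

```python
FENCE = object()  # hypothetical marker object standing in for a fence message


def insert_fences(records, unit_size):
    """Return the records with a fence before, between, and after snapshot units."""
    stream = [FENCE]
    for i, record in enumerate(records, start=1):
        stream.append(record)
        if i % unit_size == 0 or i == len(records):
            stream.append(FENCE)
    return stream


def split_snapshot_units(stream):
    """Group all data between two adjacent fences into one snapshot unit."""
    units, current = [], None
    for item in stream:
        if item is FENCE:
            if current:
                units.append({"data": current, "modules": []})
            current = []
        elif current is not None:
            current.append(item)
    return units


def mark_processing(snapshot_store, unit_index, module_id):
    """Record in the snapshot unit that module_id is processing its data."""
    modules = snapshot_store[unit_index]["modules"]
    if module_id not in modules:
        modules.append(module_id)


def units_to_resume(snapshot_store, module_id):
    """After a restart, find the snapshot units this module had been processing."""
    return [i for i, unit in enumerate(snapshot_store) if module_id in unit["modules"]]
```

After a restart, a module would look up the units returned by `units_to_resume` and continue from there rather than from the beginning of the data source.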
In a possible design, the first data processing module is further configured to reduce the speed at which it obtains data from the data source when the node device on which it is located meets a preset condition.
In a possible design, any data processing module other than the first data processing module in the data acquisition system is further configured to instruct the first data processing module to reduce the speed at which it obtains data from the data source when the corresponding node device meets the preset condition.
Based on the above designs, a backpressure mechanism is combined with the fully asynchronous data acquisition system. When a node device of the data acquisition system meets the preset condition, the first data processing module reduces the speed at which it obtains data from the data source, which relieves memory shortage during traffic peaks and improves the robustness and reliability of the data acquisition system.
In a possible design, the preset condition includes that the memory currently occupied in the shared memory pool of the node device reaches a first preset threshold, where the shared memory pool is used to provide memory space for multiple data processing modules of the node device and the memory space is used to buffer received data; and/or,
the preset condition includes that the total number of threads of the data processing modules of the node device that are currently requesting memory reaches a second preset threshold.
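The backpressure check can be sketched as below. The text allows the two conditions to be combined with "and/or"; this sketch takes the "or" reading as one possible interpretation. The function names, the halving factor `slowdown`, and the floor `min_rate` are hypothetical choices made for illustration; the application does not say by how much the fetch speed is reduced.

```python
def meets_preset_condition(pool_bytes_in_use, threads_requesting_memory,
                           memory_threshold, thread_threshold):
    """True if either the memory threshold or the thread threshold is reached."""
    return (pool_bytes_in_use >= memory_threshold
            or threads_requesting_memory >= thread_threshold)


def next_fetch_rate(current_rate, condition_met, slowdown=0.5, min_rate=1.0):
    """Lower the rate at which the first module pulls from the data source."""
    if condition_met:
        return max(min_rate, current_rate * slowdown)
    return current_rate
```

A downstream module on another node device would evaluate `meets_preset_condition` locally and, if it holds, ask the first module to apply `next_fetch_rate`.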
In a possible design, any data processing module in the data acquisition system is further configured to perform the corresponding data processing operation on any batch of data based on at least one thread;
any data processing module is further configured to increase its concurrency when a traffic peak is detected, where the concurrency is the number of threads that process data in the data processing module.
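The concurrency adjustment can be sketched together with the two peak-detection rules this document describes elsewhere: the receiving speed reaching a preset speed threshold, or its difference from the historical receiving speed reaching a preset difference threshold. The function names, the step of one thread, and the cap `max_threads` are assumptions made for this sketch.

```python
def traffic_peak(current_rate, historical_rate, rate_threshold, growth_threshold):
    """Detect a peak by absolute speed or by growth over the historical speed."""
    return (current_rate >= rate_threshold
            or current_rate - historical_rate >= growth_threshold)


def adjust_concurrency(current_threads, baseline_threads, peak,
                       step=1, max_threads=8):
    """Add threads during a peak; fall back to the baseline once it has passed."""
    if peak:
        return min(max_threads, current_threads + step)
    return baseline_threads
```

Returning to `baseline_threads` when no peak is detected mirrors the statement that the concurrency is restored to its previous level after the peak ends.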
In a possible design, the multiple data processing modules are located on multiple node devices respectively.
In another aspect, a data acquisition method is provided. The method is applied to a data acquisition system that includes multiple data processing modules, and the method includes:
when the first data processing module obtains any batch of data from the data source, the first data processing module instructs the data source to provide the next batch of data;
while any data processing module performs the corresponding data processing operation on any batch of data that it has already received, that data processing module receives the next batch of data;
the last data processing module stores the processed data in a first storage source.
In a possible design, any data processing module in the data acquisition system has a corresponding memory space, and the memory space is used to buffer received data;
the step in which any data processing module receives the next batch of data while performing the corresponding data processing operation on any batch of data that it has already received includes:
while the data processing module performs the corresponding data processing operation on any batch of data that it has already received, the data processing module buffers the next batch of received data in its memory space;
after the data processing module finishes processing the current batch of data, the data processing module reads the next batch of data from the memory space.
In a possible design, the data acquisition system further includes at least one shared memory pool, and each shared memory pool is used to provide memory space for its corresponding data processing modules.
In a possible design, the step in which any data processing module receives the next batch of data while performing the corresponding data processing operation on any batch of data that it has already received includes:
the data processing module requests memory space from the corresponding shared memory pool;
once it has finished using the requested memory space, the data processing module releases that memory space back to the shared memory pool.
In a possible design, the memory space of any data processing module in the data acquisition system includes on-heap memory;
after the first data processing module instructs the data source to provide the next batch of data, the method further includes:
any data processing module pushes the processed data to the on-heap memory of the next data processing module.
In a possible design, the memory space of any data processing module in the data acquisition system further includes off-heap memory;
the step in which any data processing module pushes the processed data to the on-heap memory of the next data processing module includes:
when the data processing module and the next data processing module are located on the same node device, the data processing module pushes the processed data to the on-heap memory of the next data processing module; or,
when the data processing module and the next data processing module are located on different node devices, the data processing module serializes the processed data, stores the serialized data in its own off-heap memory, and pushes the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
In a possible design, the data acquisition system further includes a second storage source, the second storage source is used to store snapshot units of the data currently being processed by the data acquisition system, and a snapshot unit indicates the data processing module that is currently processing the corresponding data;
the method further includes:
after any data processing module goes down and restarts, that data processing module resumes processing the data it was processing before the downtime based on the snapshot unit in the second storage source.
In a possible design, before any data processing module resumes processing the data it was processing before the downtime based on the snapshot unit in the second storage source, the method further includes:
the first data processing module inserts fence messages between different data of the data source;
the first data processing module treats all data between two adjacent fence messages as one snapshot unit;
the first data processing module stores the snapshot unit in the second storage source, where a fence message indicates the start point or end point of a snapshot unit.
In a possible design, before any data processing module resumes processing the data it was processing before the downtime based on the snapshot unit in the second storage source, the method further includes:
while processing any data, the data processing module adds its identifier to the snapshot unit in the second storage source to which the data belongs.
In a possible design, the method further includes:
when the node device on which the first data processing module is located meets a preset condition, the first data processing module reduces the speed at which it obtains data from the data source.
In a possible design, the method further includes:
when the node device corresponding to any data processing module other than the first data processing module meets the preset condition, that data processing module instructs the first data processing module to reduce the speed at which it obtains data from the data source.
In a possible design, the preset condition includes that the memory currently occupied in the shared memory pool of the node device reaches a first preset threshold, where the shared memory pool is used to provide memory space for multiple data processing modules of the node device and the memory space is used to buffer received data; and/or,
the preset condition includes that the total number of threads of the data processing modules of the node device that are currently requesting memory reaches a second preset threshold.
In a possible design, any data processing module in the data acquisition system is further configured to perform the corresponding data processing operation on any batch of data based on at least one thread;
the method further includes:
any data processing module increases its concurrency when a traffic peak is detected, where the concurrency is the number of threads that process data in the data processing module.
In a possible design, the multiple data processing modules are located on multiple node devices respectively.
另一方面,提供了一种节点设备,所述节点设备包括处理器和存储器,所述存储器中存储有至少一条指令,所述指令由所述处理器加载并执行以实现上述数据采集方法所执行的操作。In another aspect, a node device is provided. The node device includes a processor and a memory. The memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the data acquisition method. Operation.
另一方面,提供了一种计算机可读存储介质,所述存储介质中存储有至少一条指令,所述指令由所述处理器加载并执行以实现上述数据采集方法所执行的操作。In another aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data collection method.
另一方面,提供了一种包含指令的计算机程序产品,当其在节点设备上运行时,使得所述节点设备能够实现上述事件处理方法。In another aspect, a computer program product containing instructions is provided, which when run on a node device, enables the node device to implement the event processing method described above.
另一方面,提供了一种芯片,所述芯片包括处理器和/或程序指令,当所述芯片运行时,实现上述事件处理方法。In another aspect, a chip is provided. The chip includes a processor and / or program instructions. When the chip is running, the event processing method is implemented.
The technical solutions provided in the embodiments of the present application bring the following beneficial effects:
The system, method, device, and computer-readable storage medium provided in the embodiments of the present application adopt a fully asynchronous system architecture. The data acquisition system instructs the data source, through the first data processing module, to provide the next batch of data, and each data processing module can receive data and process data at the same time. Different data processing modules therefore process different data asynchronously, and the data acquisition system can process multiple batches of data simultaneously. This avoids the situation in which a data processing module can start processing data only after other data processing modules have finished their data processing operations, which improves the efficiency of data acquisition and shortens the time it takes.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic architecture diagram of a data acquisition system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a distributed data acquisition system according to an embodiment of the present application;
FIG. 3 is a schematic architecture diagram of a data acquisition system according to an embodiment of the present application;
FIG. 4 is a schematic architecture diagram of a data acquisition system according to an embodiment of the present application;
FIG. 5 is a schematic architecture diagram of a data acquisition system according to an embodiment of the present application;
FIG. 6 is a schematic architecture diagram of a data acquisition system according to an embodiment of the present application;
FIG. 7 is a schematic architecture diagram of a data acquisition system according to an embodiment of the present application;
FIG. 8 is a schematic architecture diagram of a data acquisition system according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the first data processing module inserting a fence message according to an embodiment of the present application;
FIG. 10 is a flowchart of a data acquisition method according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a node device according to an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Data acquisition (DAQ) refers to the process of automatically collecting data from a data source, processing it, and storing it. As big data technology develops, data sources have become abundant and data types diverse, so a data acquisition system must store, analyze, and mine a very large amount of data. Performance requirements for data acquisition systems are also rising, and such systems are often required to be highly available and highly reliable. How to build a data acquisition system that can efficiently collect massive data has therefore become a key concern in the industry.
The embodiments of the present application provide a new architecture for a data acquisition system that runs fully asynchronously. The data acquisition system has the following main features. First, high efficiency: different data processing modules in the data acquisition system can process different batches of data asynchronously, and each data processing module automatically processes the next batch of data as soon as it finishes the current one, without waiting for other data processing modules to finish. Second, distribution: the data acquisition system supports distributed acquisition, so different data processing modules can be deployed in different node devices, and different node devices perform different data processing operations. Third, scalability: the specific processing logic of each data processing module, the number of data processing modules, the node device where each data processing module is located, and the concurrency of each data processing module can all be deployed according to actual business requirements, which gives high flexibility. Fourth, high reliability: the data acquisition system includes a snapshot mechanism that supports resuming from a breakpoint, so after any data processing module goes down and restarts, it can automatically resume processing the data it was handling before the failure. Fifth, robustness: the data acquisition system includes a back-pressure mechanism, so when the system faces a traffic peak, it automatically reduces the speed at which data is obtained from the data source, which avoids overloading the node devices and degrading their performance.
The architecture of the data acquisition system provided in the embodiments of the present application, and the method for acquiring data based on this architecture, are described in detail below.
FIG. 1 is a schematic architecture diagram of a data acquisition system according to an embodiment of the present application. The data acquisition system includes multiple data processing modules. Multiple data processing operations can be assigned to different data processing modules according to actual business requirements, and the data processing modules perform their corresponding data processing operations in sequence, so that the multiple data processing operations are performed on the data.
System architecture of the data acquisition system: the data acquisition system can be regarded as an acquisition pipeline in which the data processing modules are cascaded in sequence. The data output by one data processing module is the input of the next data processing module. The first data processing module is connected to the data source, and the last data processing module is connected to the first storage source. For example, referring to FIG. 1, assume that the data acquisition system includes M data processing modules (M is a positive integer greater than 1). The first data processing module is connected to the data source and the second data processing module. The i-th data processing module (i is a positive integer greater than 1 and less than M) is connected to the (i-1)-th data processing module and the (i+1)-th data processing module. The M-th data processing module (that is, the last data processing module) is connected to the (M-1)-th data processing module and the first storage source.
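The connection rule for the M cascaded modules can be written out as a short Python sketch. The function name `neighbours` and the string labels it returns are illustrative only and do not come from the application.

```python
def neighbours(i, m):
    """Return (upstream, downstream) for the i-th of m cascaded modules (1-indexed)."""
    if m < 2:
        raise ValueError("the pipeline is assumed to contain more than one module")
    if not 1 <= i <= m:
        raise ValueError("module index out of range")
    # The first module reads from the data source; every other module reads from module i-1.
    upstream = "data source" if i == 1 else f"module {i - 1}"
    # The last module writes to the first storage source; every other module feeds module i+1.
    downstream = "first storage source" if i == m else f"module {i + 1}"
    return upstream, downstream
```

For M = 4, `neighbours(1, 4)` gives the data source and module 2, and `neighbours(4, 4)` gives module 3 and the first storage source.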
The data source is where the data originates and provides the raw data to be processed. Its physical form can be determined according to the actual business scenario. For example, the data source may be a message queue, in which case the data is the messages in the queue; the data source may be a socket port, in which case the data is socket data; the data source may be a client, in which case the data may be page data, interaction data, form data, session data, and so on; the data source may be a camera, in which case the data is video captured by the camera; or the data source may be a sensor, in which case the data is data collected by the sensor. The first storage source is the storage source that stores the data after the multiple data processing operations have been performed, and may also be called underlying storage or simply storage. The first storage source is connected to the last data processing module and can receive and store the data processed by that module. The physical form of the first storage source can be determined according to the actual business scenario; for example, it may be a hard disk, a database, or an FTP (File Transfer Protocol) server.
In this embodiment, a data processing operation performed by a data processing module is in fact performed by the node device where the data processing module is located. Any node device may include one or more data processing modules and thereby perform the corresponding one or more data processing operations. A data processing module may be implemented in software; for example, it may be a process, a thread, an object, a method, a function, a code block, or a script file. Further, when a data processing module is a process, it may contain multiple threads that perform the data processing operation concurrently.
Optionally, the data acquisition system may support a distributed acquisition architecture and/or a stand-alone acquisition architecture, which specifically include the following two designs:
Design 1 (distributed acquisition): the multiple data processing modules of the data acquisition system are located in multiple node devices. In this case, referring to FIG. 2, the data acquisition system can be regarded as a collection of node devices. Each node device includes one or more data processing modules and performs the data processing operations corresponding to those modules. Different node devices can be connected through a network, can be located in different places, can have different functions, and can take different physical forms, which is not limited in the embodiments of the present application. Further, the data acquisition system may adopt a clustered design, in which the same data processing module is deployed in multiple node devices and those node devices perform the same data processing operation.
Which node device each data processing module is deployed in, and how many data processing modules each node device hosts, can be determined according to actual business requirements. For example, referring to FIG. 3, assume that the data acquisition system includes four data processing modules. The four data processing modules can be deployed across three node devices in the cloud, and each node device can host one or more data processing modules.
In the distributed acquisition architecture, data is transmitted between adjacent data processing modules as follows. For any two adjacent data processing modules in the data acquisition system, if the two modules are located in two different node devices, the two node devices can establish a network connection in advance, and the two modules transmit data over that connection. Optionally, any data processing module may store the network address of the node device corresponding to the next data processing module, so that it can send data to that node device based on the network address. If the two adjacent data processing modules are located in the same node device, they can transmit data using a communication mechanism internal to the device.
Optionally, to control each node device during distributed acquisition, the data acquisition system may further include a resource manager, which can be regarded as the control system of the data acquisition system. The resource manager may be deployed in a node device outside the data acquisition system or in a node device within it. A developer can configure flow description information for the data acquisition system in the resource manager. The flow description information includes the connection relationships between the data processing modules and the identifier of each data processing module. The resource manager can send the configured flow description information to each node device. Each node device receives the flow description information and queries it using the identifier of each local data processing module to determine the next data processing module of each local module, so that each local data processing module can send data to its next data processing module during data acquisition. The resource manager may be, for example, Yarn or Mesos.
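The lookup that each node device performs on the flow description information might look like the following Python sketch. The application does not define a format for the flow description information, so the dictionary layout, the module identifiers, and the function name `next_modules_for_node` are all assumptions made for illustration.

```python
# Hypothetical flow description: module identifiers plus the connection
# relationship (each module's next module). None marks the last module,
# whose output goes to the first storage source.
FLOW_DESCRIPTION = {
    "modules": ["fetch", "filter", "extract"],
    "next": {"fetch": "filter", "filter": "extract", "extract": None},
}


def next_modules_for_node(flow, local_module_ids):
    """Map each module hosted on this node device to its next module."""
    unknown = [m for m in local_module_ids if m not in flow["modules"]]
    if unknown:
        raise KeyError(f"modules not present in flow description: {unknown}")
    return {module_id: flow["next"][module_id] for module_id in local_module_ids}
```

A node device hosting `fetch` and `extract` would learn that `fetch` sends to `filter`, and that `extract` is the last module in the flow.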
Design 2 (stand-alone acquisition): referring to FIG. 4, the multiple data processing modules of the data acquisition system are located in the same node device, and this node device performs the multiple data processing operations in sequence based on the multiple data processing modules. Each data processing module may be a process, a thread, or a method in the node device.
In summary, the data acquisition system provided in the embodiments of the present application supports a distributed acquisition architecture and/or a stand-alone acquisition architecture, can be deployed on one or more node devices, and allows the data processing modules to be extended as actual business requirements dictate, which makes it highly flexible and scalable.
Functions of the data acquisition system: the data acquisition system takes data from the data source, has the multiple data processing modules perform multiple data processing operations on it in sequence, and has the last data processing module store the processed data in the first storage source. Each data processing module performs one type of data processing operation, so different data processing operations are performed on the data by different data processing modules.
For example, the data processing operations performed by the data acquisition system may include cleaning, conversion, filtering, transformation, statistics, and detection. Taking a practical scenario in which the number of reposts of a blogger's latest microblog post on a website is to be counted, the data source may be the website. The first data processing module pulls the microblog page from the website, converts the page into structured data, and sends it to the second data processing module. The second data processing module selects all data of the latest microblog post from the structured data and sends it to the third data processing module. The third data processing module extracts the repost count from the data of the latest microblog post and stores the repost count in the first storage source.
As for the flow of data through the data acquisition system, the first data processing module obtains data from the data source, processes it, and sends the processed data to the second data processing module. After receiving the data processed by the first data processing module, the second data processing module processes it and sends the result to the third data processing module, and so on. After receiving the data processed by the second-to-last data processing module, the last data processing module processes it and stores the result in the first storage source. In this way, data that has undergone the multiple data processing operations is stored in the first storage source.
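The repost-count example can be sketched as three chained stages in Python. The page format (one `date|text|reposts` record per line), the sample data, and all function names are invented for illustration; a real first module would fetch and parse an actual web page, and the list `storage` merely stands in for the first storage source.

```python
# Made-up stand-in for a fetched microblog page: one "date|text|reposts" record per line.
RAW_PAGE = "2019-05-01|hello|12\n2019-05-16|latest post|345\n2019-05-09|middle|7"


def parse_page(raw_page):
    """Module 1: convert the raw page into structured records."""
    posts = []
    for line in raw_page.splitlines():
        posted_on, text, reposts = line.split("|")
        posts.append({"posted_on": posted_on, "text": text, "reposts": int(reposts)})
    return posts


def select_latest(posts):
    """Module 2: keep only the latest post (ISO dates sort lexicographically)."""
    return max(posts, key=lambda post: post["posted_on"])


def extract_repost_count(post):
    """Module 3: extract the repost count from the selected post."""
    return post["reposts"]


def run_pipeline(raw_page, storage):
    """Pass the data through each module in order, then store the final result."""
    data = raw_page
    for module in (parse_page, select_latest, extract_repost_count):
        data = module(data)  # the output of one module is the input of the next
    storage.append(data)
    return storage
```

This sketch runs the stages one after another on a single batch; it shows only the division of work between modules, not the asynchronous behaviour described in the following paragraphs.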
In this embodiment, the data acquisition system achieves a fully asynchronous data acquisition flow through the following two points, which greatly improves the efficiency of data acquisition:
First, the first data processing module in the data acquisition system is configured to instruct the data source to provide the next batch of data when it obtains any batch of data from the data source.
In the related art, the processing logic of the data acquisition system is that the next batch of data is processed only after the current batch has been successfully stored. Under this logic, the last thread is responsible for giving feedback to the data source. After the data source provides a batch of data to the first thread, the batch must pass through each thread in turn, and only when the last thread has finished processing and has stored the processed data in the storage source does it notify the data source to provide the next batch. Between the moment the first thread obtains the data and the moment the last thread gives feedback, the data source provides no further data. Even a thread that is currently idle has no data to process and can only wait for the other threads to finish, which causes low efficiency and wastes processing resources.
In the data acquisition system provided in the embodiments of the present application, both the processing logic and the party that gives feedback to the data source are improved. The processing logic is that as soon as any batch of data has entered the data acquisition system, the data source is notified to provide the next batch, without waiting for the current batch to be processed step by step inside the system. Under this logic, the first data processing module is responsible for giving feedback to the data source. After the data source provides a batch of data to the first data processing module, the first data processing module instructs the data source to provide the next batch as soon as it obtains the current batch, and the data source can then continue to provide the next batch to the first data processing module. The data acquisition system can thus process multiple batches asynchronously, and no data processing module has to wait for the others.
As for how the first data processing module instructs the data source to provide the next batch of data, the first data processing module may send a confirmation message to the data source. The confirmation message instructs the data source to provide the next batch of data, and the data source provides the next batch after receiving it. The format of the confirmation message can be determined according to the communication protocol between the data source and the first data processing module. In addition, the amount of data in any batch provided by the data source can be determined according to the actual business scenario, which is not limited in the embodiments of the present application.
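The confirm-on-receipt behaviour of the first data processing module can be illustrated with a toy Python data source that releases one batch per confirmation message. The class `BatchSource`, its `poll` and `acknowledge` methods, and the function `ingest_all` are hypothetical; a real data source would define its own protocol for confirmation messages.

```python
class BatchSource:
    """Toy data source that releases one further batch per confirmation message."""

    def __init__(self, batches):
        self._batches = list(batches)
        self._released = 1  # the first batch is available without confirmation
        self._cursor = 0

    def poll(self):
        """Return the next released batch, or None if nothing is available."""
        if self._cursor < min(self._released, len(self._batches)):
            batch = self._batches[self._cursor]
            self._cursor += 1
            return batch
        return None

    def acknowledge(self):
        """Confirmation message: the source may now provide the next batch."""
        self._released += 1


def ingest_all(source):
    """First module: confirm each batch on receipt, before any processing."""
    received = []
    while True:
        batch = source.poll()
        if batch is None:
            return received
        source.acknowledge()    # sent immediately, not after the batch is stored
        received.append(batch)  # downstream processing would start from here
```

If the first module never calls `acknowledge`, the source in this sketch stops after the first batch, which mirrors the stall described for the related art.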
For example, the first data processing module may obtain data from the data source according to either of the following two designs:
Design 1 (active pull): the first data processing module pulls (fetches) data from the data source. Specifically, the first data processing module may store the identifier of the data source in advance, access the data source based on the identifier, and pull data from it. The identifier of the data source uniquely identifies the data source and may be the network address, name, index number, or the like of the data source.
Taking a message queue as the data source, the first data processing module may store the name and network address of the message queue and pull data from the message queue based on them. A message queue is a virtual container that holds messages while they are in transit. The message queue may be Kafka, ActiveMQ, RabbitMQ, ZeroMQ, MetaMQ, or the like.
Design 2 (passive reception): the data source actively sends data to the first data processing module, and the first data processing module receives it, thereby obtaining the data of the data source.
Taking a socket port as the data source, the data source may call the socket port to send socket data to the first data processing module, and the first data processing module may listen on the socket port to receive the socket data.
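Passive reception over a socket can be sketched in Python with a connected socket pair standing in for the data source and the first data processing module. Using `socket.socketpair()` instead of a listening port is a simplification chosen so that the example is self-contained, and the function name `push_and_receive` is illustrative.

```python
import socket


def push_and_receive(payload):
    """Send one small batch from the 'data source' end and read it at the 'module' end."""
    source_end, module_end = socket.socketpair()
    try:
        # The data source actively pushes the batch and then signals end of data.
        source_end.sendall(payload)
        source_end.shutdown(socket.SHUT_WR)

        # The first data processing module reads until the sender has finished.
        chunks = []
        while True:
            chunk = module_end.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        source_end.close()
        module_end.close()
```

Because both ends run in one thread here, the sketch is only suitable for payloads small enough to fit in the socket buffer; in a deployment, the module would listen on its port in its own thread or process.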
Second, any data processing module in the data acquisition system is further configured to receive the next batch of data while performing the corresponding data processing operation on any batch of data it has already received.
In the related art, each thread in the data acquisition system processes data synchronously. A thread first receives a batch of data and then processes it, and during this period the thread cannot continue to receive data. Only after the batch has been processed and sent to the next thread can the thread receive and process the next batch. In other words, the thread alternates between receiving data and processing data and can perform only one of the two operations at any moment.
In the data acquisition system provided in the embodiments of the present application, any data processing module processes data asynchronously: it can receive data and process data at the same time. While performing the data processing operation on any batch of data, the data processing module can receive the next batch without waiting for the current batch to finish. With asynchronous processing, once a batch has been processed, the data processing module automatically processes the next batch it has already received, so the batch currently being processed does not block the inflow of the next batch, and the data processing module stays busy as much as possible. In addition, different data processing modules in the data acquisition system can process different batches. For example, while the third data processing module is still processing the first batch, the first data processing module may already have started processing the second batch. This improves the overall efficiency of the data acquisition system.
Optionally, so that received data can be buffered, any data processing module in the data acquisition system may have a corresponding memory space that is used to buffer received data. Any data processing module in the data acquisition system is further configured to buffer the next batch of data it receives into its memory space while performing the corresponding data processing operation on any batch of data it has already received. After the current batch has been processed, the data processing module simply reads the next batch from the memory space, which greatly improves data processing efficiency.
For example, referring to FIG. 5, in the architecture of the data acquisition system, a memory space may be placed in front of each data processing module. Any batch of data first enters the memory space of the data processing module, is buffered there, and then passes from the memory space into the data processing module. When data enters the memory space of a data processing module, if the module is currently idle, it can read the data from the memory space and start processing immediately; if the module is still processing earlier data, it reads the data from the memory space and starts processing after the earlier data has been processed.
It should be noted that the above description uses only two batches, the current batch and the next batch, as an example. In practice, while performing the corresponding data processing operation on any batch of data it has received, a data processing module may receive multiple subsequent batches, and after processing the current batch it may perform the data processing operation on the received batches in sequence. For example, if the data processing module receives the second batch while processing the first batch, it buffers the second batch in the memory space; if it then receives the third batch, it buffers the third batch in the memory space as well. After the first batch has been processed, the module processes the second batch and then the third batch, and so on.
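The buffering behaviour described above can be illustrated in Python with a queue that plays the role of the module's memory space and a worker thread that plays the role of the data processing operation. The function name `run_module` and the use of `None` as an end-of-input marker are assumptions of this sketch.

```python
import queue
import threading


def run_module(batches, process):
    """Receive batches into a buffer while a worker thread processes them in order."""
    memory_space = queue.Queue()  # stands in for the module's memory space
    results = []

    def worker():
        while True:
            batch = memory_space.get()
            if batch is None:  # end-of-input marker used only by this sketch
                break
            results.append(process(batch))

    thread = threading.Thread(target=worker)
    thread.start()
    for batch in batches:
        # Receiving never waits for the batch currently being processed.
        memory_space.put(batch)
    memory_space.put(None)
    thread.join()
    return results
```

Because the queue is first-in, first-out and a single worker drains it, the second and third batches are processed in arrival order once the first is done, as in the example in the text.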
In combination with the memory space design, after any data processing module in the data acquisition system performs its data processing operation, it can push the processed data into the memory space of the next data processing module. Exemplarily, any data processing module may store the port number of the communication port of the next data processing module and send the processed data to that communication port based on the port number. The next data processing module has bound this communication port in advance, so the data processed by the sending module is automatically cached in the memory space corresponding to the communication port. It should be noted that the memory space described in this paragraph may be on-heap memory. In a JVM (Java Virtual Machine) environment, on-heap memory is memory managed by the virtual machine of the node device and has advantages such as being easy to implement. Any data processing module may push the processed data into the on-heap memory of the next data processing module.
Optionally, in combination with the distributed acquisition approach described above, the memory space of a data processing module may also include off-heap memory. Off-heap memory is memory managed by the operating system of the node device and is typically used to store data that the local machine is to send to a remote end.
Based on the principles of on-heap and off-heap memory, a data processing module can cache received data in on-heap memory and cache, in off-heap memory, data that is to be sent to data processing modules on other node devices.
Specifically, after any data processing module processes data, if that module and the next data processing module are located on the same node device, the module can push the processed data directly into the on-heap memory of the next data processing module. If the two modules are located on different node devices, the module can serialize the processed data, store the serialized data in its own off-heap memory, and then push the serialized data from the off-heap memory to the on-heap memory of the next data processing module. Here, serialization is the process of converting the format of data so that the converted data can be transmitted over a network. Optionally, the module may use zero-copy technology to perform the step of pushing the serialized data from the off-heap memory to the on-heap memory of the next data processing module. In addition, when a data processing module receives data directly in memory space of the operating system, the memory space does not need to be divided into on-heap space and off-heap space.
Exemplarily, referring to FIG. 6, let a data processing module be denoted by p. After data processing module 1 (p1) finishes processing data, if p1 is to send the processed data to data processing module 2 (p2), which is located on the same node device as p1, p1 simply pushes the processed data directly into the on-heap memory of p2. If p1 is to send the processed data to data processing module 3 (p3), which is located on another node device, p1 first serializes the processed data, then stores the serialized data in the off-heap memory of p1, and then pushes the serialized data from the off-heap memory of p1 to the on-heap memory of p3.
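The routing decision for p1, p2 and p3 can be sketched as follows. This is only an illustration under stated assumptions: Python has no on-heap/off-heap distinction, so a list stands in for on-heap memory and a `bytearray` stands in for the off-heap staging area; the `Module` class and `push` function are hypothetical; `pickle` is used as one possible serialization format; the network transfer is replaced by a local hand-over; and deserializing on arrival is an assumption, since the text does not say when the receiver decodes the data.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class Module:
    name: str
    node: str                                                # node device hosting the module
    on_heap: list = field(default_factory=list)              # stand-in for on-heap receive buffer
    off_heap: bytearray = field(default_factory=bytearray)   # stand-in for off-heap staging area


def push(sender, receiver, processed):
    """Push processed data to the next module; return which path was taken."""
    if sender.node == receiver.node:
        # Same node device: hand the object over directly, no serialization.
        receiver.on_heap.append(processed)
        return "local"
    # Different node devices: serialize, stage in the sender's off-heap area,
    # then transfer the bytes to the receiver.
    sender.off_heap[:] = pickle.dumps(processed)
    receiver.on_heap.append(pickle.loads(bytes(sender.off_heap)))
    sender.off_heap.clear()
    return "remote"


p1 = Module("p1", node="node-A")
p2 = Module("p2", node="node-A")
p3 = Module("p3", node="node-B")
print(push(p1, p2, {"id": 1}))   # same node device as p1
print(push(p1, p3, {"id": 2}))   # different node device
```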
Optionally, in order to allocate memory space to each data processing module, referring to FIG. 7, the data acquisition system may further include at least one shared memory pool. Each shared memory pool contains a large amount of memory space and provides memory space for its corresponding data processing modules. For example, any node device in the data acquisition system may deploy one shared memory pool that provides memory space for all data processing modules on that node device. Assuming the data acquisition system is deployed on M node devices (M being a positive integer not less than 1), the data acquisition system includes M groups of data processing modules and, correspondingly, M shared memory pools. The memory space in a shared memory pool may include on-heap memory and off-heap memory. A shared memory pool may include multiple memory pages, each memory page includes multiple memory segments, and one or more memory segments of a memory page may serve as the memory space allocated to a data processing module each time. In implementation, when the data acquisition system starts, the operating system of each node device may request a contiguous memory space of a certain size to serve as the created shared memory pool.
Regarding the specific process by which a group of data processing modules shares memory space: before data acquisition, any node device may take a memory space of a certain size from the shared memory pool in advance and distribute it evenly among the data processing modules. During data acquisition, any data processing module can store received data in its pre-allocated memory space. Further, as data keeps flowing into the memory space of a data processing module, if that memory space becomes insufficient, the module can request additional memory space from the shared memory pool and store received data in the newly requested space, thereby expanding its own memory space.
Optionally, in combination with the shared memory pool, this embodiment can support memory reuse. After any data processing module requests memory space from the shared memory pool, it can cache received data in the requested memory space and then read the data from that space, thereby using the requested memory space. Once the requested memory space is no longer in use, the module can release it back to the shared memory pool. The shared memory pool can then reallocate that memory space to other data processing modules, or allocate it to the same module again the next time that module runs short of memory. This achieves memory reuse, improves memory utilization, and saves memory resources. When the memory space in the shared memory pool is managed in units of memory pages, each data processing module can, as far as possible, use up one page before moving on to the next, so that memory pages can be released contiguously, which avoids memory fragmentation and wasted memory space.
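The allocate, expand, release and reuse cycle of a shared memory pool might look like the sketch below. The class `SharedMemoryPool` and its methods are hypothetical, segments are represented only by integer ids rather than real memory, and handing out the lowest free ids first is an assumption chosen to keep allocations contiguous.

```python
class SharedMemoryPool:
    """Hypothetical per-node pool that hands out fixed-size memory segments."""

    def __init__(self, total_segments):
        self.free = list(range(total_segments))   # ids of free segments, lowest first
        self.owner = {}                           # segment id -> owning module name

    def allocate(self, module, count=1):
        if count > len(self.free):
            raise MemoryError("shared memory pool exhausted")
        granted = [self.free.pop(0) for _ in range(count)]
        for segment in granted:
            self.owner[segment] = module
        return granted

    def release(self, segments):
        for segment in segments:
            del self.owner[segment]
            self.free.append(segment)
        self.free.sort()   # keep low ids first so later allocations stay contiguous

    def used(self):
        return len(self.owner)


pool = SharedMemoryPool(total_segments=8)
# Even pre-allocation before acquisition starts.
initial = {name: pool.allocate(name, 2) for name in ("p1", "p2", "p3")}
extra = pool.allocate("p1", 2)    # p1 runs short and requests more space
pool.release(extra)               # p1 has finished with the extra space
reused = pool.allocate("p2", 2)   # the same segments are handed to p2
print(initial, extra, reused)
```

In this run the segments released by p1 are the ones later granted to p2, which is the reuse effect described above.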
In summary, the embodiments of the present application provide a fully asynchronous data acquisition system. The first data processing module of the data acquisition system instructs the data source to provide the next batch of data, and each data processing module can receive data and process data at the same time. Different data processing modules in the data acquisition system therefore process different data asynchronously, so the system can process multiple batches of data at once. This avoids the situation in which a data processing module must wait for other data processing modules to finish their data processing operations before it can start processing data, which improves the efficiency of data acquisition and saves acquisition time.
Optionally, referring to FIG. 8, the data acquisition system further includes a second storage source, which is connected to each data processing module in the data acquisition system. The physical form of the second storage source may be similar to that of the first storage source; specifically, the second storage source may be a distributed storage source. It should be noted that the second storage source and the first storage source may be different storage sources, for example deployed on different devices. They may also be the same storage source; the embodiments of the present application do not limit this.
Function of the second storage source: the second storage source stores snapshot units of the data currently being processed by the data acquisition system. A snapshot unit (snapshot) is an image of the data currently being processed and can be regarded as a backup archive of the data as it flows through the data acquisition system. A snapshot unit indicates the data processing module that is currently processing the corresponding data. For example, a snapshot unit may carry the identifier of the data processing module currently processing the corresponding data, and that identifier indicates the module. The identifier of a data processing module uniquely determines the module and may be the module's name, number, address, or the like.
In this embodiment, the function of storing snapshot units of the currently processed data in the second storage source may be implemented through the following (1) to (3):
(1) The first data processing module is further configured to insert barrier messages between different data of the data source, treat all data between two adjacent barrier messages as one snapshot unit, and store the snapshot unit in the second storage source.
A barrier message indicates the start point or end point of a snapshot unit. By inserting barrier messages, the first data processing module can divide the data of the data source into different snapshot units. Barrier messages do not affect the normal processing of data by the data processing modules; a data processing module can automatically skip a barrier message when it encounters one while processing data.
Regarding the specific process of inserting barrier messages: after receiving any batch of data from the data source, the first data processing module may insert one barrier message into the data every preset number of records. Alternatively, it may start a timer each time a barrier message is inserted and insert another barrier message once the elapsed time exceeds a preset duration. The specific values of the preset number and/or the preset duration may be determined according to actual business requirements and may be set in advance by a developer.
Regarding the specific process of storing snapshot units in the second storage source: after inserting barrier messages, the first data processing module may treat all data between two adjacent barrier messages as one snapshot unit. That is, for two adjacent barrier messages, the earlier one serves as the start point of the snapshot unit and the later one as its end point, and the snapshot unit consists of the earlier barrier message, all data between the two barrier messages, and the later barrier message. The first data processing module may then add its own identifier to the snapshot unit and store the snapshot unit, with the identifier added, in the second storage source.
Exemplarily, referring to FIG. 9, let the data provided by the data source be denoted mi and the barrier messages be denoted bi. If the data provided by the data source is m1, m2, m3, ..., m100 and the first data processing module inserts one barrier message bi every 4 records, then after insertion the data becomes b1, m1, m2, m3, m4, b2, m5, m6, m7, m8, b3, ..., m100. The first snapshot unit is b1, m1, m2, m3, m4, b2; the second snapshot unit is b2, m5, m6, m7, m8, b3; and so on.
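The count-based barrier insertion and the grouping into snapshot units can be sketched in Python as below. The names `Barrier`, `insert_barriers` and `split_snapshot_units` are hypothetical. The sketch follows the definition in which a snapshot unit consists of the earlier barrier message, the data in between, and the later barrier message, and it uses eight records instead of one hundred to keep the output short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Barrier:
    """Hypothetical barrier message; `seq` plays the role of i in bi."""
    seq: int


def insert_barriers(messages, every):
    """Start with a barrier, then insert one after each `every` data records."""
    stream = [Barrier(1)]
    next_seq = 2
    for count, message in enumerate(messages, start=1):
        stream.append(message)
        if count % every == 0:
            stream.append(Barrier(next_seq))
            next_seq += 1
    return stream


def split_snapshot_units(stream):
    """Group the stream into units bounded by two adjacent barriers."""
    units, current = [], None
    for item in stream:
        if isinstance(item, Barrier):
            if current is not None:
                current.append(item)   # this barrier closes the open unit
                units.append(current)
            current = [item]           # and also opens the next unit
        elif current is not None:
            current.append(item)
    return units                       # a trailing, unclosed unit is not emitted


messages = [f"m{i}" for i in range(1, 9)]
stream = insert_barriers(messages, every=4)
for unit in split_snapshot_units(stream):
    print(unit)
```

Each interior barrier appears in two units, once as the end point of one unit and once as the start point of the next, as in the example with b2.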
(2) Any data processing module in the data acquisition system is further configured to, while processing any data, add the identifier of that data processing module to the snapshot unit in the second storage source to which the data belongs.
As a snapshot unit flows between the different data processing modules of the data acquisition system, those modules process the snapshot unit one after another. While any data processing module is processing any data, the module can determine the snapshot unit to which the data belongs and add its own identifier to that snapshot unit in the second storage source, so that the identifier marks the snapshot unit as currently being processed in that module.
Specifically, any data processing module may add its identifier by dynamically modifying the snapshot unit. Before a data processing module processes the data in a snapshot unit, the snapshot unit carries the identifier of the previous data processing module; when the module processes the data in the snapshot unit, it changes the identifier of the previous module in the snapshot unit to its own identifier. In addition, if a data processing module modifies part of the data in any snapshot unit while processing it, the module can replace that part of the data in the corresponding snapshot unit in the second storage source with the modified data, thereby refreshing the snapshot unit. If the data processing module does not modify the received data, the snapshot unit does not need to be modified.
(3) After storing any data in the first storage source, the last data processing module in the data acquisition system deletes the snapshot unit to which the data belongs from the second storage source.
For any batch of data, once the batch has been processed by the multiple data processing modules and stored in the first storage source by the last data processing module, the batch can be considered to have been fully processed and stored. Even if the data acquisition system goes down at this point, the processed batch can still be obtained from the first storage source at any time after a restart, so the second storage source no longer needs to keep the snapshot unit of this batch. Therefore, after the last data processing module stores any batch of processed data in the first storage source, it can determine the snapshot unit in the second storage source to which the processed data belongs and delete that snapshot unit from the second storage source, which saves storage space in the second storage source.
Combining (1) to (3) above, this embodiment implements a snapshot mechanism. By marking the data processing module in which each piece of data flowing through the data acquisition system is currently located, the mechanism supports resuming from a breakpoint. While the data acquisition system is running, any data processing module may go down because of network anomalies, device failures, or other factors. When a data processing module goes down and is restarted, it can resume processing the data it was handling before the outage based on the snapshot units in the second storage source, without starting over and reprocessing from the first batch of data.
Regarding the process of resuming processing of the data handled before the outage: any data processing module may search the second storage source based on its own identifier and obtain from the second storage source at least one snapshot unit to which its identifier has been added. The data in the at least one snapshot unit is the data the module was processing before the outage. The module can obtain the at least one snapshot unit from the second storage source and continue processing its data.
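The lifecycle of a snapshot unit in (1) to (3), together with the lookup used for recovery, can be sketched with an in-memory stand-in for the second storage source. The class `SnapshotStore`, its method names, and the use of an integer unit id are all hypothetical; a real second storage source would be a persistent, possibly distributed store.

```python
class SnapshotStore:
    """In-memory stand-in for the second storage source (hypothetical API)."""

    def __init__(self):
        self.units = {}   # unit id -> {"module": identifier, "data": list}

    def save(self, unit_id, module_id, data):
        # (1) The first module stores a new snapshot unit tagged with its identifier.
        self.units[unit_id] = {"module": module_id, "data": list(data)}

    def mark(self, unit_id, module_id, data=None):
        # (2) A later module replaces the previous identifier with its own and,
        # if it modified the data, refreshes the stored copy.
        unit = self.units[unit_id]
        unit["module"] = module_id
        if data is not None:
            unit["data"] = list(data)

    def delete(self, unit_id):
        # (3) The last module removes the unit once the data is in the first storage source.
        self.units.pop(unit_id, None)

    def units_for(self, module_id):
        # Recovery: find every unit that carries this module's identifier.
        return {uid: unit["data"] for uid, unit in self.units.items()
                if unit["module"] == module_id}


store = SnapshotStore()
store.save(1, "p1", ["m1", "m2"])
store.save(2, "p1", ["m3", "m4"])
store.mark(1, "p2", data=["M1", "M2"])   # p2 takes over unit 1 and modifies its data
store.mark(1, "p3")                       # p3 takes over unit 1 without modifying it
store.delete(1)                           # p3 is the last module and has stored the data
print(store.units_for("p1"))              # what p1 would resume after a restart
```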
Optionally, in implementation, in scenarios such as peak business seasons, passenger flow peaks, and promotional events, the amount of data provided by the data source grows rapidly and reaches a maximum, that is, a traffic peak occurs. To cope with traffic peaks, this embodiment provides the following two designs, which keep the data acquisition system stable when it faces a traffic peak.
Design 1 (backpressure mechanism): when facing a traffic peak, if memory space is already insufficient, the data acquisition system can enter a backpressure state, that is, reduce the speed at which it obtains data from the data source, so that the amount of data provided by the data source does not exceed the load capacity of the system.
Optionally, the backpressure mechanism may be implemented through the following (1) and (2):
(1) When the node device on which the first data processing module is located meets a preset condition, the first data processing module reduces the speed at which it obtains data from the data source.
During a traffic peak, data from the data source keeps flowing into the memory space of the first data processing module, so the first data processing module occupies more and more memory space. While acquiring data, the first data processing module can check in real time whether its node device meets the preset condition. When the node device meets the preset condition, the module learns that the memory space of the node device is insufficient and reduces the speed at which it obtains data from the data source. This prevents excessive consumption of the node device's memory space and thus avoids degrading the performance of the node device.
The preset condition may include any combination of the following condition 1 and condition 2:
Condition 1: the memory currently occupied in the shared memory pool of the node device reaches a first preset threshold.
The shared memory pool of a node device can be regarded as the total memory space that the node device reserves for all of its internal data processing modules. When too much of the shared memory pool is occupied, this indicates that the amount of data currently cached by the data processing modules on the node device is already too large and exceeds their data processing capacity.
Therefore, the first data processing module can check in real time whether the memory currently occupied in the corresponding shared memory pool reaches the first preset threshold. When it does, the module determines that the node device meets the preset condition and reduces the speed at which it obtains data from the data source. Subsequently, as each data processing module finishes processing data and releases memory space back to the shared memory pool, the available memory in the pool gradually recovers and the occupied memory falls back below the first preset threshold. The shared memory pool can thus keep sufficient memory available and ensure that each data processing module on the node device runs normally. The first preset threshold may be determined according to business requirements and may be set by a developer.
Condition 2: the total number of threads of the data processing modules on the node device that are currently requesting memory reaches a second preset threshold.
Multiple threads may be active within any data processing module, and each thread can request memory from the shared memory pool to store received data. If too many threads are requesting memory, this indicates that the amount of data currently cached by the data processing modules on the node device is already too large and exceeds their data processing capacity.
Therefore, the first data processing module can check in real time whether the total number of threads of the data processing modules on the node device that are currently requesting memory reaches the second preset threshold. If it does, the module determines that the node device meets the preset condition and reduces the speed at which it obtains data from the data source. Exemplarily, assuming the node device on which the first data processing module p1 is located includes three data processing modules p1, p2, and p3, p1 can check whether the total number of memory-requesting threads of p1, p2, and p3 reaches the second preset threshold. The second preset threshold may be determined according to business requirements and may be set by a developer.
Regarding the specific process of reducing the speed of obtaining data from the data source: when the first data processing module obtains data from the data source by actively pulling it, the module can divide the original pull speed by a preset multiple to obtain a reduced pull speed and pull data at the reduced speed, thereby slowing down data pulling. For example, if the first data processing module originally pulls data once every 1 s, after the pull speed is reduced it may pull data once every 5 s. When the first data processing module obtains data from the data source by passively receiving it, the module can notify the data source to reduce its sending speed, and the data source can reduce its sending speed after being notified, so that the first data processing module obtains data more slowly. The first data processing module may send a notification message to the data source, and the notification message informs the data source to reduce the speed at which it sends data.
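The check in (1) and the slowdown for the active-pull case can be sketched as follows. `NodeStatus`, `meets_preset_condition` and `throttled_interval` are hypothetical names, the threshold values are made up for illustration, and treating either condition as sufficient is just one of the combinations the text allows.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    pool_used: int            # memory currently occupied in the shared memory pool
    requesting_threads: int   # threads currently requesting memory on this node device


def meets_preset_condition(status, pool_threshold, thread_threshold):
    """Assumed combination: condition 1 OR condition 2 triggers backpressure."""
    condition_1 = status.pool_used >= pool_threshold
    condition_2 = status.requesting_threads >= thread_threshold
    return condition_1 or condition_2


def throttled_interval(original_interval, multiple):
    """Dividing the pull speed by `multiple` is the same as multiplying the interval."""
    return original_interval * multiple


status = NodeStatus(pool_used=900, requesting_threads=3)
interval = 1.0   # seconds between pulls in the normal state
if meets_preset_condition(status, pool_threshold=800, thread_threshold=16):
    interval = throttled_interval(interval, multiple=5)
print(interval)
```

With a preset multiple of 5, a pull interval of 1 s becomes 5 s, matching the example above.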
(2) Any data processing module in the data acquisition system other than the first data processing module is further configured to, when its node device meets the preset condition, instruct the first data processing module to reduce the speed at which it obtains data from the data source.
Given the direction of data flow in the data acquisition system, the first data processing module can be thought of as the tap that controls the data flow rate of the whole system. The first data processing module can be regarded as the source of data for every subsequent data processing module, so the speed at which it obtains data directly affects the speed at which each subsequent data processing module obtains data.
Therefore, any data processing module other than the first data processing module can check whether its node device meets the preset condition and, when it does, instruct the first data processing module to reduce the speed at which it obtains data from the data source. After receiving the instruction, the first data processing module reduces the speed at which it obtains data from the data source, and because the first data processing module obtains data more slowly, each subsequent data processing module also obtains data more slowly as a knock-on effect. As for how a data processing module instructs the first data processing module to slow down, the module may send a notification message to the first data processing module indicating that the speed of obtaining data from the data source should be reduced; upon receiving the notification message, the first data processing module determines that it should reduce that speed. In addition, the specific process by which any data processing module other than the first data processing module checks whether its node device meets the preset condition is analogous to that described in (1) above and is not repeated here.
It should be noted that, on the basis of the backpressure mechanism, this embodiment also provides a mechanism for the data acquisition system to return from the backpressure state to its original data acquisition speed. Corresponding to (1) and (2) of the backpressure mechanism above, the mechanism for recovering from the backpressure state may include the following (a) and (b).
(a) After the first data processing module has reduced the speed at which it obtains data from the data source, when the number of times that its node device fails to meet the preset condition reaches a preset number, the first data processing module restores the speed at which it obtains data from the data source.
本设计与反压机制的(1)对应。当第一个数据处理模块降低从数据源中获取数据的速度后,数据源的数据流入第一个数据处理模块的速度会减缓,则第一个数据处理模块的节点设备的内存空间会逐渐恢复充足。在此过程中,第一个数据处理模块可以再次检测节点设备是否满足预设条件,当节点设备满足预设条件时,累计节点设备满足预设条件的次数。当节点设备满足预设条件的次数尚未达到预设次数时,认为节点设备的内存空间尚未足够,此时如果恢复之前从数据源获取数据的速度,可能导致节点设备很快又要进入反压状态,导致节点设备的状态不断波动,因此暂时不恢复原来从数据源中获取数据的速度,而当节点设备满足预设条件的次数达到预设次数时,认为节点设备的内存空间已经足够,则恢复进入反压状态之前从数据源获取数据的速度。其中,该预设次数可以根据实际业务需求确定,可以为2次。This design corresponds to (1) of the back pressure mechanism. When the first data processing module reduces the speed of obtaining data from the data source, the data source data flow into the first data processing module will slow down, and the memory space of the node device of the first data processing module will gradually recover. sufficient. In this process, the first data processing module can detect whether the node device meets the preset condition again, and when the node device meets the preset condition, accumulate the number of times the node device meets the preset condition. When the number of times the node device meets the preset conditions has not reached the preset number of times, the memory space of the node device is considered insufficient. At this time, if the speed of obtaining data from the data source before recovery may cause the node device to enter the back pressure state soon. , Causing the state of the node device to fluctuate continuously, so the original speed of obtaining data from the data source is not restored for the time being, and when the node device meets the preset conditions the number of times reaches the preset number of times, it is considered that the memory space of the node device is sufficient, and then resume The speed at which data is obtained from the data source before entering the backpressure state. The preset number of times may be determined according to actual service requirements, and may be two times.
(b)数据采集系统中第一个数据处理模块以外的任一个数据处理模块指示第一个数据处理模块降低从数据源中获取数据的速度后,该任一个数据处理模块所在的节点设备不满足预设条件的次数达到预设次数时,该任一个数据处理模块向第一个数据处理模块指示第一个数据处理模块恢复从数据源中获取数据的速度。其中,任一个数据处理模块可以向第一个数据处理模块发送恢复消息,该恢复消息用于指示第一个数据处理模块恢复从数据源中获取数据的速度。(b) After any data processing module other than the first data processing module in the data acquisition system instructs the first data processing module to reduce the speed of obtaining data from the data source, the node device where the data processing module is located is not satisfied When the number of preset conditions reaches the preset number of times, any one of the data processing modules indicates to the first data processing module the speed at which the first data processing module resumes acquiring data from the data source. Any one of the data processing modules may send a recovery message to the first data processing module, and the recovery message is used to instruct the first data processing module to recover the speed of obtaining data from the data source.
本设计与反压机制的(2)对应,当第一个数据处理模块降低从数据源中获取数据的速度后,数据源的数据流入第一个数据处理模块的速度会减缓,则任一个数据处理模块受到第一个数据处理模块的影响,数据流入该任一个数据处理模块的速度会减缓,则该任一个数据处理模块所在的节点设备的内存空间会逐渐恢复充足。在此过程中,该任一个数据处理模块可以同理地检测任一个数据处理模块所在的节点设备不满足预设条件的次数是否达到预设次数,当任一个数据处理模块所在的节点设备不满足预设条件的次数达到预设次数时,该任一个数据处理模块会向第一个数据处理模块发送恢复消息,第一个数据处理模块接收到恢复消息后,会恢复进入反压状态前获取数据的速度,那么随着第一个数据处理模块获取数据的速度的加快,该任一个数据处理模块受到连锁反应的影响,获取数据的速度也会加快。This design corresponds to (2) of the back pressure mechanism. When the first data processing module reduces the speed of obtaining data from the data source, the speed at which data from the data source flows into the first data processing module will slow down. The processing module is affected by the first data processing module. The speed at which data flows into any one of the data processing modules will slow down, and the memory space of the node device where the data processing module is located will gradually recover to sufficient. In this process, any data processing module can similarly detect whether the number of times that the node device where any data processing module does not meet the preset conditions reaches the preset number of times. When the node device where any data processing module is located does not meet When the number of preset conditions reaches the preset number of times, any one of the data processing modules will send a recovery message to the first data processing module. After receiving the recovery message, the first data processing module will recover data before entering the back pressure state. Speed, then as the speed of obtaining data by the first data processing module is accelerated, any one of the data processing modules is affected by the chain reaction, and the speed of obtaining data will also be accelerated.
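The recovery rule in (a) and (b) can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the implementation described in the application: the class name `BackpressureController`, the two rate values, and the choice to reset the counter whenever the preset condition is satisfied again are all hypothetical.

```python
class BackpressureController:
    """Tracks the backpressure state that governs the first module's fetch rate."""

    def __init__(self, normal_rate, reduced_rate, recovery_count=2):
        self.normal_rate = normal_rate        # rate before entering backpressure
        self.reduced_rate = reduced_rate      # rate while in backpressure
        self.recovery_count = recovery_count  # the "preset number" (2 in the text)
        self.in_backpressure = False
        self.clear_checks = 0                 # checks where the condition was NOT met

    @property
    def rate(self):
        return self.reduced_rate if self.in_backpressure else self.normal_rate

    def on_check(self, condition_met):
        """Feed one check result and return the fetch rate to use next."""
        if condition_met:
            # Memory pressure detected: enter (or stay in) backpressure.
            self.in_backpressure = True
            self.clear_checks = 0  # assumption: new pressure restarts the count
        elif self.in_backpressure:
            self.clear_checks += 1
            if self.clear_checks >= self.recovery_count:
                # Enough clear checks: restore the original rate.
                self.in_backpressure = False
                self.clear_checks = 0
        return self.rate


controller = BackpressureController(normal_rate=1000, reduced_rate=100)
rates = [controller.on_check(met) for met in (False, True, False, False)]
```

For case (b), the same counter could run inside a downstream module, with the transitions into and out of backpressure emitted as the notification message and the recovery message instead of changing a local rate.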
Design 2 (increasing concurrency): In implementation, any data processing module in the data acquisition system may create at least one thread, and the processing logic of its data processing operation may be written into each thread. When the data processing module needs to process a batch of data, it may assign different data in the batch to different threads as needed and have the threads process the data concurrently, which improves processing efficiency. That is, any data processing module may perform the corresponding data processing operation on any batch of data based on at least one thread. In general, the more threads a data processing module has, the faster it processes data.
Accordingly, when any data processing module in the data acquisition system detects a traffic peak, it may increase its concurrency. Concurrency refers to the number of threads in the data processing module that process data; the higher the concurrency, the faster the data processing module processes data, so processing data with high concurrency relieves the pressure of handling large amounts of data.
To detect a traffic peak, any data processing module may measure its current data receiving speed and determine whether it has reached a preset speed threshold. If it has, data is flowing rapidly into the data processing module, and a traffic peak is determined to have occurred. Alternatively, the data processing module may measure its current data receiving speed and calculate the difference between it and the historical data receiving speed. If the difference has reached a preset difference threshold, the data volume has grown rapidly relative to earlier acquisition, and a traffic peak is determined to have occurred. The data processing module may also detect traffic peaks in other ways, which is not limited here.
In addition, after the data processing module has increased its concurrency, if it no longer detects a traffic peak, it may reduce the concurrency and restore it to its previous level.
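As an illustration of Design 2, the sketch below combines the two peak tests described above with a thread pool whose size depends on the result. The function names, the threshold parameters, and the use of the mean of past rates as the "historical receiving speed" are assumptions made for the example; the text does not fix any of them.

```python
from concurrent.futures import ThreadPoolExecutor


def is_traffic_peak(current_rate, history_rates,
                    rate_threshold=None, diff_threshold=None):
    """Return True if either peak test from the text fires."""
    # Test 1: the current receiving speed reaches a preset speed threshold.
    if rate_threshold is not None and current_rate >= rate_threshold:
        return True
    # Test 2: the gap to the historical speed reaches a preset difference threshold.
    if diff_threshold is not None and history_rates:
        baseline = sum(history_rates) / len(history_rates)  # assumed: mean of history
        return current_rate - baseline >= diff_threshold
    return False


def choose_concurrency(peak, base_workers, peak_workers):
    """Raise the thread count during a peak, fall back to the base count otherwise."""
    return peak_workers if peak else base_workers


def process_batch(batch, operation, workers):
    """Apply one data processing operation to a batch using `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(operation, batch))  # map() keeps input order


peak = is_traffic_peak(500, [100, 120], rate_threshold=400)
workers = choose_concurrency(peak, base_workers=2, peak_workers=8)
doubled = process_batch([1, 2, 3], lambda x: x * 2, workers)
```

Whether more threads actually speed up processing depends on the workload; in CPython, threads mainly help when the operation is I/O-bound.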
It should be noted that in implementation a data processing module may execute either Design 1 or Design 2. For example, after detecting a traffic peak it may determine whether the node device satisfies the preset condition, execute Design 2 when the node device does not satisfy the preset condition, and execute Design 1 when the node device satisfies the preset condition. A data processing module may also execute Design 1 and Design 2 at the same time.
Based on the system architecture of the data acquisition system and the functions of its data processing modules described in the embodiment of FIG. 1, the data acquisition method provided by the embodiments of the present application is described below.
FIG. 10 is a flowchart of a data acquisition method according to an embodiment of the present application. The method is applied to a data acquisition system and may be implemented through interaction among the data processing modules in the data acquisition system. Referring to FIG. 10, the method includes the following steps:
1001. When the first data processing module in the data acquisition system obtains any batch of data from the data source, the first data processing module instructs the data source to provide the next batch of data.
1002. While any data processing module in the data acquisition system performs the corresponding data processing operation on any batch of data it has received, that data processing module receives the next batch of data.
1003. The last data processing module in the data acquisition system stores the processed data in the first storage source.
The method provided in this embodiment is a fully asynchronous data acquisition method. The first data processing module of the data acquisition system instructs the data source to provide the next batch of data, and each data processing module can receive data and process data at the same time. Different data processing modules in the data acquisition system can therefore process different data asynchronously, which avoids a data processing module having to wait for other data processing modules to finish before it can start processing, improves the efficiency of data acquisition, and shortens data acquisition time.
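Steps 1001 to 1003 can be sketched as a chain of threads connected by queues, so that a module's inbound queue accepts the next batch while the module is still processing the current one. This is a minimal sketch under stated assumptions: `run_stage`, `run_pipeline`, and the `_STOP` sentinel are hypothetical names, the main thread stands in for the data source, and a plain list stands in for the first storage source.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the stream (hypothetical)


def run_stage(operation, inbox, outbox):
    """One data processing module: take a batch, process it, pass it on."""
    while True:
        batch = inbox.get()  # later batches queue up here while we work
        if batch is _STOP:
            outbox.put(_STOP)
            return
        outbox.put([operation(item) for item in batch])


def run_pipeline(source_batches, operations):
    """Run one thread per module and return what reaches the storage end."""
    queues = [queue.Queue() for _ in range(len(operations) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(op, queues[i], queues[i + 1]))
        for i, op in enumerate(operations)
    ]
    for t in threads:
        t.start()
    # The source hands over every batch without waiting for downstream modules.
    for batch in source_batches:
        queues[0].put(batch)
    queues[0].put(_STOP)
    storage = []  # stand-in for the first storage source
    while True:
        result = queues[-1].get()
        if result is _STOP:
            break
        storage.append(result)
    for t in threads:
        t.join()
    return storage


stored = run_pipeline([[1, 2], [3]], [lambda x: x + 1, lambda x: x * 10])
```

Because each stage is a single thread reading a FIFO queue, batch order is preserved from source to storage in this sketch.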
In a possible design, any data processing module in the data acquisition system has a corresponding memory space, and the memory space is used to buffer received data;
The step in which, while any data processing module performs the corresponding data processing operation on any batch of data it has received, that data processing module receives the next batch of data includes:
While the data processing module performs the corresponding data processing operation on any batch of data it has received, the data processing module buffers the next batch of received data in its memory space;
After the data processing module finishes processing that batch of data, the data processing module reads the next batch of data from the memory space.
In a possible design, the data acquisition system further includes at least one shared memory pool, and each shared memory pool is used to provide memory space for a corresponding plurality of data processing modules.
In a possible design, the step in which, while any data processing module performs the corresponding data processing operation on any batch of data it has received, that data processing module receives the next batch of data includes:
The data processing module requests memory space from the corresponding shared memory pool;
After the requested memory space is no longer in use, the data processing module releases it back to the shared memory pool.
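The request-and-release cycle against a shared memory pool might look like the following sketch. The class `SharedMemoryPool`, its byte-count accounting, and the choice to return `None` when the pool is exhausted are assumptions for illustration only; the text does not specify how a failed request is handled.

```python
import threading


class SharedMemoryPool:
    """A fixed-capacity pool shared by several data processing modules."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self._lock = threading.Lock()  # modules may request memory from different threads

    def acquire(self, size):
        """Return a buffer of `size` bytes, or None if the pool cannot supply it."""
        with self._lock:
            if self.used + size > self.capacity:
                return None
            self.used += size
            return bytearray(size)

    def release(self, buf):
        """Give a buffer's space back to the pool once the module is done with it."""
        with self._lock:
            self.used -= len(buf)


pool = SharedMemoryPool(capacity_bytes=100)
first = pool.acquire(60)       # space to buffer the next batch
rejected = pool.acquire(50)    # would exceed capacity, so None
pool.release(first)            # space returns to the pool
second = pool.acquire(50)      # now succeeds
```

The `used` counter is also the kind of figure a backpressure check could compare against a preset threshold.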
In a possible design, the memory space of any data processing module in the data acquisition system includes on-heap memory;
After the first data processing module instructs the data source to provide the next batch of data, the method further includes:
The data processing module pushes the processed data to the on-heap memory of the next data processing module.
In a possible design, the memory space of any data processing module in the data acquisition system further includes off-heap memory;
The step in which the data processing module pushes the processed data to the on-heap memory of the next data processing module includes:
When the data processing module and the next data processing module are located on the same node device, the data processing module pushes the processed data to the on-heap memory of the next data processing module; or,
When the data processing module and the next data processing module are located on different node devices, the data processing module serializes the processed data, stores the serialized data in its own off-heap memory, and pushes the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
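The two push paths can be illustrated as follows. Python has no on-heap/off-heap distinction, so the two lists below merely stand in for those memory regions; `Module`, `push`, and the use of `pickle` as the serialization format are assumptions made for the example, and the network transfer between nodes is reduced to a local hand-off.

```python
import pickle


class Module:
    """A data processing module with stand-ins for its two memory regions."""

    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.on_heap = []   # received batches, held as live objects
        self.off_heap = []  # serialized bytes staged for sending to another node


def push(sender, receiver, batch):
    """Deliver a processed batch to the next module; return the path taken."""
    if sender.node == receiver.node:
        # Same node device: hand over the object directly, no serialization.
        receiver.on_heap.append(batch)
        return "direct"
    # Different node devices: serialize, stage the bytes, then transfer them.
    sender.off_heap.append(pickle.dumps(batch))
    payload = sender.off_heap.pop()  # stand-in for the network send
    receiver.on_heap.append(pickle.loads(payload))
    return "serialized"


parse = Module("parse", node="node-1")
clean = Module("clean", node="node-1")
store = Module("store", node="node-2")
batch = [{"id": 1}]
same_node_path = push(parse, clean, batch)
cross_node_path = push(clean, store, batch)
```

Skipping serialization on the same-node path avoids the copy that the cross-node path cannot avoid.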
In a possible design, the data acquisition system further includes a second storage source. The second storage source is used to store snapshot units of the data currently being processed by the data acquisition system, and a snapshot unit indicates the data processing module that is currently processing the corresponding data;
The method further includes:
After any data processing module goes down and restarts, that data processing module resumes, based on the snapshot units in the second storage source, processing the data it was processing before it went down.
This design combines a snapshot mechanism with the fully asynchronous data acquisition system to support resuming from a breakpoint. After a data processing module goes down and restarts, it can resume processing the data it was handling before the failure based on the snapshots in the second storage source. This avoids having to restart data processing from the beginning after a failure and improves the robustness and reliability of the data acquisition system.
In a possible design, before the data processing module resumes, based on the snapshot units in the second storage source, processing the data it was processing before it went down, the method further includes:
The first data processing module inserts fence messages between different data of the data source;
The first data processing module treats all data between two adjacent fence messages as one snapshot unit;
The first data processing module stores the snapshot unit in the second storage source, where a fence message indicates the start point or end point of a snapshot unit.
In a possible design, before the data processing module resumes, based on the snapshot units in the second storage source, processing the data it was processing before it went down, the method further includes:
While processing any data, a data processing module adds its identifier to the snapshot unit in the second storage source to which the data belongs.
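The fence-message and snapshot-unit bookkeeping can be sketched as below. Everything here is an illustrative assumption: the `FENCE` sentinel, a fixed `unit_size`, the dictionary layout of a snapshot unit, and the rule that the most recently added module identifier names the module currently processing that unit.

```python
FENCE = object()  # hypothetical fence message marking snapshot-unit boundaries


def insert_fences(records, unit_size):
    """Return the records with a fence before, between, and after each group."""
    stream = [FENCE]
    for count, record in enumerate(records, start=1):
        stream.append(record)
        if count % unit_size == 0:
            stream.append(FENCE)
    if stream[-1] is not FENCE:
        stream.append(FENCE)  # close the final, shorter group
    return stream


def split_snapshot_units(stream):
    """Group all data between two adjacent fences into one snapshot unit."""
    units, current = [], None
    for item in stream:
        if item is FENCE:
            if current:
                units.append({"data": current, "modules": []})
            current = []
        elif current is not None:
            current.append(item)
    return units


def mark_processing(unit, module_id):
    """A module records its identifier on the unit it is processing."""
    unit["modules"].append(module_id)


def units_to_resume(units, module_id):
    """After a restart, find the units this module was handling last."""
    return [u for u in units if u["modules"] and u["modules"][-1] == module_id]


units = split_snapshot_units(insert_fences(["a", "b", "c"], unit_size=2))
mark_processing(units[0], "parse")
mark_processing(units[0], "filter")
mark_processing(units[1], "parse")
pending_for_parse = units_to_resume(units, "parse")
```

In a real deployment the `units` list would live in the second storage source so that it survives the failure of the module that wrote to it.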
In a possible design, the method further includes:
When the node device where the first data processing module is located satisfies a preset condition, the first data processing module reduces the speed of obtaining data from the data source.
In a possible design, the method further includes:
When the node device corresponding to any data processing module other than the first data processing module satisfies the preset condition, that data processing module instructs the first data processing module to reduce the speed of obtaining data from the data source.
Based on the above designs, a backpressure mechanism is combined with the fully asynchronous data acquisition system. When a node device of the data acquisition system satisfies the preset condition, the first data processing module reduces the speed of obtaining data from the data source, which relieves memory shortage during traffic peaks and improves the robustness and reliability of the data acquisition system.
In a possible design, the preset condition includes that the memory currently occupied in the shared memory pool of the node device reaches a first preset threshold, where the shared memory pool is used to provide memory space for the plurality of data processing modules of the node device and the memory space is used to buffer received data; and/or,
The preset condition includes that the total number of threads of the data processing modules of the node device that are currently requesting memory reaches a second preset threshold.
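The "and/or" preset condition can be written as a single predicate. The function name, parameter names, and the `require_both` switch are hypothetical; the text only states that either test, or both, may make up the condition.

```python
def preset_condition_met(pool_used, pool_threshold,
                         requesting_threads, thread_threshold,
                         require_both=False):
    """Return True when the node device should enter backpressure."""
    # Test 1: occupied shared-pool memory reaches the first preset threshold.
    memory_high = pool_used >= pool_threshold
    # Test 2: threads currently requesting memory reach the second preset threshold.
    threads_high = requesting_threads >= thread_threshold
    if require_both:
        return memory_high and threads_high
    return memory_high or threads_high


under_pressure = preset_condition_met(
    pool_used=90, pool_threshold=80,
    requesting_threads=1, thread_threshold=10,
)
```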
In a possible design, any data processing module in the data acquisition system performs the corresponding data processing operation on any batch of data based on at least one thread;
The method further includes:
The data processing module increases its concurrency when a traffic peak is detected, where concurrency refers to the number of threads in the data processing module that process data.
In a possible design, the plurality of data processing modules are respectively located on a plurality of node devices.
It should be noted that the data acquisition method provided in this embodiment and the data acquisition system provided in the embodiment of FIG. 1 belong to the same concept. For the specific process, refer to the embodiment of FIG. 1; details are not repeated here.
FIG. 11 is a schematic structural diagram of a node device according to an embodiment of the present application. The node device 1100 may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) 1101 and one or more memories 1102, where the memory 1102 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 1101 to implement the methods provided by the foregoing method embodiments. The node device may also have components such as a wired or wireless network interface and an input/output interface for input and output, and may further include other components for implementing device functions, which are not described here.
In an exemplary embodiment, a computer-readable storage medium is also provided, for example a memory including instructions, where the instructions may be executed by a processor in a node device to perform the data acquisition method in the foregoing embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.
In an exemplary embodiment, the present application further provides a computer program product containing instructions which, when run on a node device, enables the node device to implement the data acquisition method in the foregoing embodiments.
In an exemplary embodiment, the present application further provides a chip that includes a processor and/or program instructions; when the chip runs, it implements the data acquisition method in the foregoing embodiments.

Claims (30)

  1. 一种数据采集系统,其特征在于,所述数据采集系统包括多个数据处理模块;A data acquisition system, characterized in that the data acquisition system includes a plurality of data processing modules;
    所述数据采集系统中第一个数据处理模块用于当获取到数据源的任一批数据时,指示所述数据源提供下一批数据;The first data processing module in the data acquisition system is used to instruct the data source to provide the next batch of data when any batch of data from the data source is obtained;
    所述数据采集系统中任一个数据处理模块用于对已接收到的任一批数据执行对应的数据处理操作时,接收下一批数据;Any data processing module in the data acquisition system is configured to receive the next batch of data when performing a corresponding data processing operation on any batch of data that has been received;
    所述数据采集系统中最后一个数据处理模块用于将处理后的数据存储至第一存储源中。The last data processing module in the data acquisition system is configured to store the processed data in a first storage source.
  2. 根据权利要求1所述的数据采集系统,其特征在于,所述数据采集系统中任一个数据处理模块具有对应的内存空间,所述内存空间用于缓存已接收到的数据;The data acquisition system according to claim 1, wherein any one of the data processing modules in the data acquisition system has a corresponding memory space, and the memory space is used to buffer received data;
    所述数据采集系统中任一个数据处理模块还用于对已接收到的任一批数据执行对应的数据处理操作时,将接收到的下一批数据缓存至所述任一个数据处理模块的内存空间中,当处理所述任一批数据完成后,从所述内存空间中读取所述下一批数据。Any one of the data processing modules in the data acquisition system is further configured to cache the next batch of received data to the memory of any one of the data processing modules when performing a corresponding data processing operation on any batch of data that has been received In the space, when the processing of any one batch of data is completed, the next batch of data is read from the memory space.
  3. 根据权利要求2所述的数据采集系统,其特征在于,所述数据采集系统还包括至少一个共享内存池,每个共享内存池用于为对应的多个数据处理模块提供内存空间。The data acquisition system according to claim 2, wherein the data acquisition system further comprises at least one shared memory pool, and each shared memory pool is used to provide a memory space for a corresponding plurality of data processing modules.
  4. 根据权利要求3所述的数据采集系统,其特征在于,所述任一个数据处理模块还用于从对应的共享内存池中申请内存空间,当申请得到的内存空间使用完毕后,将所述申请得到的内存空间释放回所述共享内存池中。The data acquisition system according to claim 3, wherein any one of the data processing modules is further configured to apply for a memory space from a corresponding shared memory pool, and when the memory space obtained by the application is used up, the application is applied. The obtained memory space is released back into the shared memory pool.
  5. 根据权利要求2至4中任一项所述的数据采集系统,其特征在于,所述数据采集系统中任一个数据处理模块的内存空间包括堆内内存;The data acquisition system according to any one of claims 2 to 4, wherein a memory space of any data processing module in the data acquisition system includes on-chip memory;
    所述任一个数据处理模块还用于将处理后的数据推送至下一个数据处理模块的堆内内存中。Any one of the data processing modules is further configured to push the processed data to a heap memory of a next data processing module.
  6. 根据权利要求5所述的数据采集系统,所述数据采集系统中任一个数据处理模块的内存空间还包括堆外内存;The data acquisition system according to claim 5, wherein a memory space of any one of the data processing modules in the data acquisition system further comprises off-heap memory;
    所述任一个数据处理模块还用于当所述任一个数据处理模块和下一个数据处理模块位于同一节点设备时,将处理后的数据推送至所述下一个数据处理模块的堆内内存中;或,The any data processing module is further configured to push the processed data to the in-heap memory of the next data processing module when the any data processing module and the next data processing module are located on the same node device; or,
    所述任一个数据处理模块还用于当所述任一个处理模块和下一个数据处理模块位于不同节点设备时,对处理后的数据进行序列化,将序列化的数据存储至所述任一个数据处理模块的堆外内存中,将所述序列化后的数据从所述堆外内存推送至下一个数据处理模块的堆内内存中。The any data processing module is further configured to serialize the processed data and store the serialized data to the any data when the any processing module and the next data processing module are located on different node devices. The off-heap memory of the processing module pushes the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
  7. 根据权利要求1所述的数据采集系统,其特征在于,所述数据采集系统还包括第二存储源,所述第二存储源用于存储所述数据采集系统当前处理的数据的快照单元,所述快照单元用于指示当前处理对应数据的数据处理模块;The data acquisition system according to claim 1, wherein the data acquisition system further comprises a second storage source, the second storage source is used to store a snapshot unit of data currently processed by the data acquisition system, and The snapshot unit is used to indicate a data processing module that currently processes corresponding data;
    任一个数据处理模块还用于当所述任一个数据处理模块宕机并重启后,基于所述第二存储源中的快照单元,恢复对宕机前处理的数据进行数据处理。Any one of the data processing modules is further configured to resume data processing on the data processed before the outage based on the snapshot unit in the second storage source after the any one of the data processing modules is down and restarted.
  8. 根据权利要求7所述的数据采集系统,其特征在于,所述数据采集系统中第一个数据处理模块还用于在所述数据源的不同数据之间插入栏栅消息,将相邻两个栏栅消息之间的所有数据作为一个快照单元,将所述快照单元存储至所述第二存储源中,其中所述栏栅消息用于指示快照单元的起始点或结束点。The data collection system according to claim 7, wherein the first data processing module in the data collection system is further configured to insert a fence message between different data of the data source, and All data between the fence messages is used as a snapshot unit, and the snapshot unit is stored in the second storage source, where the fence message is used to indicate a start point or an end point of the snapshot unit.
  9. 根据权利要求7或8所述的数据采集系统,其特征在于,所述数据采集系统中任一个数据处理模块还用于在处理任一数据的过程中,向所述第二存储源中所述数据所属的快照单元添加所述数据处理模块的标识。The data collection system according to claim 7 or 8, wherein any one of the data processing modules in the data collection system is further configured to, in the process of processing any data, send the data to the second storage source. The snapshot unit to which the data belongs adds the identification of the data processing module.
  10. 根据权利要求1或2所述的数据采集系统,其特征在于,所述第一个数据处理模块还用于当所述第一个数据处理模块所在的节点设备满足预设条件时,降低从所述数据源中获取数据的速度。The data acquisition system according to claim 1 or 2, wherein the first data processing module is further configured to reduce a slave device when the node device where the first data processing module is located meets a preset condition. Describes the speed of obtaining data from the data source.
  11. 根据权利要求10所述的数据采集系统,其特征在于,所述数据采集系统中所述第一个数据处理模块以外的任一个数据处理模块还用于在对应的节点设备满足预设条件时,指示所述第一个数据处理模块降低从所述数据源中获取数据的速度。The data acquisition system according to claim 10, wherein any data processing module other than the first data processing module in the data acquisition system is further configured to, when a corresponding node device meets a preset condition, Instruct the first data processing module to reduce the speed of obtaining data from the data source.
  12. 根据权利要求10或11所述的数据采集系统,其特征在于,所述预设条件包括节点设备的共享内存池当前占用的内存达到第一预设阈值,所述共享内存池用于为所述节点设备的多个数据处理模块提供内存空间,所述内存空间用于缓存已接收到的数据;和/或,The data acquisition system according to claim 10 or 11, wherein the preset condition includes that a memory currently occupied by the shared memory pool of the node device reaches a first preset threshold, and the shared memory pool is configured for the A plurality of data processing modules of the node device provide a memory space, which is used to buffer the received data; and / or,
    所述预设条件包括节点设备的数据处理模块的当前申请内存的线程的总数量达到第二预设阈值。The preset condition includes that the total number of threads currently applying for memory of the data processing module of the node device reaches a second preset threshold.
  13. 根据权利要求1所述的数据采集系统,其特征在于,所述数据采集系统中任一个数据处理模块还用于基于至少一个线程,对任一批数据执行对应的数据处理操作;The data acquisition system according to claim 1, wherein any one of the data processing modules in the data acquisition system is further configured to perform a corresponding data processing operation on any batch of data based on at least one thread;
    所述任一个数据处理模块还用于当检测到流量峰值时增加并发量,所述并发量是指数据处理模块中对数据进行处理的线程的数量。Any one of the data processing modules is further configured to increase the amount of concurrency when a traffic peak is detected. The amount of concurrency refers to the number of threads that process data in the data processing module.
  14. The data acquisition system according to claim 1, wherein the plurality of data processing modules are respectively located in a plurality of node devices.
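As an illustration only, the back-pressure behaviour of claims 10 to 12 might be sketched in Python as follows. The `NodeStatus` record, the function names, the halving factor and the minimum rate are all assumptions made for this sketch; the claims specify only the two threshold conditions and that the first data processing module reduces its acquisition speed.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Hypothetical view of one node device's resource usage."""
    pool_bytes_in_use: int          # memory currently occupied in the shared memory pool
    threads_requesting_memory: int  # threads currently applying for memory


def meets_preset_condition(status, first_threshold, second_threshold):
    """True when either threshold is reached (the "and/or" of claim 12)."""
    return (status.pool_bytes_in_use >= first_threshold
            or status.threads_requesting_memory >= second_threshold)


def next_fetch_rate(current_rate, statuses, first_threshold, second_threshold,
                    slowdown=0.5, min_rate=1.0):
    """Rate at which the first module should pull batches from the data source.

    If any node device meets the preset condition, the rate is reduced.
    The slowdown factor and the floor are illustrative assumptions.
    """
    under_pressure = any(
        meets_preset_condition(s, first_threshold, second_threshold)
        for s in statuses
    )
    if under_pressure:
        return max(min_rate, current_rate * slowdown)
    return current_rate
```

In this sketch a downstream node reports its status to the first module, which corresponds loosely to the "instruct" step of claim 11; how that instruction is transported is not covered here.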
  15. A data acquisition method, wherein the method is applied to a data acquisition system comprising a plurality of data processing modules, and the method comprises:
    when a first data processing module obtains any batch of data from a data source, the first data processing module instructs the data source to provide a next batch of data;
    when any data processing module performs a corresponding data processing operation on any batch of data that has been received, the data processing module receives a next batch of data;
    a last data processing module stores the processed data in a first storage source.
  16. The method according to claim 15, wherein any data processing module in the data acquisition system has a corresponding memory space, the memory space being used to buffer received data;
    the step in which, when any data processing module performs a corresponding data processing operation on any batch of data that has been received, the data processing module receives a next batch of data comprises:
    while the data processing module performs the corresponding data processing operation on the batch of data that has been received, the data processing module buffers the received next batch of data in its memory space;
    after the data processing module finishes processing the batch of data, the data processing module reads the next batch of data from the memory space.
  17. The method according to claim 15, wherein the data acquisition system further comprises at least one shared memory pool, each shared memory pool being used to provide memory space for a corresponding plurality of data processing modules.
  18. The method according to claim 15, wherein the step in which, when any data processing module performs a corresponding data processing operation on any batch of data that has been received, the data processing module receives a next batch of data comprises:
    the data processing module applies for memory space from a corresponding shared memory pool;
    after the memory space obtained by the application has been used, the data processing module releases that memory space back to the shared memory pool.
  19. The method according to any one of claims 16 to 18, wherein the memory space of any data processing module in the data acquisition system comprises on-heap memory;
    after the first data processing module instructs the data source to provide the next batch of data, the method further comprises:
    the data processing module pushes the processed data to the on-heap memory of the next data processing module.
  20. The method according to claim 19, wherein the memory space of any data processing module in the data acquisition system further comprises off-heap memory;
    the step in which the data processing module pushes the processed data to the on-heap memory of the next data processing module comprises:
    when the data processing module and the next data processing module are located on the same node device, the data processing module pushes the processed data to the on-heap memory of the next data processing module; or,
    when the data processing module and the next data processing module are located on different node devices, the data processing module serializes the processed data, stores the serialized data in its own off-heap memory, and pushes the serialized data from the off-heap memory to the on-heap memory of the next data processing module.
  21. The method according to claim 15, wherein the data acquisition system further comprises a second storage source, the second storage source being used to store snapshot units of the data currently processed by the data acquisition system, each snapshot unit being used to indicate the data processing module that is currently processing the corresponding data;
    the method further comprises:
    after any data processing module goes down and restarts, the data processing module resumes, based on the snapshot units in the second storage source, data processing of the data it was processing before the downtime.
  22. The method according to claim 21, wherein before the data processing module resumes, based on the snapshot units in the second storage source, data processing of the data it was processing before the downtime, the method further comprises:
    the first data processing module inserts fence messages between different data of the data source;
    the first data processing module takes all data between two adjacent fence messages as one snapshot unit;
    the first data processing module stores the snapshot unit in the second storage source, wherein a fence message is used to indicate a start point or an end point of a snapshot unit.
  23. The method according to claim 21 or 22, wherein before the data processing module resumes, based on the snapshot units in the second storage source, data processing of the data it was processing before the downtime, the method further comprises:
    while any data processing module is processing any data, the data processing module adds its identifier to the snapshot unit in the second storage source to which the data belongs.
  24. The method according to claim 15 or 16, wherein the method further comprises:
    when the node device where the first data processing module is located meets a preset condition, the first data processing module reduces the speed of obtaining data from the data source.
  25. The method according to claim 24, wherein the method further comprises:
    when the node device corresponding to any data processing module other than the first data processing module meets the preset condition, that data processing module instructs the first data processing module to reduce the speed of obtaining data from the data source.
  26. The method according to claim 24 or 25, wherein the preset condition comprises that the memory currently occupied in a shared memory pool of the node device reaches a first preset threshold, the shared memory pool being used to provide memory space for a plurality of data processing modules of the node device, and the memory space being used to buffer received data; and/or,
    the preset condition comprises that the total number of threads of the data processing modules of the node device that are currently applying for memory reaches a second preset threshold.
  27. The method according to claim 15, wherein any data processing module in the data acquisition system is further configured to perform the corresponding data processing operation on any batch of data based on at least one thread;
    the method further comprises:
    the data processing module increases its concurrency when a traffic peak is detected, the concurrency being the number of threads in the data processing module that process data.
  28. The method according to claim 15, wherein the plurality of data processing modules are respectively located in a plurality of node devices.
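The snapshot mechanism of claims 21 to 23 can be pictured with the following minimal Python sketch. It is not the applicant's implementation: the `FENCE` sentinel, the fixed `unit_size`, and the in-memory `SnapshotStore` standing in for the second storage source are all hypothetical choices made for illustration.

```python
FENCE = object()  # hypothetical sentinel standing in for a fence message


def insert_fences(records, unit_size):
    """Yield the records with a fence before the first one and after every unit."""
    count = 0
    yield FENCE
    for record in records:
        yield record
        count += 1
        if count % unit_size == 0:
            yield FENCE
    if count % unit_size != 0:
        yield FENCE  # close the final, shorter unit


def split_into_snapshot_units(stream):
    """Treat all records between two adjacent fences as one snapshot unit."""
    units, current = [], None
    for item in stream:
        if item is FENCE:
            if current:  # a fence ends the unit collected so far
                units.append(current)
            current = []  # and starts the next one
        elif current is not None:
            current.append(item)
    return units


class SnapshotStore:
    """In-memory stand-in for the second storage source."""

    def __init__(self):
        self.units = {}  # unit id -> {"data": list, "module": str or None}

    def save(self, unit_id, data):
        self.units[unit_id] = {"data": list(data), "module": None}

    def mark(self, unit_id, module_id):
        """Record which module is currently processing the unit (claim 23)."""
        self.units[unit_id]["module"] = module_id

    def pending_for(self, module_id):
        """Units a restarted module should resume processing (claim 21)."""
        return [u["data"] for u in self.units.values() if u["module"] == module_id]
```

After a restart, a module in this sketch would call `pending_for` with its own identifier and reprocess the returned units.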
  29. A node device, wherein the node device comprises a processor and a memory, the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed in the data acquisition method according to any one of claims 15 to 28.
  30. A computer-readable storage medium, wherein the storage medium stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed in the data acquisition method according to any one of claims 15 to 28.
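For orientation, the pipelined method of claims 15 and 16, which the stored instructions would implement, might look roughly like the Python sketch below. It assumes one thread per data processing module and uses a bounded `queue.Queue` as each module's memory space, so that a module can receive the next batch while it is still processing the current one. The names `run_stage`, `run_pipeline`, `_DONE` and `buffer_size` are hypothetical, and a plain list stands in for the first storage source.

```python
import queue
import threading

_DONE = object()  # hypothetical end-of-stream marker


def run_stage(operation, inbox, outbox):
    """Process batches one by one; the inbox keeps buffering later batches meanwhile."""
    while True:
        batch = inbox.get()
        if batch is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(operation(batch))


def run_pipeline(source_batches, operations, buffer_size=2):
    """Chain one thread per module and return what the last module stores."""
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(len(operations) + 1)]
    stages = [
        threading.Thread(target=run_stage, args=(op, queues[i], queues[i + 1]))
        for i, op in enumerate(operations)
    ]
    storage = []  # stand-in for the first storage source

    def drain():
        while True:
            result = queues[-1].get()
            if result is _DONE:
                return
            storage.append(result)

    sink = threading.Thread(target=drain)
    for t in stages:
        t.start()
    sink.start()
    # Batches are handed over as soon as buffer space is free, without
    # waiting for earlier batches to finish processing.
    for batch in source_batches:
        queues[0].put(batch)
    queues[0].put(_DONE)
    for t in stages:
        t.join()
    sink.join()
    return storage
```

Because each stage here is a single thread reading from a first-in, first-out queue, batches leave the pipeline in the order they entered it; the bounded queue size also gives a simple form of flow control when a downstream stage is slow.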
PCT/CN2019/087226 2018-05-25 2019-05-16 Data acquisition system and method, node device and storage medium WO2019223599A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810515496.4A CN110597890B (en) 2018-05-25 2018-05-25 Data acquisition system, data acquisition method, node device, and storage medium
CN201810515496.4 2018-05-25

Publications (1)

Publication Number Publication Date
WO2019223599A1 (en) 2019-11-28

Family

ID=68616079

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/087226 WO2019223599A1 (en) 2018-05-25 2019-05-16 Data acquisition system and method, node device and storage medium

Country Status (2)

Country Link
CN (1) CN110597890B (en)
WO (1) WO2019223599A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767130A (en) * 2020-07-01 2020-10-13 菏泽学院 Data acquisition system, method, equipment and storage medium
CN112543134B (en) * 2020-12-08 2022-08-09 航天科技控股集团股份有限公司 CAN data storage system based on T-BOX platform
CN112597247B (en) * 2020-12-25 2022-05-31 杭州数梦工场科技有限公司 Data synchronization method and device
CN113630442B (en) * 2021-07-14 2023-09-12 远景智能国际私人投资有限公司 Data transmission method, device and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198109A (en) * 2013-03-28 2013-07-10 北京圆通慧达管理软件开发有限公司 Data processing system and method
CN106790642A (en) * 2017-01-10 2017-05-31 深圳淞鑫金融服务科技发展有限公司 The dispatching method and device of big data acquisition tasks
CN107040608A (en) * 2017-05-19 2017-08-11 宁波绮耘软件股份有限公司 A kind of data processing method and system
CN107908471A (en) * 2017-09-26 2018-04-13 聚好看科技股份有限公司 A kind of tasks in parallel processing method and processing system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7865529B2 (en) * 2003-11-18 2011-01-04 Intelligent Model, Limited Batch processing apparatus
CN107395669B (en) * 2017-06-01 2020-04-07 华南理工大学 Data acquisition method and system based on streaming real-time distributed big data
CN108021067A (en) * 2017-12-14 2018-05-11 浙江晨泰科技股份有限公司 A kind of creation data intelligent acquisition system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114070563A (en) * 2020-07-31 2022-02-18 中移(苏州)软件技术有限公司 Data processing method, device, terminal and storage medium
CN114070563B (en) * 2020-07-31 2023-09-05 中移(苏州)软件技术有限公司 Data processing method, device, terminal and storage medium
CN112597371A (en) * 2020-12-25 2021-04-02 牧原食品股份有限公司 Data acquisition system, method and device based on message middleware

Also Published As

Publication number Publication date
CN110597890B (en) 2022-04-05
CN110597890A (en) 2019-12-20

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19807912; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 19807912; Country of ref document: EP; Kind code of ref document: A1)