CN114064312A - Data processing system and model training method - Google Patents

Data processing system and model training method

Info

Publication number
CN114064312A
CN114064312A (Application No. CN202111332051.0A)
Authority
CN
China
Prior art keywords
node
data
shared memory
sub
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111332051.0A
Other languages
Chinese (zh)
Inventor
郭峰 (Guo Feng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202111332051.0A
Publication of CN114064312A
Legal status: Pending


Classifications

    • G06F 9/544 Interprogram communication: buffers; shared memory; pipes
    • G06F 9/546 Interprogram communication: message passing systems or structures, e.g. queues
    • G06N 20/00 Machine learning
    • G06N 3/08 Neural networks: learning methods

Abstract

The embodiment of the invention provides a data processing system and a model training method, relating to the technical field of big data processing. The system comprises at least one task node, each task node comprising: at least one data calculation sub-node, at least one model training sub-node and a shared memory. Each data calculation sub-node is a sub-node in a real-time data computing system, and each model training sub-node is a sub-node in a model training system. Each data calculation sub-node is used for executing a specified processing operation on acquired first to-be-processed real-time data to obtain a first processing result, and storing the first processing result in the shared memory. Each model training sub-node is used for reading the first processing result from the shared memory and performing model training with the first processing result to obtain a trained target model. Compared with the prior art, the scheme provided by the embodiment of the invention can improve the timeliness of the processing results of real-time data.

Description

Data processing system and model training method
Technical Field
The invention relates to the technical field of big data processing, in particular to a data processing system and a model training method.
Background
Currently, as timeliness requirements for data processing rise across industries, many real-time data computing systems have appeared to process real-time data and obtain processing results, such as real-time data characteristics. As the timeliness requirements of various model training algorithms on these processing results also increase, the processing results produced by the real-time data computing system need to be pushed efficiently to the model training system for training.
However, in the related art, the real-time data computing system and the model training system are not integrated: a processing result obtained by the real-time data computing system can be read and utilized by the model training system only after the real-time data computing system has written it to disk, so the timeliness of the processing results of real-time data is poor.
Based on this, how to integrate the real-time data computing system and the model training system and improve the timeliness of the processing results of real-time data has become a problem to be solved at present.
Disclosure of Invention
The embodiment of the invention aims to provide a data processing system and a model training method, so as to realize the integration of a real-time data computing system and a model training system and improve the timeliness of the processing result of real-time data. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a data processing system, where the system includes at least one task node, and each task node includes: at least one data calculation sub-node, at least one model training sub-node and a shared memory; each data calculation sub-node is a sub-node in the real-time data computing system, and each model training sub-node is a sub-node in the model training system;
each data calculation sub-node is used for executing appointed processing operation on the acquired first to-be-processed real-time data to obtain a first processing result, and storing the first processing result into the shared memory;
and each model training child node is used for reading the first processing result from the shared memory and performing model training by using the first processing result to obtain a trained target model.
Optionally, in a specific implementation manner, each model training child node is further configured to store the target model in the shared memory; each data calculation sub-node is further configured to acquire the target model from the shared memory, process the acquired second to-be-processed real-time data by using the target model to obtain a second processing result, and store the second processing result in the shared memory.
Optionally, in a specific implementation manner, each task node is provided with a shared memory management service; each data computation sub-node is further configured to:
registering the data information of the first processing result in the shared memory management service;
acquiring a first reference address of the data information in the shared memory management service, and sending the first reference address to each model training child node;
wherein the data information comprises: the node identification of the data calculation sub-node, the storage address of the first processing result in the shared memory and the variable value corresponding to the first processing result;
each model training child node reads the first processing result from the shared memory, including:
each model training child node reads the data information from the shared memory management service according to the received first reference address, and reads the first processing result from the shared memory according to the data information;
each data calculation child node stores the first processing result to the shared memory, including:
each data calculation sub-node stores the first processing result to the shared memory according to a first memory protocol; wherein the data information comprises: a protocol identification of the first memory protocol.
Optionally, in a specific implementation manner, each task node is provided with a shared memory management service; each model training sub-node is further configured to:
registering the model information of the target model into the shared memory management service;
acquiring a second reference address of the model information in the shared memory management service, and sending the second reference address to each data calculation child node;
wherein the model information includes: node identification of the model training child node, storage address of the target model in the shared memory and variable value corresponding to the target model;
each data computation sub-node obtains the target model from the shared memory, and the method comprises the following steps:
each data calculation sub-node reads the model information from the shared memory management service according to the received second reference address, and reads the target model from the shared memory according to the model information;
each model training child node stores the target model to the shared memory, including:
each model training child node stores the target model into the shared memory according to a second memory protocol; wherein the model information includes: a protocol identification of the second memory protocol.
Optionally, in a specific implementation manner, the real-time data computing system is: a distributed real-time data computing system; the model training system is: a distributed model training system; each task node is provided with a state management service, and each data calculation sub-node is further used for:
acquiring the first to-be-processed real-time data from a message queue of at least one data source;
inserting a specified identifier into each message queue according to a preset period; wherein the specified identifier comprises: partition information of the message queue and the offset address of the currently read data in the message queue;
in the state management service, recording the data state corresponding to each message queue determined based on each specified identifier; wherein the data state corresponding to each message queue is used for characterizing: the offset address, in the message queue, of the data acquired from the message queue by each data calculation sub-node.
Optionally, in a specific implementation manner, the number of the data sources is multiple; before executing the specified processing operation on the acquired first to-be-processed real-time data, each data calculation sub-node is further configured to:
determining whether the specified identifiers inserted in the message queues of the data sources are aligned;
and if they are aligned, executing the specified processing operation on the acquired first to-be-processed real-time data.
Optionally, in a specific implementation manner, after being restarted following a downtime, each data calculation sub-node is further configured to:
acquiring data states corresponding to all message queues recorded in the state management service;
each data calculation sub-node acquires the first to-be-processed real-time data from a message queue of at least one data source, and the method comprises the following steps:
and each data calculation sub-node acquires the first to-be-processed real-time data from each message queue from the offset address represented by each acquired data state.
In a second aspect, an embodiment of the present invention provides a model training method, where the method is applied to any target model training sub-node in any target task node of a data processing system; wherein the data processing system comprises: at least one task node, each task node comprising: at least one data calculation sub-node, at least one model training sub-node and a shared memory, wherein each data calculation sub-node is a sub-node in a real-time data computing system, and each model training sub-node is a sub-node in a model training system; the method comprises the following steps:
reading a first processing result from a target shared memory included in the target task node to which the target model training sub-node belongs; wherein the first processing result is obtained by each first data calculation sub-node included in the target task node executing a specified processing operation on acquired first to-be-processed real-time data, and is stored in the target shared memory;
and performing model training by using the read first processing result to obtain a trained target model.
Optionally, in a specific implementation manner, the real-time data computing system is: a distributed real-time data computing system; the model training system is: a distributed model training system; the method further comprises:
and storing the target model into the target shared memory, so that each first data calculation sub-node included in the target task node acquires the target model from the target shared memory, processes the acquired second to-be-processed real-time data by using the target model to obtain a second processing result, and stores the second processing result into the target shared memory.
Optionally, in a specific implementation manner, each task node is provided with a shared memory management service, and the method further comprises:
registering the model information of the target model into a target shared memory management service provided in the target task node;
acquiring a second reference address of the model information in the target shared memory management service, and sending the second reference address to each data calculation sub-node included in the target task node; wherein the model information includes: the node identification of the target model training child node, the storage address of the target model in the target shared memory and the variable value corresponding to the target model;
the storing the target model to the target shared memory so that each first data computation child node included in the target task node obtains the target model from the target shared memory includes:
according to a second memory protocol, storing the target model into the target shared memory, so that each data computation sub-node included in the target task node reads the model information from the target shared memory management service according to the received second reference address, and reads the target model from the target shared memory according to the model information; wherein the model information includes: a protocol identification of the second memory protocol.
The embodiment of the invention has the following beneficial effects:
by applying the scheme provided by the embodiment of the invention, a data processing system can be established on the basis of a real-time data computing system and a model training system. Wherein the data processing system comprises at least one task node, and each task node comprises: at least one data computation sub-node in the real-time data computation system and at least one model training sub-node in the model training system, and each task node may further include a shared memory.
Furthermore, in the operation process of the data processing system, for each task node, each data computation sub-node in the task node can acquire real-time data to be processed, and perform specified processing operation on the acquired real-time data to obtain a processing result. The data computation child node may then store the processing result in the shared memory of the task node. Therefore, each model training sub-node in the task node can directly read the processing result from the shared memory of the task node, and model training is performed by using the read processing result to obtain a trained target model.
Based on this, in the embodiment of the present invention, the data calculation sub-node in the real-time data computing system and the model training sub-node in the model training system may be cooperatively deployed on the same task node, so that the integration of the real-time data computing system and the model training system may be realized. Furthermore, the processing result obtained by the data calculation sub-node processing real-time data can be placed directly into the memory of the task node to which the data calculation sub-node belongs, and the model training sub-node can read the processing result directly from that memory for model training. Therefore, the disk-writing step in transmitting the processing result between the data calculation sub-node and the model training sub-node can be eliminated, efficient transmission of the processing result between them is realized, and the timeliness of the processing results of real-time data is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.
FIG. 1 is a block diagram of a data processing system according to an embodiment of the present invention;
FIG. 2 is a block diagram of another data processing system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a specific example of a specific implementation provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of another specific example of a specific implementation provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived from the embodiments given herein by one of ordinary skill in the art, are within the scope of the invention.
In the related art, the real-time data computing system and the model training system are not integrated: a processing result obtained by the real-time data computing system can be read and utilized by the model training system only after the real-time data computing system has written it to disk, so the timeliness of the processing results of real-time data is poor. Based on this, how to integrate the real-time data computing system and the model training system and improve the timeliness of the processing results of real-time data has become a problem to be solved at present.
In order to solve the above technical problem, an embodiment of the present invention provides a data processing system.
The data processing system is applicable to various application scenarios that have requirements on data processing timeliness and need to use the processing results of real-time data for model training, for example, real-time recommendation scenarios, various types of risk control scenarios, and the like.
In such application scenarios, real-time data needs to be processed by a real-time data computing system to obtain processing results, and a model training system needs to perform model training with those processing results to obtain a trained target model; therefore, the data processing system is established based on various real-time data computing systems and various model training systems.
For example, the real-time data computing system may be: Flume (a real-time log collection system), Spark Streaming (a streaming data processing system), Storm (a distributed real-time big data processing system), Flink (a distributed stream processing system), etc., and the model training system may be: TensorFlow, PyTorch, Caffe (Convolutional Architecture for Fast Feature Embedding), Ray, etc. For example, the data processing system may be established by combining the above-mentioned Flink and Ray.
It should be emphasized that the embodiments of the present invention do not limit the application scenario of the data processing system, nor the real-time data computing system and model training system used to establish it; in practical applications, a matching real-time data computing system and model training system can be selected according to the requirements of the application scenario to establish the data processing system provided in the embodiments of the present invention.
The data processing system provided by the embodiment of the invention comprises at least one task node, wherein each task node comprises: at least one data calculation sub-node, at least one model training sub-node and a shared memory; each data calculation sub-node is a sub-node in the real-time data computing system, and each model training sub-node is a sub-node in the model training system;
each data calculation sub-node is used for executing appointed processing operation on the acquired first to-be-processed real-time data to obtain a first processing result, and storing the first processing result into the shared memory;
and each model training child node is used for reading the first processing result from the shared memory and performing model training by using the first processing result to obtain a trained target model.
Therefore, the data processing system can be established on the basis of the real-time data computing system and the model training system by applying the scheme provided by the embodiment of the invention. Wherein the data processing system comprises at least one task node, and each task node comprises: at least one data computation sub-node in the real-time data computation system and at least one model training sub-node in the model training system, and each task node may further include a shared memory.
Furthermore, in the operation process of the data processing system, for each task node, each data computation sub-node in the task node can acquire real-time data to be processed, and perform specified processing operation on the acquired real-time data to obtain a processing result. The data computation child node may then store the processing result in the shared memory of the task node. Therefore, each model training sub-node in the task node can directly read the processing result from the shared memory of the task node, and model training is performed by using the read processing result to obtain a trained target model.
Based on this, in the embodiment of the present invention, the data calculation sub-node in the real-time data computing system and the model training sub-node in the model training system may be cooperatively deployed on the same task node, so that the integration of the real-time data computing system and the model training system may be realized. Furthermore, the processing result obtained by the data calculation sub-node processing real-time data can be placed directly into the memory of the task node to which the data calculation sub-node belongs, and the model training sub-node can read the processing result directly from that memory for model training. Therefore, the disk-writing step in transmitting the processing result between the data calculation sub-node and the model training sub-node can be eliminated, efficient transmission of the processing result between them is realized, and the timeliness of the processing results of real-time data is improved.
A data processing system according to an embodiment of the present invention will be specifically described below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a data processing system according to an embodiment of the present invention. As shown in Fig. 1, the data processing system includes at least one task node 110, each task node including at least one data calculation sub-node 101, at least one model training sub-node 102, and a shared memory 103.
Each data calculation sub-node 101 is a sub-node in a real-time data computing system, and each model training sub-node 102 is a sub-node in a model training system.
Optionally, as shown in fig. 2, the real-time data computing system to which each data computing sub-node 101 belongs is: a distributed real-time data computing system; correspondingly, the model training system to which each of the model training sub-nodes 102 belongs is: a distributed model training system.
When the distributed real-time data computing system and the distributed model training system each have a master node for system management, the master node of the distributed real-time data computing system and the master node of the distributed model training system may be deployed on the same task node in the data processing system provided in the embodiment of the present invention.
Further, in the embodiment of the present invention, each task node may be understood as a data processing server. Each data calculation sub-node 101 and each model training sub-node 102 included in a task node are, respectively, a data calculation program for performing real-time data processing and a model training program for performing model training, and the shared memory is a memory space determined in the data processing server, used for storing the processing results obtained by each data calculation sub-node 101 processing real-time data, so that they can be read and utilized by each model training sub-node 102.
That is to say, in the embodiment of the present invention, the data calculation sub-node 101 and the model training sub-node 102, belonging respectively to the real-time data computing system and the model training system, are integrated into the same data processing server; that is, the two programs for real-time data processing and model training are integrated into the same hardware device. Further, a memory space is determined in the memory of the hardware device for storing the processing results obtained by the real-time data processing program, so that the model training program can read these processing results from that memory space for model training. Since the determined memory space is used jointly by the data calculation sub-node 101 and the model training sub-node 102, it may be referred to as a shared memory.
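Exemplarily, this handoff inside one task node can be illustrated with the following minimal Python sketch. It assumes the data calculation program and the model training program run as two processes on the same server and exchange bytes through a named shared memory segment; the segment name, the field names, and the length bookkeeping are illustrative assumptions, not part of the patent.

```python
import json
from multiprocessing import shared_memory

SEGMENT_NAME = "task_node_shm"  # hypothetical name for the shared memory 103

def compute_subnode_store(features: dict) -> int:
    """Data calculation sub-node: write a processing result into shared memory."""
    payload = json.dumps(features).encode("utf-8")
    shm = shared_memory.SharedMemory(name=SEGMENT_NAME, create=True,
                                     size=len(payload))
    shm.buf[:len(payload)] = payload
    shm.close()                 # detach; the segment itself stays alive
    return len(payload)

def training_subnode_read(length: int) -> dict:
    """Model training sub-node: read the same result with no disk I/O."""
    shm = shared_memory.SharedMemory(name=SEGMENT_NAME)  # attach by name
    payload = bytes(shm.buf[:length])
    shm.close()
    return json.loads(payload.decode("utf-8"))

n = compute_subnode_store({"user_id": 42, "clicks_1h": 7})
print(training_subnode_read(n))  # {'user_id': 42, 'clicks_1h': 7}
shared_memory.SharedMemory(name=SEGMENT_NAME).unlink()  # free the segment
```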
Wherein, for each task node 110, each data computation sub-node 101 in the task node 110 may be configured to: and acquiring first real-time data to be processed, and executing appointed processing operation on the acquired first real-time data to be processed to obtain a first processing result.
For example, the specified processing operation may be: missing-value filling and cleaning, dimension-table attribute association, one-hot encoding of data, bucketing, and the like; of course, the above-mentioned operations are merely illustrative of the specified processing operation, and are not limiting.
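The following is a hedged sketch of such operations using pandas; the column names, bucket boundaries, and labels are illustrative only.

```python
import pandas as pd

raw = pd.DataFrame({"age": [23, None, 41], "city": ["hz", "sh", "hz"]})

out = raw.copy()
out["age"] = out["age"].fillna(out["age"].median())   # missing-value filling
out = pd.get_dummies(out, columns=["city"])           # one-hot encoding
out["age_bucket"] = pd.cut(out["age"], bins=[0, 30, 60, 120],
                           labels=["young", "mid", "senior"])  # bucketing
print(out)
```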
After obtaining the first processing result, each data computing sub-node 101 may store the first processing result in the shared memory 103 included in the task node 110.
Furthermore, each model training child node 102 in the task node 110 may read the first processing result stored by each data computing child node 101 from the shared memory 103 included in the task node 110, and perform model training using the read first processing result to obtain a trained target model.
Each model training sub-node 102 may be pre-stored with a designated algorithm, so that after the first processing result is read, the designated algorithm may be trained by using the first processing result, and a trained model may be obtained.
Optionally, each model training sub-node 102 in the task node 110 may communicate with a service system for executing various data processing tasks, so that each model training sub-node 102 may directly send each model to the service system after obtaining the trained model, so that the service system may complete the corresponding data processing task by using each received model.
It is understood that, in many cases, for each task node, each data calculation sub-node 101 needs to use a model that the model training sub-nodes 102 have finished training when processing the acquired real-time data.
Therefore, in order to ensure that each data computation sub-node 101 can efficiently obtain the model trained by each model training sub-node 102, after each model training sub-node 102 obtains the target model through training, the target model may also be stored in the shared memory of the task node, so as to be obtained and used by each data computation sub-node 101 in the task node.
Based on this, optionally, in a specific implementation,
each model training child node 102 is further configured to store a target model in the shared memory 103;
each data calculation child node 101 is further configured to obtain a target model from the shared memory 103, process the obtained second to-be-processed real-time data by using the target model to obtain a second processing result, and store the second processing result in the shared memory 103.
In this specific implementation manner, for each task node 110, each model training child node 102 in the task node 110 may also store an object model in the shared memory 103 of the task node 110 after the object model is obtained through training. In this way, when acquiring the second to-be-processed real-time data, each data calculation sub-node 101 may acquire the target model from the shared memory 103 of the task node 110, and thus, may process the acquired second to-be-processed real-time data by using the acquired target model to obtain a second processing result.
Further, considering that each model training sub-node 102 in the task node 110 may need to perform model training using the second processing result, each data computing sub-node 101 may continue to store the obtained second processing result in the shared memory 103 of the task node 110.
That is, in this specific implementation manner, for each task node 110, the data computation sub-nodes 101 and the model training sub-nodes 102 in the task node 110 may implement information interaction.
For each task node 110, each data computation sub-node 101 in the task node 110 may obtain, from the shared memory 103 of the task node 110, a target model trained by each model training sub-node 102 in the task node 110, process the obtained real-time data by using the target model, and further store the processing result again in the shared memory 103 of the task node 110; each model training child node 102 in the task node 110 may obtain, from the shared memory 103 of the task node 110, a processing result obtained by processing by each data computation child node 101 in the task node 110, perform model training using the processing result, and further store a target model obtained by training in the shared memory 103 of the task node 110 again.
By circulating in this way, efficient information exchange between each data calculation sub-node 101 and each model training sub-node 102 in the task node 110 can be realized, improving the efficiency of real-time data processing and model training.
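This circular interaction can be sketched as follows; the dictionary standing in for the shared memory 103 and every function name are illustrative assumptions, and the "training" step is deliberately trivial.

```python
shm = {}  # stands in for the shared memory 103 of the task node

def data_subnode_step(raw_record: float) -> None:
    """Data calculation sub-node: process data, optionally score it, store result."""
    model = shm.get("target_model")            # latest trained model, if any
    result = {"feature": raw_record * 2}       # specified processing operation
    if model is not None:
        result["score"] = model["weight"] * result["feature"]
    shm["processing_result"] = result          # store the result back

def training_subnode_step() -> None:
    """Model training sub-node: read the result, 'train', store the model."""
    result = shm.get("processing_result")
    if result is not None:
        shm["target_model"] = {"weight": 0.5 * result["feature"]}

for record in [1.0, 2.0, 3.0]:  # each cycle exchanges data purely in memory
    data_subnode_step(record)
    training_subnode_step()
print(shm)
```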
Moreover, from a system perspective, since the processing result and the target model are stored only in the shared memory 103, it is not necessary to occupy the memory of each data calculation sub-node 101 to store the target model, nor the memory of each model training sub-node 102 to store the processing result, so the memory waste caused by one piece of data occupying two memory spaces can be reduced.
Based on this, in the embodiment of the present invention, the data calculation sub-node in the real-time data computing system and the model training sub-node in the model training system may be cooperatively deployed on the same task node, so that the integration of the real-time data computing system and the model training system may be realized. Furthermore, the processing result obtained by the data calculation sub-node processing real-time data can be placed directly into the memory of the task node to which the data calculation sub-node belongs, and the model training sub-node can read the processing result directly from that memory for model training. Therefore, the disk-writing step in transmitting the processing result between the data calculation sub-node and the model training sub-node can be eliminated, efficient transmission of the processing result between them is realized, and the timeliness of the processing results of real-time data is improved.
Optionally, in a specific implementation manner, a shared memory management service may be provided in each task node 110. The shared memory management service may be understood as a class description, for example, Java or C++ code executed in a process, or code attached to each data calculation sub-node 101 or inside each model training sub-node 102. The shared memory management service may provide a storage area for storing various types of data information about the shared memory 103, for example, the data information of the first processing result, the model information of the target model, and the like; for example, the data information of the first processing result may include: variable values, physical memory addresses, footprint size, etc. Furthermore, the shared memory management service needs to provide an access interface so that other processes or internal threads can access it.
Furthermore, in this specific implementation manner, for each task node 110, each data computation child node 101 in the task node 110 may be further configured to execute the following steps 11 to 12:
step 11: registering data information of the first processing result in a shared memory management service;
step 12: acquiring a first reference address of data information in shared memory management service, and sending the first reference address to each model training child node;
wherein the data information includes: the node identification of the data calculation sub-node, the storage address of the first processing result in the shared memory and the variable value corresponding to the first processing result.
In this specific implementation manner, after each data computation sub-node 101 in the task node 110 stores the obtained first processing result in the shared memory 103 in the task node 110, the data information of the first processing result may be determined.
The data information of the first processing result may be used to uniquely determine the first processing result from the shared memory 103 in the task node 110, that is, the data information uniquely corresponds to the first processing result, so that the first processing result may be read from the shared memory 103 in the task node 110 according to the data information.
Based on this, the data information may include: the node identification of the data calculation sub-node 101, the storage address of the first processing result in the shared memory 103 of the task node 110, and the variable value corresponding to the first processing result. The variable value corresponding to the first processing result is the reference variable of the first processing result; optionally, the variable value may have the structure: the node identification of the data calculation sub-node that generated the first processing result, the storage address of the first processing result in the shared memory of the task node, and the data size of the first processing result. For example: node2&0xffffffff&20kb (KB, kilobytes).
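A small sketch of building and parsing this variable value follows; the field layout comes from the description above, while the helper names are assumptions.

```python
def build_variable_value(node_id: str, address: int, size_kb: int) -> str:
    """Compose: node identification & storage address & data size."""
    return f"{node_id}&{hex(address)}&{size_kb}kb"

def parse_variable_value(value: str) -> tuple:
    node_id, address, size = value.split("&")
    return node_id, int(address, 16), int(size.rstrip("kb"))

v = build_variable_value("node2", 0xFFFFFFFF, 20)
print(v)                        # node2&0xffffffff&20kb
print(parse_variable_value(v))  # ('node2', 4294967295, 20)
```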
After determining the data information of the first processing result, each data computation sub-node 101 in the task node 110 may register the data information in the shared memory management service.
For example, if the shared memory management service is a piece of Java code executed in a process, the data information of the first processing result may be packaged into a data packet, and a variable value corresponding to the data packet may be set, so that the variable value is written into the Java code and the data packet is stored in a specified storage location in the task node 110.
Furthermore, after the registration of the data information of the first processing result is completed, each data computation sub-node 101 in the task node 110 may obtain a first reference address of the data information of the first processing result in the shared memory management service, and send the first reference address to each model training sub-node 102 in the task node 110.
The first reference address is used for determining data information of the first processing result in the shared memory management service, that is, the first reference address may be understood as: and the storage address of the data information of the first processing result in the shared memory management service. The first reference address may uniquely identify the data information of the first processing result in the shared memory management service, that is, the first reference address uniquely corresponds to the data information of the first processing result, so that the data information of the first processing result may be identified from the shared memory management service according to the first reference address.
The first reference address may be a variable, which includes: the node identification of the data calculation sub-node that generated the first processing result, the storage address of the data information of the first processing result in the shared memory management service, and the data size of the data information of the first processing result. For example: node1&0xffffffff&5kb.
Optionally, each data computation sub-node 101 in the task node 110 may send the first reference address to each model training sub-node 102 in the task node 110 by using a custom function.
Based on this, in this specific implementation, the reading of the first processing result from the shared memory 103 by each model training child node may include the following step 13:
step 13: each model training sub-node 102 reads the data information from the shared memory management service according to the received first reference address, and reads the first processing result from the shared memory 103 according to the data information.
Because the first reference address uniquely corresponds to the data information of the first processing result, each model training child node 102 in the task node 110 can read the data information of the first processing result from the shared memory management service of the task node 110 according to the received first reference address; furthermore, since the data information of the first processing result uniquely corresponds to the first processing result, each model training child node 102 in the task node 110 can read the first processing result from the shared memory 103 of the task node 110 according to the data information of the first processing result.
Exemplarily, fig. 3 is a schematic diagram of a specific example of this implementation.
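Steps 11 to 13 can be sketched as follows under stated assumptions: the shared memory management service is modeled as an in-process registry, and the first reference address is simply the key it hands back; the class name and all field names are illustrative.

```python
import itertools

class SharedMemoryManagementService:
    """Registers data information and hands out reference addresses."""
    def __init__(self):
        self._entries = {}
        self._ids = itertools.count()

    def register(self, data_info: dict) -> str:
        ref = f"ref-{next(self._ids)}"   # acts as the first reference address
        self._entries[ref] = data_info
        return ref

    def lookup(self, ref: str) -> dict:
        return self._entries[ref]

service = SharedMemoryManagementService()

# steps 11-12: the data calculation sub-node registers the data information
# and obtains the first reference address to send to the training sub-nodes.
first_ref = service.register({
    "node_id": "node2",                       # node identification
    "address": 0xFFFFFFFF,                    # storage address in shared memory
    "variable_value": "node2&0xffffffff&20kb",
})

# step 13: a model training sub-node resolves the reference address first,
# then reads the first processing result from the shared memory itself.
info = service.lookup(first_ref)
print(info["node_id"], hex(info["address"]))
```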
Further, when each data calculation sub-node 101 stores the first processing result in the shared memory 103, it may use different memory protocols to convert the first processing result into different formats for storage.
For example, the Arrow memory protocol may be used to convert the first processing result into the Arrow columnar data format for storage. The Arrow memory protocol can be regarded as a general memory protocol; it provides a standard data exchange format, realizes seamless linking of data between different systems, and saves the time otherwise required for data format conversion when data is transmitted between different systems.
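A minimal sketch of storing a processing result in Arrow's columnar format follows, using the pyarrow library; the column names are illustrative, and copying the resulting buffer into an actual shared memory segment is omitted.

```python
import pyarrow as pa

# Data calculation sub-node: encode the first processing result as Arrow.
batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array([0.1, 0.2, 0.3])],
    names=["user_id", "feature"],
)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
arrow_bytes = sink.getvalue()  # bytes that would go into the shared memory

# Model training sub-node: decode directly, with no format conversion step.
reader = pa.ipc.open_stream(arrow_bytes)
print(reader.read_next_batch().to_pydict())
```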
Based on this, optionally, in a specific implementation manner, the step of storing the first processing result in the shared memory 103 by each data calculation sub-node 101 may include the following step 14:
step 14: each data calculation sub-node stores the first processing result to the shared memory according to a first memory protocol;
wherein the data information includes: a protocol identification of a first memory protocol.
In this specific implementation manner, each data computation sub-node 101 may store the first processing result in the shared memory 103 according to a preset first memory protocol. Moreover, in order to ensure that each model training sub-node 102 can smoothly read the first processing result, each data calculation sub-node 101 needs to notify each model training sub-node 102 of the adopted first memory protocol.
Based on this, the protocol identifier of the adopted first memory protocol may be added to the data information of the first processing result, so that each model training sub-node 102 may obtain the adopted first memory protocol from the data information after obtaining the data information of the first processing result.
Of course, optionally, each data computation sub-node 101 may also directly send a notification about the first memory protocol to each model training sub-node 102, so as to notify each model training sub-node 102 that the first processing result is stored in the shared memory 103 according to a format corresponding to the first memory protocol.
Similar to the above specific implementation manner, in the case that each model training sub-node 102 stores the trained target model in the shared memory 103, each model training sub-node 102 needs to inform each data calculation sub-node 101 of the relevant information about the stored target model.
Based on this, optionally, in a specific implementation manner, each task node 110 is provided with a shared memory management service;
furthermore, in this specific implementation, for each task node 110, each model training sub-node 102 in the task node 110 may be further configured to execute the following steps 21 to 22:
step 21: registering the model information of the target model into a shared memory management service;
step 22: acquiring a second reference address of the model information in the shared memory management service, and sending the second reference address to each data calculation child node;
wherein the model information includes: the node identification of the model training child node, the storage address of the target model in the shared memory and the variable value corresponding to the target model;
accordingly, in this specific implementation manner, the step of obtaining the target model from the shared memory 103 by each data calculation sub-node 101 may include the following step 23:
step 23: each data computation sub-node 101 reads the model information from the shared memory management service according to the received second reference address, and reads the target model from the shared memory 103 according to the model information.
Optionally, in a specific implementation manner, the step of storing the target model in the shared memory 103 by each model training sub-node 102 may include the following step 24:
step 24: each model training child node stores the target model into the shared memory according to the second memory protocol;
wherein the model information includes: a protocol identification of the second memory protocol.
It should be noted that the specific implementation manners of the above steps 21-24 are similar to the specific implementation manners of the above steps 11-14, and are not described herein again.
In many cases, the task node 110 may go down due to a failure; in order to avoid data being repeatedly used after the task node 110 restarts from a downtime, state management may be performed on the data used by each data calculation sub-node 101 and model training sub-node 102 in the task node 110, so as to implement exactly-once semantics for the data.
Based on this, optionally, in a specific implementation manner, each task node 110 is provided with a state management service, and each data calculation sub-node 101 may further be configured to execute the following steps 31 to 33:
step 31: acquiring first to-be-processed real-time data from a message queue of at least one data source;
step 32: inserting a specified identifier into each message queue according to a preset period;
wherein the specified identifier comprises: partition information of the message queue and the offset address of the currently read data in the message queue;
step 33: in the state management service, recording the data state corresponding to each message queue determined based on each specified identifier;
wherein the data state corresponding to each message queue is used for characterizing: the offset address, in the message queue, of the data acquired from the message queue by each data calculation sub-node.
In this specific implementation manner, each data calculation sub-node 101 may acquire the first to-be-processed real-time data from the message queue of at least one data source, and may then, according to the preset period, insert a specified identifier into each message queue from which the first to-be-processed real-time data is read.
The specified identifier inserted in each message queue comprises: partition information of the message queue and the offset address of the currently read data in the message queue. Since a message queue may have multiple partitions, it is necessary to determine the message queue partition in which the currently read data is located and the offset address of the currently read data in that partition. That is, the partition information included in the specified identifier is the information of the message queue partition to which the currently read data belongs, and the offset address included in the specified identifier is the offset address of the currently read data in that partition. For example, the specified identifier may be a Barrier ID.
In this way, according to the specified identifiers inserted in each message queue, each data calculation sub-node 101 can determine, for the message queue it is reading, which data it has already read and which it has not. Furthermore, the data state corresponding to each message queue can be determined from the specified identifiers, and the state management service records the data state corresponding to each message queue determined based on the specified identifiers; the data state is used for characterizing which data in the message queue has been read and which has not. The data state may include: the node identification of the data calculation sub-node, partition information of the message queue, the offset address of the currently read data in the message queue, the Barrier ID, statistical information of the data calculation sub-node, and other related state information; for example, the statistical information may include sum information or count information, and other related state information may include bucket information, etc. This is all reasonable.
Recording the data state corresponding to each message queue means: persisting the determined data state corresponding to each message queue, and marking the specified identifier corresponding to this persistence.
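Steps 31 to 33 can be sketched as follows; modeling the data state as a plain record and "recording" it as a JSON file are assumptions for illustration, as are all names and the file path.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DataState:
    node_id: str     # node identification of the data calculation sub-node
    queue: str       # which message queue this state describes
    partition: int   # partition information of the message queue
    offset: int      # offset address of the currently read data
    barrier_id: int  # the specified identifier (Barrier ID) being marked

def record_data_states(states, path="checkpoint.json"):
    """State management service: persist the data state for each message queue."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([asdict(s) for s in states], f)

record_data_states([
    DataState("node1", "queue_a", partition=0, offset=1024, barrier_id=7),
    DataState("node1", "queue_b", partition=1, offset=980, barrier_id=7),
])
```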
Based on this, optionally, in a specific implementation manner, on the basis of the above steps 31 to 33, after being restarted following a downtime, each data calculation sub-node 101 may further be configured to execute the following step 34:
step 34: acquiring data states corresponding to all message queues recorded in a state management service;
accordingly, in this specific implementation manner, the step in which each data calculation sub-node acquires the first to-be-processed real-time data from the message queue of at least one data source may include the following step 311:
step 311: and each data calculation sub-node acquires the first to-be-processed real-time data from each message queue from the offset address represented by each acquired data state.
In this specific implementation manner, after each data calculation sub-node 101 is restarted following a downtime, in order to implement exactly-once semantics for the data, it may first acquire the data state corresponding to each message queue recorded in the state management service, so as to determine the offset address, in each partition of each message queue, of the data it has already acquired from that message queue, that is, to determine which data in each message queue has been read and which has not.
In this way, for each message queue, in order to avoid repeatedly reading data in the message queue, each data calculation sub-node 101 may acquire the first to-be-processed real-time data from the message queue partition characterized by the data state corresponding to the message queue, starting from the offset address characterized by that data state.
That is to say, each data calculation sub-node 101 may first determine, according to the data state corresponding to each message queue, the message queue partition in which the data last read before the downtime is located and the offset address of that data in the partition; thus, after the downtime restart, each data calculation sub-node 101 may acquire the first to-be-processed real-time data in the determined message queue partition, starting from the determined offset address.
In this way, for each message queue, each data calculation sub-node 101 may sequentially acquire, starting from the first piece of unread data in the message queue, the data left unread before the downtime restart, so as to acquire the first to-be-processed real-time data.
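A sketch of this recovery path (step 34 plus step 311) under the same assumptions as above: the persisted states are reloaded and each queue is resumed from the recorded offset; read_from_queue stands in for a real queue client and is purely hypothetical.

```python
import json

def recover_and_resume(state_path, read_from_queue):
    """step 34: reload recorded states; step 311: resume each queue from them."""
    with open(state_path, encoding="utf-8") as f:
        states = json.load(f)
    for state in states:
        read_from_queue(queue=state["queue"],
                        partition=state["partition"],
                        start_offset=state["offset"])  # the recorded offset

def print_reader(queue, partition, start_offset):
    # Hypothetical stand-in: a real client would fetch the unread data here.
    print(f"resume {queue}[{partition}] at offset {start_offset}")

recover_and_resume("checkpoint.json", print_reader)
```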
Further, when the number of the data sources is multiple, each data computation sub-node 101 may obtain the first to-be-processed real-time data from the message queues of the multiple data sources.
In this case, the message queues need to be stream-aligned; that is, the message queues are associated in time, so they need to be aligned based on the inserted specified identifiers: the specified identifiers inserted in the message queues must all reach the offset position of the data corresponding to the same time point in their respective message queues.
Based on this, optionally, in a specific implementation manner, before executing the specified processing operation on the acquired first to-be-processed real-time data, each data calculation sub-node 101 may further be configured to perform the following step 35:
step 35: determining whether the specified identifiers inserted in the message queues of the data sources are aligned; and if they are aligned, executing the specified processing operation on the acquired first to-be-processed real-time data.
In this specific implementation manner, after acquiring the first to-be-processed real-time data, each data calculation sub-node 101 may first determine whether the specified identifiers inserted in the message queue of each data source are aligned.
When the determination result is that they are aligned, it indicates that the message queues are aligned based on the inserted specified identifiers, so each data calculation sub-node 101 can execute the specified processing operation on the acquired first to-be-processed real-time data.
When the determination result is that they are not aligned, it indicates that the message queues are not aligned based on the inserted specified identifiers, so each data calculation sub-node 101 does not execute the specified processing operation on the acquired first to-be-processed real-time data, but waits until the specified identifiers inserted in the message queues of all data sources are aligned, and then executes the specified processing operation.
Aligning the specified identifiers inserted in the message queues of the data sources means: for a message queue whose inserted specified identifier corresponds to a later time point, reading data from that queue may be paused, while reading continues from the message queues whose inserted specified identifiers correspond to earlier time points, until the time points corresponding to the specified identifiers inserted in those queues are the same as the later one; the specified processing operation may then be executed on the acquired first to-be-processed real-time data.
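The alignment check of step 35 can be sketched as follows, assuming each queue exposes the latest specified identifier (Barrier ID) it has reached; all names are illustrative.

```python
def barriers_aligned(latest_barrier_per_queue: dict) -> bool:
    """True when every message queue has reached the same Barrier ID."""
    return len(set(latest_barrier_per_queue.values())) == 1

def maybe_process(latest_barrier_per_queue, batch, specified_operation):
    if barriers_aligned(latest_barrier_per_queue):
        return specified_operation(batch)  # aligned: process the data
    return None  # not aligned: keep reading the slower queue and wait

print(maybe_process({"queue_a": 7, "queue_b": 7}, [1, 2], sum))  # 3
print(maybe_process({"queue_a": 7, "queue_b": 6}, [1, 2], sum))  # None
```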
For example, as shown in fig. 4, the calculation worker is a data calculation sub-node, the training worker is a model training sub-node, and the two consumption data queues are the message queues corresponding to the two data sources from which the calculation worker acquires the first to-be-processed real-time data; the state management is the state management service provided for one task node, and the feature engineering is the calculation worker executing the specified processing operation on the acquired first to-be-processed real-time data.
Therefore, when the calculation worker acquires the first to-be-processed real-time data from the two consumption data queues and executes the specified processing operation on it, the calculation worker needs to wait until the Barrier IDs in the two consumption data queues reach the offset positions of the data corresponding to the same time point, and then persist the data states corresponding to the two consumption data queues. Here, persisting means: the data states corresponding to the two consumption data queues are stored as files, which may be stored in a distributed storage system.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, the computer instructions cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that incorporates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between the entities or actions. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a correlated manner; for identical or similar parts, reference may be made between the embodiments, and each embodiment focuses on its differences from the others.
The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A data processing system, the system comprising at least one task node, each task node comprising: at least one data calculation sub-node, at least one model training sub-node, and a shared memory; each data calculation sub-node is a sub-node in a real-time data computing system, and each model training sub-node is a sub-node in a model training system;
each data calculation sub-node is configured to perform a specified processing operation on the acquired first to-be-processed real-time data to obtain a first processing result, and to store the first processing result in the shared memory;
and each model training sub-node is configured to read the first processing result from the shared memory and perform model training using the first processing result to obtain a trained target model.
2. The system of claim 1,
each model training sub-node is further configured to store the target model in the shared memory;
each data calculation sub-node is further configured to acquire the target model from the shared memory, process the acquired second to-be-processed real-time data using the target model to obtain a second processing result, and store the second processing result in the shared memory.
3. The system of claim 1, wherein each task node is provided with a shared memory management service; each data calculation sub-node is further configured to:
register the data information of the first processing result in the shared memory management service;
acquire a first reference address of the data information in the shared memory management service, and send the first reference address to each model training sub-node;
wherein the data information comprises: the node identification of the data calculation sub-node, the storage address of the first processing result in the shared memory, and the variable value corresponding to the first processing result;
each model training sub-node reading the first processing result from the shared memory includes:
each model training sub-node reads the data information from the shared memory management service according to the received first reference address, and reads the first processing result from the shared memory according to the data information;
each data calculation sub-node storing the first processing result in the shared memory includes:
each data calculation sub-node stores the first processing result in the shared memory according to a first memory protocol; wherein the data information further comprises: a protocol identification of the first memory protocol.
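For illustration only, a rough Python sketch of this claim-3 flow, with a plain dict standing in for the shared memory management service; the helper names, reference-address format, and "pickle-v1" protocol identifier are assumptions (POSIX shared-memory semantics assumed):

```python
# Sketch: register a processing result in shared memory, then read it back
# via the registered data information. The registry dict is a stand-in for
# the shared memory management service.
import pickle
from multiprocessing import shared_memory

registry = {}  # stand-in for the shared memory management service


def register_result(node_id: str, result, protocol: str = "pickle-v1") -> str:
    """Serialize a result into shared memory and register its data information."""
    payload = pickle.dumps(result)
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    ref = f"{node_id}/{shm.name}"      # "first reference address" (assumed form)
    registry[ref] = {
        "node_id": node_id,            # node identification
        "address": shm.name,           # storage address in shared memory
        "size": len(payload),          # extent of the stored variable value
        "protocol": protocol,          # protocol identification
        "_handle": shm,                # keep the segment alive in this sketch
    }
    return ref


def read_result(ref: str):
    """Resolve the reference address, honor the memory protocol, deserialize."""
    info = registry[ref]
    shm = shared_memory.SharedMemory(name=info["address"])
    try:
        if info["protocol"] != "pickle-v1":
            raise ValueError("unknown memory protocol")
        return pickle.loads(bytes(shm.buf[: info["size"]]))
    finally:
        shm.close()


ref = register_result("calc-node-0", {"features": [0.1, 0.2, 0.3]})
print(read_result(ref))  # -> {'features': [0.1, 0.2, 0.3]}
```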
4. The system of claim 2, wherein each task node is provided with a shared memory management service; each model training sub-node is further configured to:
register the model information of the target model in the shared memory management service;
acquire a second reference address of the model information in the shared memory management service, and send the second reference address to each data calculation sub-node;
wherein the model information comprises: the node identification of the model training sub-node, the storage address of the target model in the shared memory, and the variable value corresponding to the target model;
each data calculation sub-node acquiring the target model from the shared memory includes:
each data calculation sub-node reads the model information from the shared memory management service according to the received second reference address, and reads the target model from the shared memory according to the model information;
each model training sub-node storing the target model in the shared memory includes:
each model training sub-node stores the target model in the shared memory according to a second memory protocol; wherein the model information further comprises: a protocol identification of the second memory protocol.
5. The system of claim 1 or 2, wherein the real-time data computing system is a distributed real-time data computing system, and the model training system is a distributed model training system; each task node is provided with a state management service, and each data calculation sub-node is further configured to:
acquire the first to-be-processed real-time data from a message queue of at least one data source;
insert a specified identifier into each message queue according to a preset period; wherein the specified identifier comprises: partition information of the message queue and the offset address, in the message queue, of the currently read data;
record, in the state management service, the data states corresponding to the message queues as determined based on the specified identifiers; wherein the data state corresponding to each message queue characterizes: the offset address, in the message queue, of the data acquired from the message queue by each data calculation sub-node.
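A small illustrative sketch of claim 5's periodic identifier insertion; the queue-client methods and state service below are hypothetical stand-ins, not interfaces defined by the claims:

```python
# Sketch: periodically insert a specified identifier (barrier) into each
# message queue and record the resulting data state in a state service.
# The queue objects' partition()/current_offset()/append_control_message()
# methods and .name attribute are assumed stand-ins.
import time
from dataclasses import dataclass


@dataclass
class SpecifiedIdentifier:
    partition: str   # partition information of the message queue
    offset: int      # offset address of the currently read data


class StateService:
    def __init__(self):
        self.states = {}

    def record(self, queue_name: str, ident: SpecifiedIdentifier) -> None:
        self.states[queue_name] = ident  # latest data state per message queue


def insert_identifiers(queues, state_service, period_s=5.0, rounds=None):
    n = 0
    while rounds is None or n < rounds:
        for q in queues:
            ident = SpecifiedIdentifier(q.partition(), q.current_offset())
            q.append_control_message(ident)       # the barrier travels in-band
            state_service.record(q.name, ident)   # record the data state
        time.sleep(period_s)                      # preset period
        n += 1
```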
6. The system of claim 5, wherein there are a plurality of data sources; before performing the specified processing operation on the acquired first to-be-processed real-time data, each data calculation sub-node is further configured to:
determine whether the specified identifiers inserted in the message queues of the data sources are aligned;
and, if they are aligned, perform the specified processing operation on the acquired first to-be-processed real-time data.
7. The system of claim 5, wherein, after going down and being restarted, each data calculation sub-node is further configured to:
acquire the data states corresponding to the message queues recorded in the state management service;
each data calculation sub-node acquiring the first to-be-processed real-time data from the message queue of at least one data source includes:
each data calculation sub-node acquires the first to-be-processed real-time data from each message queue, starting from the offset address characterized by each acquired data state.
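And a matching sketch of the claim-7 recovery path, under the same hypothetical stand-ins as above: on restart, the sub-node fetches the last recorded offsets and resumes reading from there (seek()/read() are assumed queue-client methods):

```python
# Sketch: after a crash-and-restart, resume each message queue from the
# offset recorded in the state management service.
def recover_and_resume(queues, state_service):
    for q in queues:
        state = state_service.states.get(q.name)
        if state is not None:
            q.seek(state.offset)       # jump to the last recorded offset
        for record in q.read():        # continue consuming from there
            yield q.name, record
```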
8. A model training method, applied to any target model training sub-node in any target task node of a data processing system; wherein the data processing system comprises at least one task node, and each task node comprises: at least one data calculation sub-node, at least one model training sub-node, and a shared memory, wherein each data calculation sub-node is a sub-node in a real-time data computing system, and each model training sub-node is a sub-node in a model training system; the method comprises the following steps:
reading a first processing result from a target shared memory included in the target task node to which the target model training sub-node belongs; wherein the first processing result is obtained by each first data calculation sub-node included in the target task node performing a specified processing operation on acquired first to-be-processed real-time data, and is stored in the target shared memory by that data calculation sub-node;
and performing model training by using the read first processing result to obtain a trained target model.
9. The method of claim 8, wherein the real-time data computing system is a distributed real-time data computing system, and the model training system is a distributed model training system; the method further comprises:
and storing the target model into the target shared memory, so that each first data calculation sub-node included in the target task node acquires the target model from the target shared memory, processes the acquired second to-be-processed real-time data by using the target model to obtain a second processing result, and stores the second processing result into the target shared memory.
10. The method of claim 9, wherein each task node is provided with a shared memory management service, and the method further comprises:
registering the model information of the target model in a target shared memory management service provided for the target task node;
acquiring a second reference address of the model information in the target shared memory management service, and sending the second reference address to each first data calculation sub-node included in the target task node; wherein the model information comprises: the node identification of the target model training sub-node, the storage address of the target model in the target shared memory, and the variable value corresponding to the target model;
the storing the target model in the target shared memory, so that each first data calculation sub-node included in the target task node acquires the target model from the target shared memory, includes:
storing the target model in the target shared memory according to a second memory protocol, so that each first data calculation sub-node included in the target task node reads the model information from the target shared memory management service according to the received second reference address, and reads the target model from the target shared memory according to the model information; wherein the model information further comprises: a protocol identification of the second memory protocol.
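Finally, tying the method claims together, a toy loop for the training sub-node side, reusing the hypothetical register_result/read_result helpers from the claim-3 sketch above (the "training" step is a placeholder, not the disclosed training procedure):

```python
# Toy loop for a model training sub-node: read features produced by the data
# calculation sub-nodes from shared memory, "train", and publish the model
# back through shared memory for the calculation sub-nodes to pick up.
def training_sub_node_step(feature_refs):
    # 1. Read first processing results from the target shared memory.
    batches = [read_result(ref) for ref in feature_refs]

    # 2. Train on them (placeholder: average the feature values).
    flat = [x for b in batches for x in b["features"]]
    target_model = {"weights": [sum(flat) / len(flat)]}

    # 3. Store the target model and hand its reference to calculation nodes.
    return register_result("train-node-0", target_model, protocol="pickle-v1")


feature_ref = register_result("calc-node-0", {"features": [0.5, 1.5]})
model_ref = training_sub_node_step([feature_ref])
print(read_result(model_ref))   # a calculation sub-node's view of the model
```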
CN202111332051.0A 2021-11-11 2021-11-11 Data processing system and model training method Pending CN114064312A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111332051.0A CN114064312A (en) 2021-11-11 2021-11-11 Data processing system and model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111332051.0A CN114064312A (en) 2021-11-11 2021-11-11 Data processing system and model training method

Publications (1)

Publication Number Publication Date
CN114064312A true CN114064312A (en) 2022-02-18

Family

ID=80274970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111332051.0A Pending CN114064312A (en) 2021-11-11 2021-11-11 Data processing system and model training method

Country Status (1)

Country Link
CN (1) CN114064312A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023160484A1 (en) * 2022-02-22 2023-08-31 华为技术有限公司 Image processing method, related apparatus and system

Similar Documents

Publication Publication Date Title
US11411897B2 (en) Communication method and communication apparatus for message queue telemetry transport
CN109194617B (en) Automatic parsing and packaging method and device for XML (extensive markup language) message
JP4509916B2 (en) SNMP-based network management apparatus and method
CN107104824B (en) Network topology determination method and device
JP2023525393A (en) Method and apparatus for updating gateway resources and IOT control platform
CN110287696B (en) Detection method, device and equipment for rebound shell process
CN103780679A (en) Long time delay remote invocation method based on HTTP protocol
CN110912782B (en) Data acquisition method, device and storage medium
CN112860592B (en) Data caching method and device based on linked list, electronic equipment and storage medium
US20230042747A1 (en) Message Processing Method and Device, Storage Medium, and Electronic Device
CN111400288A (en) Data quality inspection method and system
CN113704790A (en) Abnormal log information summarizing method and computer equipment
CN112486915B (en) Data storage method and device
CN109547288B (en) Programmable flow measuring method for protocol independent forwarding network
CN114064312A (en) Data processing system and model training method
CN113079198A (en) Method and device for converting cloud platform interface protocol
CN111490906A (en) Method and device for analyzing gateway equipment policy and readable storage medium
CN110855459B (en) Network configuration method, device and system
CN114025027A (en) Data transmission script running method and device, storage medium and electronic device
WO2020029405A1 (en) Data transmission method and device
CN113792008A (en) Method and device for acquiring network topology structure, electronic equipment and storage medium
CN115309907B (en) Alarm log association method and device
CN115174472B (en) Message forwarding processing method and related device
CN102752144A (en) Method and device for processing topology change in soft restart in-service software upgrade process
CN107493308B (en) Method and device for sending message and distributed equipment cluster system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination