CN115599524A - Data lake system based on cooperative scheduling processing of streaming data and batch data - Google Patents

Data lake system based on cooperative scheduling processing of streaming data and batch data

Info

Publication number
CN115599524A
CN115599524A (application CN202211329376.8A)
Authority
CN
China
Prior art keywords
data
task
batch
downstream
current node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211329376.8A
Other languages
Chinese (zh)
Other versions
CN115599524B (en)
Inventor
彭龙
杨亮
杜宏博
王嘉岩
葛天恒
徐天敕
葛晋鹏
冯国清
薛行
崔琳
许童
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China North Computer Application Technology Research Institute
Original Assignee
China North Computer Application Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China North Computer Application Technology Research Institute filed Critical China North Computer Application Technology Research Institute
Priority to CN202211329376.8A priority Critical patent/CN115599524B/en
Publication of CN115599524A publication Critical patent/CN115599524A/en
Application granted granted Critical
Publication of CN115599524B publication Critical patent/CN115599524B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a data lake system based on cooperative scheduling processing of stream data and batch data, belonging to the technical field of data processing. It solves the problem that the prior art cannot mix and arrange stream data processing tasks and batch data processing tasks when constructing a data lake or processing data in the data lake, resulting in low efficiency. The data lake system of the present invention comprises a centralized storage module, a computing engine module and a data management module. The centralized storage module is used for storing data of each service data source in a classified manner; the data management module is used for arranging data processing tasks and scheduling the data processing tasks of each node based on a preset data processing task cooperative scheduling method; the computing engine module is used for processing the data in the centralized storage module through different computing engines based on the data processing task requirements, and for pushing the processed data or storing it in the centralized storage module based on the data processing task requirements.

Description

Data lake system based on cooperative scheduling processing of streaming data and batch data
Technical Field
The invention relates to the technical field of data processing, in particular to a data lake system based on cooperative scheduling processing of stream data and batch data.
Background
A data lake is a centralized repository that stores large-scale raw data sets in their native format, allowing all structured and unstructured data to be stored at any scale. Big data analytics and data lakes are evolving towards more types of real-time intelligent services that can support real-time decisions. A data lake makes it possible to use more data from more sources in a shorter time, enabling users to collaboratively process and analyze data in different ways and make better and faster decisions. When constructing a data lake, data processing is the key intermediate link between the data source end and the data lake server end: it must support rapid access, processing and output of both large-scale offline batch data and real-time stream data, and should allow streaming and batch data processing tasks to be mixed and arranged in a visual manner that is transparent to the user, improving data exchange efficiency. The challenges are therefore how to reduce the complexity of data processing task orchestration and how to mix stream data processing and batch data processing in one orchestration. In current data processing technology, the mixed arrangement of stream data processing tasks and batch data processing tasks is not flexible enough: complex scenarios in which stream data processing tasks and batch data processing tasks are repeatedly mixed and arranged cannot be handled, and during visual development of data processing tasks the requirement that users need not distinguish between stream and batch task types cannot be met.
Disclosure of Invention
In view of the foregoing analysis, the present invention aims to provide a data lake system based on cooperative scheduling processing of stream data and batch data, so as to solve the problems in the prior art that data scheduling processing based on a data lake is inefficient, that mixed arrangement of streaming and batch data processing tasks cannot be achieved, and that users cannot operate without being aware of whether a stream data processing task or a batch data processing task is selected.
The purpose of the invention is mainly realized by the following technical scheme:
in one aspect, the present invention provides a data lake system based on cooperative scheduling processing of streaming data and batch data, where the data lake system is constructed based on data of multiple service data sources, and the system includes a centralized storage module, a computing engine module and a data management module; wherein:
the centralized storage module is used for storing data of each service data source in a classified manner;
the data management module comprises a task flow management unit, the task flow management unit is used for arranging data processing tasks based on application requirements, constructing a DAG directed acyclic graph, and scheduling the data processing tasks of each node in the directed acyclic graph by using a corresponding computing engine in the computing engine module based on a preset data processing task cooperative scheduling method;
the computing engine module is used for processing the data in the centralized storage module through different computing engines based on the requirements of data processing tasks and pushing or storing the processed data in the centralized storage module based on the requirements of the data processing tasks.
Further, the data management module further comprises an access control unit and a data access unit;
the access control unit is used for managing user group authorities, and the authorities comprise a storage authority, a management authority and a use authority;
the data access unit is used for managing data accessed by an external service data source, and marking and classifying the accessed data and constructing a data directory;
further, the application requirements comprise data access requirements for external service data sources and application requirements for processing data in the centralized storage module by users;
the service data source comprises a service database and data collected by real-time data collection equipment.
Further, the data lake system is constructed based on an S3 distributed object storage architecture, and the stored data comprises raw data, processing process data and metadata.
Further, the preset data processing task cooperative scheduling method includes the following steps:
determining the task type, task state and data state of the data processing task of each node in the directed acyclic graph;
acquiring one or more downstream tasks of a current node, and adapting a data structure required by the downstream tasks based on the task types of the downstream tasks;
and judging whether to start running the downstream task or not based on the task state and the data state of the current node and the task type of the downstream task of the current node so as to carry out cooperative control scheduling on the data processing task.
Further, the task types of the data processing tasks comprise stream data processing tasks and batch data processing tasks;
the task states include: not started, running, completed, failed, and terminated;
for the stream data processing task, after reading a piece of data, setting the data state as: a piece of data has been sent; and after all data are read, setting the data state as follows: all data transmission is completed;
for a batch data processing task, after reading a batch of data, setting the data state as: a batch of data is sent; and after all data are read, setting the data state as follows: all data transmission is complete.
Further, the adapting a data structure required by the downstream task based on the task type of the downstream task includes:
if the downstream task is a batch data processing task, generating data output by the current node into a batch data structure record and outputting the batch data structure record to the downstream task; the batch data structure is a unified package of batch data, and comprises: recordSchema, datafile, and fieldDelimiter;
if the downstream task is a stream data processing task, generating a stream data structure record of the data output by the current node and outputting the stream data structure record to the downstream task; the stream data structure is a unified encapsulation of stream data, comprising: recordSchema, values, and recordBytes.
Further, the generating the data output by the current node into a batch data structure record and outputting the batch data structure record to a downstream task includes:
if the task of the current node is a batch data processing task, directly outputting the data processed by the current node to a downstream task;
and if the current node is a streaming data processing task, creating a batch data structure file, sequentially adding streaming data output by the current node into the batch data structure file, and obtaining a batch data structure record according to whether the output of the streaming data processing task of the current node is finished or a preset threshold value and outputting the batch data structure record to a downstream task.
Further, the generating, by the data output by the current node, a stream data structure record and outputting the stream data structure record to a downstream task includes:
if the task of the current node is a streaming data processing task, directly outputting the data processed by the current node to a downstream task;
and if the task of the current node is a batch data processing task, reading file contents contained in the batch data structure record output by the current node according to lines, converting each piece of data read according to lines into a stream data structure, and outputting the stream data structure record to a downstream task.
Further, the determining whether to start running the downstream task includes:
if the tasks of the current node are all in-progress or completed task states, and: the task types of the task of the current node and the downstream task are both stream data processing tasks, and the data state is that one piece of data is sent or all data is sent; or the task of the current node is a streaming data processing task, the downstream task is a batch processing task, and the data state is that all data transmission is completed; or the task of the current node is a batch processing task, and the data state is that all data reading is completed; executing the downstream task;
otherwise, continuing to execute the task of the current node.
The beneficial effects of the technical scheme are as follows:
the data lake system automatically adapts data structures required by different data processing task types when a stream data processing task and a batch data processing task are mixed and arranged by a method for cooperatively controlling data processing task scheduling based on a data state and a task state, and determines to wait or synchronously run a downstream data processing task according to the data state and the task type; when the tasks are scheduled, the technical problems of data exchange, data structure adaptation and the like among the tasks do not need to be concerned, and the data access and output efficiency is improved. The data lake system supports cleaning, conversion and loading of all-form data such as real-time and offline structured data, unstructured data and the like, can realize complex data exchange among a plurality of service systems and data fusion and sharing of cross-system, cross-department, cross-organization and cross-center, solves the problems of usability, efficiency and the like of the data lake system, and meets the complex scene that a stream data processing task and a batch data processing task are mixed and arranged.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The drawings, in which like reference numerals refer to like parts throughout, are for the purpose of illustrating particular embodiments only and are not to be considered limiting of the invention.
FIG. 1 is a block diagram of a data lake system according to an embodiment of the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
One embodiment of the present invention discloses a data lake system based on cooperative scheduling processing of stream data and batch data. As shown in fig. 1, the data lake system of this embodiment is constructed from the data of a plurality of service data sources using an S3 distributed object storage architecture, and the system includes a centralized storage module, a computing engine module and a data management module; wherein:
the centralized storage module is used for storing the data of each service data source in a classified manner; the data stored therein includes raw data, process data and metadata.
Preferably, the service data source of this embodiment may be a database set up by a service department according to its service requirements, that is, the service database of a service department in an enterprise, containing the set of all service data of that department. This embodiment supports real-time and offline data in all forms, such as structured and unstructured data. Examples include structured data tables of the personnel department, such as employee address, employee information and employee attendance tables; structured data of the finance department, such as various financial analysis data; and material management data of the production department, including unstructured data such as material specifications and quality reports. The service data may also include stream data collected in real time by monitoring equipment and the like.
The data management module comprises a task flow management unit, the task flow management unit is used for arranging data processing tasks based on application requirements of users, constructing a DAG directed acyclic graph, and scheduling the data processing tasks of each node in the directed acyclic graph by using a corresponding computing engine in the computing engine module based on a preset data processing task cooperative scheduling method;
specifically, the task flow management unit may receive an application requirement of the user, and preferably, the application requirement may include a data access requirement for an external service data source, which is used to receive data of the external service data source, and store the data into the data lake after corresponding processing. The application requirements also comprise application requirements of processing the data in the centralized storage module by a user, the user can perform processing operations such as analysis, calculation, machine learning and the like by using the data in the data lake according to the application requirements, and the processed data can be stored in the data lake or pushed to an external system according to the requirements.
After receiving the application requirement of the user, the task flow management unit can utilize the existing data flow arrangement tool (such as DataFlow) to arrange the data processing tasks according to the application requirement, construct a DAG directed acyclic graph by taking the data processing tasks as nodes and taking the task flow direction as edges, and schedule the data processing tasks of each node by a data processing task cooperative scheduling method.
As a specific embodiment, for the application requirement of elevator fault risk real-time prediction, data needs to be acquired in real time through an elevator detection sensor, the sensor data is transmitted through an MQTT protocol, and the running state parameters of elevator components are monitored through model prediction; the data processing task flow arrangement is carried out through the task flow management unit, and the data processing task flow arrangement method comprises the following steps: tasks such as data reading, data conversion, model prediction, distributed message queue writing, distributed object storage writing and the like; firstly, reading data from a sensor in real time by taking an MQTT data reading task as a starting node; sending the read data to a data conversion task node, calling a data conversion calculation engine in a calculation engine module, and converting the data format into a parameter format required by an elevator fault risk prediction machine learning model; inputting the converted data into a model prediction node, and predicting through an elevator fault risk prediction machine learning model in a calculation engine module; writing the prediction result into a distributed message queue through a 'distributed message queue write-in' node for the upstream service to apply and consume the prediction result and make a real-time decision; and finally, writing the prediction result into a centralized storage module of a data lake through a distributed object storage writing node to serve as partial data of the data lake, persistently storing the prediction result for subsequent off-line analysis, serving as a training data set for optimizing an elevator fault risk prediction machine learning model, and pushing the training data set to a terminal system of a user according to needs.
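As an illustration of how such a task flow could be represented, the following sketch builds the elevator example as a small in-memory directed acyclic graph. The TaskNode class and all node names are hypothetical and only mirror the tasks named in the embodiment above.

```python
# A minimal sketch, assuming a simple in-memory DAG representation; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    name: str
    task_type: str                       # "stream" or "batch"
    downstream: list = field(default_factory=list)

    def to(self, other: "TaskNode") -> "TaskNode":
        """Add a directed edge from this node to a downstream node and return it."""
        self.downstream.append(other)
        return other

# Nodes of the elevator fault-risk prediction flow (all stream tasks in this example).
read_mqtt   = TaskNode("mqtt_data_read", "stream")
convert     = TaskNode("data_conversion", "stream")
predict     = TaskNode("model_prediction", "stream")
write_queue = TaskNode("write_message_queue", "stream")
write_s3    = TaskNode("write_object_storage", "stream")

# Edges follow the task flow direction described in the embodiment.
read_mqtt.to(convert).to(predict)
predict.to(write_queue)
predict.to(write_s3)
```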
Preferably, the preset data processing task cooperative scheduling method includes the following steps:
step S1: determining the task type, task state and data state of the data processing task of each node in the directed acyclic graph; wherein the content of the first and second substances,
the task types of the data processing tasks comprise stream data processing tasks and batch data processing tasks;
the task states include: not started, running, completed, failed, and terminated; specifically, a data processing task starts to run, and the task state is converted into running from the beginning; if all the service logics of the data processing task are normally executed, the task state is converted into the completed state; the task logic execution is abnormal, and the task state is converted from running to failure; and (4) externally triggering the flow to terminate operation, and switching the task state from running to termination. And when the task state is 'failure' or 'termination', finishing task flow scheduling, and finally, setting the task state as 'failure'.
Further, for the stream data processing task, after reading a piece of data, the data state is set as: a piece of data has been sent; and after all data are read, setting the data state as follows: all data transmission is completed;
for a batch data processing task, after reading a batch of data, setting the data state as: a batch of data has been sent; and after all data are read, setting the data state as follows: all data transmission is completed.
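For illustration, the task types, task states and data states listed above could be modeled as simple enumerations. This is only a sketch; the member names are assumptions, only the state semantics come from the description.

```python
# A minimal sketch of the task/data states named above, as Python enums.
from enum import Enum, auto

class TaskType(Enum):
    STREAM = auto()
    BATCH = auto()

class TaskState(Enum):
    NOT_STARTED = auto()
    RUNNING = auto()
    COMPLETED = auto()
    FAILED = auto()
    TERMINATED = auto()

class DataState(Enum):
    ONE_RECORD_SENT = auto()   # stream task: one piece of data has been sent
    ONE_BATCH_SENT = auto()    # batch task: one batch of data has been sent
    ALL_DATA_SENT = auto()     # all data transmission is completed
```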
Step S2: acquiring one or more downstream tasks of a current node, and adapting a data structure required by the downstream tasks based on the task types of the downstream tasks;
in particular, the data structure includes a batch data structure and a stream data structure, wherein,
the batch data structure is a unified encapsulation of batch data, comprising: RecordSchema, Datafile, and fieldDelimiter; wherein:
the RecordSchema is a data format describing a "batch data structure" and a "stream data structure", and contains at least one field, each field including: fieldName: a field name; defaultValue: a field default value; dataType: data types of the fields, including character strings, numerical values, dates; isprimary key: whether the primary key.
The Datafile is a text type data file, each line of data is a record, and one line of data is divided according to fieldDelimiter;
fieldDelimiter is a row data separator such as comma, middle vertical line, etc.
The stream data structure is a unified encapsulation of stream data, comprising: recordSchema, values, and recordBytes, wherein,
the RecordSchema is the same as the RecordSchema in the batch data structure;
Values is an array constructed according to the number and types of the fields, storing the value of each field;
recordBytes is the number of bytes of the record and is used for counting the size of the data stream.
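As a sketch of the two encapsulations described above, the batch and stream data structures could be modeled as follows. The attribute names follow the description; the class layout, types and defaults are assumptions.

```python
# A minimal sketch of the batch/stream record encapsulations described above.
from dataclasses import dataclass
from typing import Any, List

@dataclass
class SchemaField:
    fieldName: str
    defaultValue: Any = None
    dataType: str = "string"      # "string", "number", or "date"
    isPrimaryKey: bool = False

@dataclass
class RecordSchema:
    fields: List[SchemaField]

@dataclass
class BatchRecord:                # unified encapsulation of batch data
    recordSchema: RecordSchema
    datafile: str                 # path to a text file, one record per line
    fieldDelimiter: str = ","     # field separator within a line, e.g. "," or "|"

@dataclass
class StreamRecord:               # unified encapsulation of one stream record
    recordSchema: RecordSchema
    values: List[Any]             # one value per schema field
    recordBytes: int = 0          # size of the record, used for stream-size statistics
```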
Preferably, if the downstream task is a batch data processing task, the data output by the current node is generated into a batch data structure record and then output to the downstream task. Specifically: if the task of the current node is a batch data processing task, the data processed by the current node is output directly to the downstream task; if the task of the current node is a streaming data processing task, a batch data structure file is created, the stream data output by the current node is appended to the batch data structure file in sequence, and a batch data structure record is obtained and output to the downstream task according to whether the streaming data processing task of the current node has finished its output or a preset threshold has been reached. More specifically, the RecordSchema attribute of the stream data structure is set as the RecordSchema of the batch data structure; the Values content of the stream data structure is spliced into a character string using the fieldDelimiter value as separator, the resulting string is taken as one line of data, and the line is added to the batch data structure file, thereby obtaining the batch data structure record.
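A minimal sketch of this stream-to-batch adaptation, reusing the record classes from the earlier sketch, might look as follows; the temporary file path and flush threshold are assumptions introduced for illustration.

```python
# A minimal sketch of the stream-to-batch adaptation; names follow the earlier sketch.
def stream_to_batch(stream_records, schema: RecordSchema, delimiter: str = ",",
                    flush_threshold: int = 10000,
                    datafile: str = "/tmp/batch_0001.txt") -> BatchRecord:
    """Accumulate stream records into a batch data file and return a BatchRecord."""
    written = 0
    with open(datafile, "w", encoding="utf-8") as f:
        for rec in stream_records:                       # rec: StreamRecord
            # Splice each field value into one delimited line (one record per line).
            line = delimiter.join("" if v is None else str(v) for v in rec.values)
            f.write(line + "\n")
            written += 1
            if written >= flush_threshold:               # preset threshold reached
                break
    # The batch record carries the same schema as the upstream stream records.
    return BatchRecord(recordSchema=schema, datafile=datafile, fieldDelimiter=delimiter)
```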
If the downstream task is a stream data processing task, the data output by the current node is generated into stream data structure records and output to the downstream task. Specifically: if the task of the current node is a streaming data processing task, the data processed by the current node is output directly to the downstream task; if the task of the current node is a batch data processing task, the file content contained in the batch data structure record output by the current node is read line by line, and each line of data is converted into a stream data structure, specifically: the RecordSchema attribute of the batch data structure is set directly as the RecordSchema of the stream data structure; each line of the file content is read as a character string and split into a string array according to the fieldDelimiter separator, the array content is set as the value of each corresponding field on the Values attribute, and the size of the line is obtained and set on the recordBytes attribute, thereby obtaining the stream data structure record.
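Correspondingly, a minimal sketch of the batch-to-stream adaptation could read the batch data file line by line and yield stream records; names again follow the earlier sketch and are assumptions.

```python
# A minimal sketch of the batch-to-stream adaptation; names follow the earlier sketch.
def batch_to_stream(batch: BatchRecord):
    """Yield one StreamRecord per line of the batch data file."""
    with open(batch.datafile, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            # Split the line into field values using the batch fieldDelimiter.
            values = line.split(batch.fieldDelimiter)
            yield StreamRecord(
                recordSchema=batch.recordSchema,   # the schema is reused unchanged
                values=values,
                recordBytes=len(line.encode("utf-8")),
            )
```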
And step S3: and judging whether to start running the downstream task or not based on the task state and the data state of the current node and the task type of the downstream task of the current node so as to carry out cooperative control scheduling on the data processing task.
Specifically, if the tasks of the current node are all in-execution or completed task states, and: the task types of the task of the current node and the downstream task are both stream data processing tasks, and the data state is that one piece of data is sent or all data is sent; or the task of the current node is a streaming data processing task, the downstream task is a batch processing task, and the data state is that all data transmission is completed; or the task of the current node is a batch processing task, and the data state is that all data transmission is finished; executing the downstream task;
otherwise, continuing to execute the task of the current node.
More specifically, when judging whether to start running downstream tasks, one or more downstream tasks of the current node are obtained as the data processing tasks to be run; each downstream task is traversed in turn and its one or more upstream tasks are obtained; while any upstream task is still in the 'not started' state, the loop keeps polling the upstream tasks.
If the task states of the upstream tasks are all 'running' or 'completed', then:
if an upstream task is a streaming data processing task, the task to be run is a batch data processing task, and the data state of the upstream task is not 'all data transmission completed', the loop moves on to the next upstream task;
if an upstream task is a batch data processing task and its data state is neither 'a batch of data has been sent' nor 'all data transmission completed', the loop moves on to the next upstream task;
otherwise, the data structure required by the downstream task is automatically adapted and the downstream data processing task is run.
If any task enters the 'failed' or 'terminated' state, task flow scheduling ends and the final state is 'failed'.
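A minimal sketch of this scheduling decision is given below. It assumes task objects that expose task_type, task_state and data_state attributes using the enumerations sketched earlier; the helper name can_start is hypothetical.

```python
# A minimal sketch of the cooperative scheduling decision described in the steps above.
def can_start(downstream, upstream_tasks) -> bool:
    """Decide whether a downstream task may start, per the state rules above."""
    for up in upstream_tasks:
        # All upstream tasks must be running or completed.
        if up.task_state not in (TaskState.RUNNING, TaskState.COMPLETED):
            return False
        if up.task_type is TaskType.STREAM:
            if downstream.task_type is TaskType.STREAM:
                # Stream feeding stream: one piece of data (or all data) must have been sent.
                if up.data_state not in (DataState.ONE_RECORD_SENT, DataState.ALL_DATA_SENT):
                    return False
            else:
                # Stream feeding batch: wait until all data transmission is completed.
                if up.data_state is not DataState.ALL_DATA_SENT:
                    return False
        else:
            # Batch upstream: at least one batch (or all data) must have been sent.
            if up.data_state not in (DataState.ONE_BATCH_SENT, DataState.ALL_DATA_SENT):
                return False
    return True
```

In this sketch the scheduler would call can_start for each downstream task of the current node and either start it or keep running the current node, mirroring steps S1 to S3 above.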
As a specific embodiment, when a data processing task flow starts, a node without an upstream task in a job task flow chart is searched as a start node, and a first data processing task node is run, for example, a "MySQL data reading task" and a "FTP file downloading task" are searched, and the two tasks are started to run;
for the MySQL data reading task, the task state is changed into running, after the first piece of data is read, the data state is 'sent one piece of data', the downstream task is searched, the downstream task is 'data deduplication task', the 'data deduplication task' is a batch processing task, and batch deduplication calculation can be started only when all data are sent completely, so that the 'data deduplication task' state is 'not started'.
For the FTP file downloading task, the task state is changed into 'running', when a first file is downloaded, the data state is 'sent data', a downstream task is searched, the downstream task is 'file content analysis task', the task is a stream processing task, the 'file content analysis task' can be run, the state is 'running', when the first data of the file is analyzed and sent to the downstream, the data state is 'sent data', the downstream task is continuously searched, and the downstream task is 'data association task'. Because the data association task is a batch processing task, the operation can be started only when the data transmission of the upstream task, namely the data deduplication task and the file content analysis task, is finished and the task state is successful;
When the data state of the 'MySQL data reading task' is 'all data transmission completed' and its task state is 'success', the 'data deduplication task' starts to run; its input is automatically converted into the batch data structure record required by a batch processing task, and data deduplication is performed.
When the data states of the 'data deduplication task' and the 'file content analysis task' upstream of the 'data association task' are both 'all data transmission completed' and their task states are 'success', the 'data association task' starts to run.
When the data state of the 'data association task' is 'all data transmission completed' and its task state is 'success', the downstream task is searched; the downstream task is the 'write Oracle database task', which starts to run. Since the 'write Oracle database task' is a stream processing task, the output of the upstream 'data association task' is automatically converted into a stream data structure and fed into the 'write Oracle database task'.
When the data state of the 'write Oracle database task' is 'all data transmission completed' and its task state is 'success', and because there is no further downstream task, the whole task scheduling process is completed with the state 'success'.
If a task raises an exception during execution, for example an incorrect data type or a value that must not be empty, the task scheduling process is terminated with the state 'failure'.
Furthermore, the data management module also comprises an access control unit and a data access unit;
the access control unit is used for managing user group authority, and the authority comprises a storage right, a management right and a use right; preferably, the user authority management function of the S3 distributed object storage architecture can be used for authority management.
The data access unit is used for managing data accessed from external service data sources, including marking and classifying the accessed data and constructing a data directory. Specifically, the accessed data is classified and labeled according to data structure type, data form, update period and the like. By data structure type, data can be divided into structured, semi-structured, unstructured and other types; by data form, into data tables, pictures, videos, audio, texts and other forms; and by update period, into non-updating data, irregularly updated data, and data updated in real time or by minute, hour, day, week, month, quarter, half year, year and so on. After the data is classified and labeled, a data directory is constructed for the accessed data so as to organize and manage it and better support functions such as search and statistics.
The computing engine module is used for processing the data in the centralized storage module through different computing engines based on the data processing task requirements, and for pushing the processed data or storing it in the centralized storage module based on the data processing task requirements. Preferably, the computing engine module may include computing engines for data reading, data cleaning, data conversion, data loading, data writing, machine learning and the like; existing computing engines may be adopted, and independently developed or trained computing engines may also be integrated.
Based on different service data sources, the data lake system automatically adapts the data structures required by different data processing task types through the above data processing task scheduling method, so that users need not be aware of task types when stream data processing tasks and batch data processing tasks are mixed and arranged. This improves the efficiency of data access and output, and further enables complex data exchange among multiple service systems and data fusion and sharing across systems, departments, organizations and centers.
In summary, the data lake system based on cooperative scheduling processing of streaming data and batch data cooperatively controls the data processing tasks through the data state and the task state, automatically adapts the data structures required by different data processing task types when streaming and batch data processing tasks are mixedly arranged, and decides whether to wait or to run downstream data processing tasks synchronously according to the data state and the task type. When arranging tasks, the user only needs to care about the service logic and not about technical problems such as data exchange and data structure adaptation between tasks. This improves data access and output efficiency, solves problems of usability and efficiency, satisfies complex scenarios in which streaming and batch data processing tasks are mixed and arranged, and realizes a convenient and efficient data lake system.
Those skilled in the art will appreciate that all or part of the processes for implementing the methods of the embodiments described above can be implemented by a computer program, stored in a computer-readable storage medium, instructing the relevant hardware. The computer-readable storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory or the like.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. A data lake system based on cooperative scheduling processing of stream data and batch data, characterized in that the data lake system is constructed based on data of a plurality of service data sources, and the system comprises: a centralized storage module, a computing engine module and a data management module; wherein:
the centralized storage module is used for storing the data of each service data source in a classified manner;
the data management module comprises a task flow management unit, the task flow management unit is used for arranging data processing tasks based on application requirements, constructing a DAG directed acyclic graph, and scheduling the data processing tasks of each node in the directed acyclic graph by using a corresponding computing engine in the computing engine module based on a preset data processing task cooperative scheduling method;
the computing engine module is used for processing the data in the centralized storage module through different computing engines based on the data processing task requirements, and pushing or storing the processed data in the centralized storage module based on the data processing task requirements.
2. The data lake system of claim 1, wherein the data management module further comprises an access control unit and a data access unit;
the access control unit is used for managing user group authorities, and the authorities comprise a storage authority, a management authority and a use authority;
the data access unit is used for managing data accessed by an external service data source, and comprises marking and classifying the accessed data and constructing a data directory.
3. The data lake system of claim 1, wherein the application requirements comprise data access requirements for external business data sources and user application requirements for processing data in the centralized storage module;
the service data source comprises a service database and data collected by real-time data collection equipment.
4. The data lake system of claim 1, wherein the data lake system is constructed based on an S3 distributed object storage architecture, wherein the stored data comprises raw data, process data, and metadata.
5. The data lake system of claim 1, wherein the preset data processing task collaborative scheduling method comprises the following steps:
determining the task type, task state and data state of the data processing task of each node in the directed acyclic graph;
acquiring one or more downstream tasks of a current node, and adapting a data structure required by the downstream tasks based on the task types of the downstream tasks;
and judging whether to start running the downstream task or not based on the task state and the data state of the current node and the task type of the downstream task of the current node so as to carry out cooperative control scheduling of the data processing task.
6. The data lake system of claim 5,
the task types of the data processing tasks comprise stream data processing tasks and batch data processing tasks;
the task states include: not started, running, completed, failed, and terminated;
for the stream data processing task, after reading a piece of data, setting the data state as: a piece of data has been sent; and after all data are read, setting the data state as follows: all data transmission is completed;
for a batch data processing task, after reading a batch of data, setting the data state as: a batch of data is sent; and after all data are read, setting the data state as follows: all data transmission is completed.
7. The data lake system of claim 5, wherein adapting the data structure required by the downstream task based on the task type of the downstream task comprises:
if the downstream task is a batch data processing task, generating data output by the current node into a batch data structure record and outputting the batch data structure record to the downstream task; the batch data structure is a unified package of batch data, and comprises: recordSchema, datafile, and fieldDelimiter;
if the downstream task is a stream data processing task, generating a stream data structure record of the data output by the current node and outputting the stream data structure record to the downstream task; the stream data structure is a unified encapsulation of stream data, comprising: recordSchema, values, and recordBytes.
8. The data lake system of claim 7, wherein the generating of the data output by the current node as a batch data structure record and outputting the data to a downstream task comprises:
if the task of the current node is a batch data processing task, directly outputting the data processed by the current node to a downstream task;
and if the current node is a streaming data processing task, creating a batch data structure file, sequentially adding streaming data output by the current node into the batch data structure file, and obtaining a batch data structure record according to whether the output of the streaming data processing task of the current node is finished or a preset threshold value and outputting the batch data structure record to a downstream task.
9. The data lake system of claim 7, wherein the generating the data output by the current node into a stream data structure record and outputting the stream data structure record to a downstream task comprises:
if the task of the current node is a streaming data processing task, directly outputting the data processed by the current node to a downstream task;
and if the task of the current node is a batch data processing task, reading file contents contained in the batch data structure record output by the current node according to lines, converting each piece of data read according to lines into a stream data structure, and outputting the stream data structure record to a downstream task.
10. The data lake system of any one of claims 5 to 9, wherein the determining whether to begin running a downstream task comprises:
if the tasks of the current node are all in-progress or completed task states, and: the task types of the task of the current node and the downstream task are both stream data processing tasks, and the data state is that one piece of data is sent or all data is sent; or the task of the current node is a streaming data processing task, the downstream task is a batch processing task, and the data state is that all data transmission is completed; or the task of the current node is a batch processing task, and the data state is that all data reading is completed; executing the downstream task;
otherwise, continuing to execute the task of the current node.
CN202211329376.8A 2022-10-27 2022-10-27 Data lake system based on cooperative scheduling processing of stream data and batch data Active CN115599524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211329376.8A CN115599524B (en) 2022-10-27 2022-10-27 Data lake system based on cooperative scheduling processing of stream data and batch data

Publications (2)

Publication Number Publication Date
CN115599524A true CN115599524A (en) 2023-01-13
CN115599524B CN115599524B (en) 2023-06-09

Family

ID=84851608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211329376.8A Active CN115599524B (en) 2022-10-27 2022-10-27 Data lake system based on cooperative scheduling processing of stream data and batch data

Country Status (1)

Country Link
CN (1) CN115599524B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117289924A (en) * 2023-10-13 2023-12-26 河北云在信息技术服务有限公司 Visual task scheduling system and method based on Flink

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677752A (en) * 2015-12-30 2016-06-15 深圳先进技术研究院 Streaming computing and batch computing combined processing system and method
CN109684377A (en) * 2018-12-13 2019-04-26 深圳市思迪信息技术股份有限公司 General big data handles development platform and its data processing method in real time
CN110516000A (en) * 2019-09-02 2019-11-29 中山大学 A kind of Workflow Management System for supporting complex work flow structure
CN112130812A (en) * 2020-08-04 2020-12-25 中科天玑数据科技股份有限公司 Analysis model construction method and system based on data stream mixed arrangement
CN112597200A (en) * 2020-12-22 2021-04-02 南京三眼精灵信息技术有限公司 Batch and streaming combined data processing method and device
US20210374143A1 (en) * 2020-05-29 2021-12-02 Rn Technologies, Llc Real-time processing of a data stream using a graph-based data model
CN114691762A (en) * 2020-12-28 2022-07-01 苏州盈天地资讯科技有限公司 Intelligent construction method for enterprise data
CN114819631A (en) * 2022-04-25 2022-07-29 中国平安人寿保险股份有限公司 Multitask visualization method and device, computer equipment and storage medium
CN115168037A (en) * 2022-07-04 2022-10-11 湖南兴盛优选电子商务有限公司 Distributed task scheduling method and system integrating stream tasks and batch tasks


Also Published As

Publication number Publication date
CN115599524B (en) 2023-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant