CN113821538A

CN113821538A - Streaming data processing system based on metadata

Info

Publication number: CN113821538A
Application number: CN202110996670.3A
Authority: CN
Inventors: 陶志强; 魏晟坤; 蒲凌云; 马新成
Original assignee: Chinaccs Information Industry Co ltd
Current assignee: Chinaccs Information Industry Co ltd
Priority date: 2021-08-27
Filing date: 2021-08-27
Publication date: 2021-12-21
Anticipated expiration: 2041-08-27
Also published as: CN113821538B

Abstract

The invention provides a metadata-based streaming data processing system and belongs to the technical field of data processing. The technical scheme is as follows: a metadata-based streaming data processing system comprises a metadata management module, a flow management module, a flow scheduling module and a flow execution module, wherein the metadata management module can define the structure of a meta-model according to the form and storage mode of the data; the flow management module comprises a flow management component and a flow arrangement component; the flow scheduling module can acquire a flow execution diagram, distribute execution tasks according to the flow execution diagram and send corresponding execution instructions; and the flow execution module can receive an execution instruction, execute the corresponding execution task and start the corresponding operator to begin calculation. The advantages are that metadata definition, flow definition and flow arrangement are achieved through visualization, and a streaming processing flow is created by means of drag-and-drop components and model-driven logic.

Description

Streaming data processing system based on metadata

Technical Field

The invention relates to the technical field of data processing, in particular to a streaming data processing system based on metadata.

Background

At present, with the rapid development of internet and internet of things technology, information browsing and retrieval, data reporting by internet of things devices, e-commerce and many other everyday products have all moved online. Real-time requirements keep rising, and information interaction and communication are evolving from single points into information chains and even information networks. This inevitably brings cross-correlation of data across dimensions and makes data explosion unavoidable. Streaming processing has therefore come to the fore, addressing the problems of real-time frameworks and large-scale data computation.

At present, real-time processing requires mastery of various big data and stream processing technologies, which places high demands on the skills of developers. When there are many streaming data processing services, unified management is lacking; services evolve rapidly and requirements change quickly, yet a new data processing flow cannot be designed, nor an existing flow modified, quickly.

With the rapid development of industries such as intelligent security, smart cities, smart agriculture and intelligent transportation, a large amount of data has accumulated. The aggregation and governance of this data are usually completed separately by each platform, which generates a large amount of repeated work; the data lacks management and is difficult to utilize fully.

What is needed is a general streaming data processing system suitable for various industries, in which the aggregation, interfacing, storage, management, pushing, sharing and visualization of data can be completed dynamically through the configuration of metadata. In an enterprise, wherever data exists, corresponding metadata exists. Only with complete and accurate metadata can the data be better understood and its value fully mined, so that the research and development of industry applications can focus on the business itself rather than on complex data processing and governance.

Disclosure of Invention

In view of the above problems in the prior art, an object of the present invention is to provide a metadata-based streaming data processing system that implements metadata definition, flow definition and flow arrangement through visualization, and creates a streaming processing flow by means of drag-and-drop components and model-driven logic.

The invention is realized by the following technical scheme: a metadata-based streaming data processing system comprises a metadata management module, a flow management module, a flow scheduling module and a flow execution module, wherein the metadata management module can define the structure of a meta-model according to the form and storage mode of the data; the flow management module comprises a flow management component and a flow arrangement component, wherein the flow arrangement component embeds a flow arrangement interactive interface, visual flow arrangement is carried out on the flow arrangement interactive interface by dragging and dropping operators, and a streaming processing flow and corresponding flow metadata are formed once the arrangement is finished; the flow management component can acquire the flow metadata and parse the flow metadata into a flow execution diagram;

the flow scheduling module can acquire a flow execution diagram, distribute execution tasks according to the flow execution diagram and send corresponding execution instructions;

and the flow execution module can receive the execution instruction, execute the corresponding execution task and start the corresponding operator to start calculation.

Further, the meta-model includes: an interface meta-model, a logical meta-model, a physical meta-model; defining a structure of corresponding metadata according to the meta-model, the metadata including: interface metadata, logical metadata, physical metadata.

Further, the operators are implemented using Flink or Springboot and include: a convergence operator, a conversion operator, a distribution operator, an aggregation operator and a push operator.

Further, the specific steps of the visual flow arrangement are as follows in sequence: 1. dragging and dropping the configuration information of the source data and the convergence operator onto the flow arrangement interactive interface, and associating the source data with interface metadata to obtain an interface model; 3. dragging and dropping the conversion operator, automatically taking the upstream model as the input model, taking the downstream model as the output model if one exists, or otherwise matching against the logical metadata to obtain a selected model, graphically configuring the relationship between the input model and the output model, and associating fields by drawing connecting lines to form a mapping rule; 4. dragging and dropping the distribution operator, and configuring the input model and output model of the aggregation operator to implement windowed aggregation; 5. dragging and dropping a push operator, configuring an output model, and reading the target physical model if target data exists; 6. dragging and dropping the configuration information of the target data, and configuring a physical model, a target data type and target data connection information; 7. connecting steps 1-6 through graphical operations to form a streaming processing flow.

Further, a task execution program is embedded in the flow execution module; the task execution program receives an execution request, starts the corresponding operator, and sends execution parameters to the operator; after receiving the execution parameters, the operator communicates with the task execution program through the REST protocol to acquire the interface metadata, logical metadata, physical metadata and mapping rule; after acquiring the metadata, the operator automatically generates conversion code and SQL (structured query language) for storing the results according to the metadata, and starts calculation using a stream processing engine; and after the operator finishes executing, it sends a completion or failure instruction to the task execution program and passes the calculation result to the operator run by the next node.

Further, the flow management module further comprises a monitoring alarm component; the task execution program monitors the execution state of the operator at regular intervals and feeds the execution state back to the monitoring alarm component; and after receiving the completion or failure instruction, the task execution program reports the information to the monitoring alarm component in real time.

A streaming data processing method of a metadata-based streaming data processing system, comprising the steps of:

S1, defining the meta-model structure according to the type and storage mode of the source data, and defining the metadata structure according to the meta-model;

S2, creating a streaming processing flow, and arranging the flow by dragging and dropping operators in the visual flow arrangement;

S3, selecting a conversion operator in the streaming processing flow, and performing graphical mapping field configuration on the input model and output model of the conversion operator to generate a mapping rule;

S4, bringing the configured streaming processing flow online;

S5, executing the streaming processing flow, and sending a start command and the flow metadata to the flow management component;

S6, after receiving the start command, the flow management component parses the flow metadata into a flow execution diagram and sends the flow execution diagram to the flow scheduling module;

S7, the flow scheduling module allocates one or more execution nodes of the flow execution module according to the flow execution diagram, and sends an execution request to the task execution program on each node;

S8, the task execution program receives the execution request, starts the corresponding operator and sends execution parameters to the operator;

S9, after receiving the execution parameters, the operator program communicates with the task execution program through the REST protocol to obtain the interface metadata, logical metadata, physical metadata and mapping rule;

S10, after obtaining the metadata, the operator program automatically generates conversion code and SQL for storing the results according to the metadata, and starts calculation;

S11, the task execution program monitors the execution state of the operator at regular intervals and feeds the execution state back to the monitoring alarm component;

S12, after the operator program finishes executing, it sends a completion or failure command to the task execution program and passes the result set to the operator run by the next node;

S13, after receiving the completion or failure command, the task execution program reports the information to the monitoring alarm component in real time.

The beneficial effects of the invention are as follows: data aggregation, storage, management, pushing, sharing and visualization can be completed dynamically through metadata configuration and graphical flow arrangement; when a new streaming data processing service needs to go online, a unified and efficient streaming data processing service can be provided with only metadata definition, visual flow arrangement and the running of the streaming task, so that the service goes online quickly, the delivery cycle is shortened, the running condition of the streaming task is monitored, and enterprise maintenance costs are effectively reduced. In addition, the invention can keep up with the rapid evolution of business requirements, quickly implement new data processing flows and modify existing ones, and continuously improve the production system; it lets developers focus their application research and development effort on the business rather than on complicated stream processing; it lowers the entry threshold for developers so that real-time computing applications can be built quickly, and it also enables business personnel without a technical background to build streaming processing; for large enterprises, the invention can also reduce the initial cost of IT team training and technical deployment.

Drawings

FIG. 1 is a block diagram of the system structure.

FIG. 2 is a schematic diagram of a meta-model structure.

FIG. 3 is a schematic diagram of an operator execution process.

FIG. 4 is a schematic diagram of the working process inside an operator.

FIG. 5 is a schematic flow chart of the method.

FIG. 6 is an interface metadata example.

FIG. 7 is a logical metadata example.

FIG. 8 is a first physical metadata example.

FIG. 9 is a second physical metadata example.

FIG. 10 is a diagram of the visual flow arrangement.

FIG. 11 is a diagram of a graphical mapping field configuration.

FIG. 12 is a target data example.

Detailed Description

In order to clearly illustrate the technical features of the present solution, the present solution is explained below by way of specific embodiments.

The first embodiment, referring to fig. 1-5, is realized by the following technical scheme: a metadata-based streaming data processing system comprises a metadata management module, a flow management module, a flow scheduling module and a flow execution module, wherein the metadata management module can define the structure of a meta-model according to the form and storage mode of the data; the meta-model includes: an interface meta-model, a logical meta-model and a physical meta-model; the structure of the corresponding metadata is defined according to the meta-model, the metadata including: interface metadata, logical metadata and physical metadata.

The management of metadata helps enterprise personnel see clearly which data the enterprise has and where each item is located, and also helps to sort out the enterprise's data dictionary so that data can be queried and located quickly. As shown in fig. 2, an interface package is defined, in which the meta-models of the interface model are defined, including the meta-metadata of the interface set and the meta-metadata of the JSON interface model. Alternatively, a database is defined, for which a meta-model of the relational database is defined, together with the meta-metadata of the schema, table, view, stored procedure, function, program package, column, primary key and index. The associations between them are defined directly by composition: the schema consists of tables, views, functions, stored procedures and program packages; the table consists of columns, primary keys and indexes; the interface set consists of interface models. In addition, users can also establish their own meta-models as needed.
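By way of non-limiting illustration, the composition relationships described above can be sketched with plain data classes. The class names, attribute names and the `describe` helper below are assumptions introduced for this sketch only; the description does not specify how the meta-model is represented in code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str
    data_type: str


@dataclass
class Table:
    # A table is composed of columns, primary keys and indexes.
    name: str
    columns: List[Column] = field(default_factory=list)
    primary_key: List[str] = field(default_factory=list)
    indexes: List[str] = field(default_factory=list)


@dataclass
class Schema:
    # A schema is composed of tables and views (functions, stored
    # procedures and program packages are omitted to keep the sketch short).
    name: str
    tables: List[Table] = field(default_factory=list)
    views: List[str] = field(default_factory=list)


@dataclass
class InterfaceModel:
    name: str
    fields: List[Column] = field(default_factory=list)


@dataclass
class InterfaceSet:
    # An interface set is composed of interface models.
    name: str
    models: List[InterfaceModel] = field(default_factory=list)


def describe(schema: Schema) -> List[str]:
    """Flatten a schema into 'schema.table.column' paths for quick lookup."""
    return [
        f"{schema.name}.{table.name}.{column.name}"
        for table in schema.tables
        for column in table.columns
    ]


person = Table(
    "person",
    columns=[Column("id", "BIGINT"), Column("name", "VARCHAR(64)")],
    primary_key=["id"],
)
hr = Schema("hr", tables=[person])
print(describe(hr))
```

Flattening the composition tree in this way is one possible basis for the data dictionary lookup mentioned above.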

The flow management module comprises a flow management component and a flow arrangement component, wherein the flow arrangement component embeds a flow arrangement interactive interface, visual flow arrangement is carried out on the flow arrangement interactive interface by dragging and dropping operators, and a streaming processing flow and corresponding flow metadata are formed once the arrangement is finished; the flow management component can acquire the flow metadata and parse the flow metadata into a flow execution diagram.
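By way of non-limiting illustration, parsing flow metadata into a flow execution diagram can be sketched as building a directed graph and deriving an order in which operators may be started. The dictionary layout of `flow_metadata` (the `nodes`, `edges`, `id`, `from` and `to` keys) and both function names are assumptions made for this sketch; the description does not define the format of the flow metadata.

```python
from collections import deque
from typing import Dict, List


def build_execution_graph(flow_metadata: dict) -> Dict[str, List[str]]:
    """Turn node and edge records into an adjacency list."""
    graph: Dict[str, List[str]] = {node["id"]: [] for node in flow_metadata["nodes"]}
    for edge in flow_metadata["edges"]:
        graph[edge["from"]].append(edge["to"])
    return graph


def execution_order(graph: Dict[str, List[str]]) -> List[str]:
    """Topologically sort the graph (Kahn's algorithm)."""
    indegree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1
    ready = deque(node for node in graph if indegree[node] == 0)
    order: List[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in graph[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    if len(order) != len(graph):
        raise ValueError("flow contains a cycle")
    return order


# Hypothetical flow: one source branch fanned out to two operators.
flow_metadata = {
    "nodes": [
        {"id": "converge"},
        {"id": "distribute"},
        {"id": "map"},
        {"id": "aggregate"},
        {"id": "push"},
    ],
    "edges": [
        {"from": "converge", "to": "distribute"},
        {"from": "distribute", "to": "map"},
        {"from": "distribute", "to": "aggregate"},
        {"from": "map", "to": "push"},
        {"from": "aggregate", "to": "push"},
    ],
}
graph = build_execution_graph(flow_metadata)
print(execution_order(graph))
```

A scheduler could walk this order to assign execution tasks to nodes; rejecting cyclic flows at parse time keeps an invalid arrangement from reaching the execution stage.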

As shown in fig. 3, each operator is an executable program and can be implemented with different underlying technologies, such as Flink or Springboot. The operators include: a convergence operator, a conversion operator, a distribution operator, an aggregation operator and a push operator. The convergence operator provides multiple ways of obtaining data and can be customized to support more protocols, such as REST data access over the HTTP protocol, data access over the FTP protocol and data access over the WebSocket protocol. It can also perform functions such as integrated authentication, security encryption and gateway functions. It parses and validates the incoming data against the interface model defined by the metadata and sends the data to Kafka for processing by the next node. The distribution operator receives data from the upstream message queue and either copies the data to several topics of the message queue or sends it to different topics in a load-balancing manner. Several downstream operators can then be connected; when the downstream operators receive different data, their processing results differ. The distribution operator can also land the data directly in a data lake, which can be a relational database or a non-relational database. The stream processing operator performs specific calculations on the data. It can be implemented in different ways and can also be extended by customization to realize a specific stream processing operator, performing operations such as aggregation, Map conversion, splitting a column into multiple rows, row-to-column transposition, duplicate record removal, field selection, string replacement, sorting, script-based processing, formula-based processing, field encryption, field decryption and data sampling.
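By way of non-limiting illustration, the two distribution modes of the distribution operator (copying every record to each topic, or spreading records over the topics in a load-balancing manner) can be sketched as follows. In-memory lists stand in for message queue topics, round-robin is assumed as the load-balancing strategy, and the `distribute` function and its `mode` values are hypothetical names introduced for this sketch.

```python
from itertools import cycle
from typing import Dict, List, Sequence


def distribute(records: Sequence, topics: Sequence[str], mode: str = "copy") -> Dict[str, List]:
    """Route records to topics either by copying or by round-robin balancing."""
    out: Dict[str, List] = {topic: [] for topic in topics}
    if mode == "copy":
        # Every topic receives a copy of every record.
        for record in records:
            for topic in topics:
                out[topic].append(record)
    elif mode == "balance":
        # Each record goes to exactly one topic, taken in rotation.
        for record, topic in zip(records, cycle(topics)):
            out[topic].append(record)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out


print(distribute([1, 2, 3], ["topic_a", "topic_b"], mode="copy"))
print(distribute([1, 2, 3], ["topic_a", "topic_b"], mode="balance"))
```

In copy mode each downstream operator sees the full stream, whereas in balance mode each downstream operator sees a disjoint share, which is why the downstream results can differ.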

The Map conversion operator is one kind of stream processing operator. An input model and an output model are set first, and the connecting lines between the fields of the input model and the output model can be configured graphically, thereby establishing a conversion relationship; when data passes through the Map conversion operator, it is converted from the input model into the output model. The aggregation operator is another kind of stream processing operator. The grouping fields and the aggregation window are set first, then the aggregation fields are set, with an aggregation method such as sum, average, maximum, minimum or count chosen for each field other than the grouping fields; the data stream is aggregated into new data, which is delivered to the next node for processing.
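By way of non-limiting illustration, a mapping rule and a windowed aggregation can be sketched in a few lines. The sketch assumes that a mapping rule is a dictionary from output field to input field, that the aggregation window is a fixed-size (tumbling) window keyed on a numeric timestamp field named `ts`, and that the records are available as a finite list; all function and field names are hypothetical, and a real deployment would delegate this work to a stream processing engine.

```python
from collections import defaultdict
from typing import Dict, List

AGGREGATORS = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "max": max,
    "min": min,
    "count": len,
}


def apply_mapping(record: dict, mapping_rule: Dict[str, str]) -> dict:
    """Convert a record from the input model to the output model."""
    return {out_field: record[in_field] for out_field, in_field in mapping_rule.items()}


def aggregate(records: List[dict], group_field: str, value_field: str,
              method: str, window_seconds: int, time_field: str = "ts") -> List[dict]:
    """Group records by window start and grouping field, then aggregate."""
    buckets = defaultdict(list)
    for record in records:
        window_start = record[time_field] - record[time_field] % window_seconds
        buckets[(window_start, record[group_field])].append(record[value_field])
    fn = AGGREGATORS[method]
    return [
        {"window_start": start, group_field: group, value_field: fn(values)}
        for (start, group), values in sorted(buckets.items())
    ]


raw = [
    {"dept_name": "ops", "pay": 10, "ts": 0},
    {"dept_name": "ops", "pay": 20, "ts": 30},
    {"dept_name": "dev", "pay": 5, "ts": 61},
]
rule = {"dept": "dept_name", "salary": "pay", "ts": "ts"}
mapped = [apply_mapping(record, rule) for record in raw]
print(aggregate(mapped, "dept", "salary", "sum", window_seconds=60))
```

Keeping the mapping rule as data rather than code mirrors the graphical field connections described above: redrawing a connecting line changes only the dictionary, not the operator.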

The push operator can push the data received from upstream into a database or into a file, and also supports the HTTP protocol, the FTP protocol and the WebSocket protocol. Support for other protocols can be developed by the user. The operator pushes the data to a specific storage target, which can be a database or a third-party platform interface.

The working process inside an operator is shown in fig. 4. Arbitrary data enters an operator program. A Map mapping operator program identifies the data content according to the interface metadata, converts the data content into the logical model according to the mapping rule, and delivers the result set to the next node or nodes for further processing. An aggregation operator program identifies the data content of the upstream node according to the logical metadata, aggregates the data into the physical model according to the mapping rule, and stores it in the corresponding database.

The specific steps of the visual flow arrangement are as follows in sequence: 1. dragging and dropping the configuration information of the source data and the convergence operator onto the flow arrangement interactive interface, and associating the source data with interface metadata to obtain an interface model; 3. dragging and dropping the conversion operator, automatically taking the upstream model as the input model, taking the downstream model as the output model if one exists, or otherwise matching against the logical metadata to obtain a selected model, graphically configuring the relationship between the input model and the output model, and associating fields by drawing connecting lines to form a mapping rule; 4. dragging and dropping the distribution operator, and configuring the input model and output model of the aggregation operator to implement windowed aggregation; 5. dragging and dropping a push operator, configuring an output model, and reading the target physical model if target data exists; 6. dragging and dropping the configuration information of the target data, and configuring a physical model, a target data type and target data connection information; 7. connecting steps 1-6 through graphical operations to form a streaming processing flow. Different operators can be combined according to business requirements to realize different computing functions.

A task execution program is embedded in the flow execution module. The task execution program receives an execution request, starts the corresponding operator, and sends execution parameters to the operator. After receiving the execution parameters, the operator communicates with the task execution program through the REST protocol to acquire the interface metadata, logical metadata, physical metadata and mapping rule. After acquiring the metadata, the operator automatically generates conversion code and SQL (structured query language) for storing the results according to the metadata, and starts calculation using a stream processing engine. After the operator finishes executing, it sends a completion or failure instruction to the task execution program and passes the calculation result to the operator run by the next node.
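By way of non-limiting illustration, generating the SQL for storing results from physical metadata and a mapping rule can be sketched as follows. The dictionary layout of `physical_model`, the direction of the mapping rule (physical column to logical field) and the `generate_insert_sql` helper are assumptions made for this sketch; an in-memory SQLite database from the Python standard library stands in for the target database.

```python
import sqlite3
from typing import Dict


def generate_insert_sql(physical_model: dict, mapping_rule: Dict[str, str]) -> str:
    """Build a parameterized INSERT from physical metadata.

    Table and column names are interpolated directly, so they must come
    from trusted metadata; values are bound through named placeholders.
    """
    columns = [c["name"] for c in physical_model["columns"] if c["name"] in mapping_rule]
    placeholders = ", ".join(f":{mapping_rule[column]}" for column in columns)
    return (
        f"INSERT INTO {physical_model['table']} "
        f"({', '.join(columns)}) VALUES ({placeholders})"
    )


# Hypothetical physical metadata and mapping rule.
physical_model = {
    "table": "employee",
    "columns": [
        {"name": "emp_id", "type": "INTEGER"},
        {"name": "emp_name", "type": "TEXT"},
    ],
}
mapping_rule = {"emp_id": "id", "emp_name": "name"}

sql = generate_insert_sql(physical_model, mapping_rule)
print(sql)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, emp_name TEXT)")
conn.execute(sql, {"id": 1, "name": "Ada"})
rows = conn.execute("SELECT emp_id, emp_name FROM employee").fetchall()
print(rows)
```

Binding values through placeholders rather than string concatenation keeps the generated statement reusable for every record in the stream.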

With reference to fig. 3, the physical-layer execution flow of the streaming processing flow is described as follows. The convergence operator can access a variety of data sources, such as WebSocket, HTTP and FTP, and is also responsible for authentication, security and gateway functions. The convergence operator sends the data to the message queue Kafka. After receiving the messages, the downstream distribution operator distributes the data to different Kafka topics, thereby branching the flow. After receiving the data, a downstream stream processing operator can further aggregate it and send it to a downstream push operator, and the push operator can in turn push the data to different targets, such as HTTP, FTP, a database, an output file or WebSocket.

As shown in fig. 1, the flow management module further comprises a monitoring alarm component. The task execution program monitors the execution state of the operator at regular intervals and feeds the execution state back to the monitoring alarm component; after receiving the completion or failure instruction, the task execution program reports the information to the monitoring alarm component in real time.
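By way of non-limiting illustration, the two reporting paths described above (periodic state feedback and immediate reporting on completion or failure) can be sketched as follows. The class names, method names and state values are assumptions made for this sketch; the timer that would call `poll` at regular intervals is left out so that the example stays deterministic.

```python
from enum import Enum
from typing import Dict, List, Tuple


class State(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


class MonitoringAlarm:
    """Hypothetical stand-in for the monitoring alarm component."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, State]] = []

    def report(self, operator_id: str, state: State) -> None:
        self.events.append((operator_id, state))

    def failed_operators(self) -> List[str]:
        return [op for op, state in self.events if state is State.FAILED]


class TaskExecutor:
    """Hypothetical stand-in for the task execution program."""

    def __init__(self, alarm: MonitoringAlarm) -> None:
        self.alarm = alarm
        self.states: Dict[str, State] = {}

    def start(self, operator_id: str) -> None:
        self.states[operator_id] = State.RUNNING

    def poll(self) -> None:
        # Intended to be called at regular intervals: feeds back every known state.
        for operator_id, state in self.states.items():
            self.alarm.report(operator_id, state)

    def on_finished(self, operator_id: str, succeeded: bool) -> None:
        # Called when an operator sends its completion or failure instruction.
        state = State.COMPLETED if succeeded else State.FAILED
        self.states[operator_id] = state
        self.alarm.report(operator_id, state)


alarm = MonitoringAlarm()
executor = TaskExecutor(alarm)
executor.start("map-1")
executor.start("push-1")
executor.poll()
executor.on_finished("map-1", succeeded=True)
executor.on_finished("push-1", succeeded=False)
print(alarm.failed_operators())
```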

As shown in fig. 5 to fig. 12, the second embodiment is a practical application of a streaming data processing method based on the present system, which specifically comprises the following steps:

step 1: using the built-in JSON interface meta-model and relational database meta-model;

step 2: defining interface metadata; see fig. 6;

step 3: defining logical metadata; see fig. 7;

step 4: defining physical metadata; see fig. 8 and fig. 9;

step 5: creating a new processing flow with the following contents:

flow name: personnel data aggregation;

flow state: draft;

step 6: arranging the flow by dragging and dropping operators in the visual flow arrangement; see fig. 10;

step 7: selecting the operator that requires a conversion operation and opening the model relation in its attributes to perform graphical mapping field configuration on the module; see fig. 11;

step 8: clicking the online button;

step 9: clicking the execute flow button;

step 10: delivering the target data to the convergence gateway interface using Postman; see fig. 12;

step 11: checking the databases associated with target data 1 and target data 2; it can be seen that apploye 1 and apploye 2 in the two target data contain the data from step 10.

The big data stream processing method of the present invention is reliable, highly scalable and highly fault-tolerant. The embodiments of the present invention are intended to cover all substitutions, modifications and variations that fall within the broad scope of the appended claims. Therefore, any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

In the description of the present invention, the foregoing detailed description has set forth various embodiments of the devices and/or processes via the use of block diagrams, flowcharts, and/or examples. To the extent that such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, those skilled in the art will appreciate that each function and/or operation within such block diagrams, flowcharts, or examples can be implemented, individually and/or collectively, by a wide range of different hardware, software, firmware, or virtually any combination thereof.

There is little difference between hardware and software implementations of aspects of the system; the use of hardware or software is typically (but not always, since in some scenarios the choice between hardware and software may become important) a design choice representing a cost versus efficiency tradeoff. There are various means (e.g., hardware, software, and/or firmware) by which the processes and/or systems and/or other techniques described herein can be implemented, and the preferred means will vary from one scenario in which the processes and/or systems and/or other techniques are deployed to another. For example, if the implementer determines that speed and accuracy are paramount, the implementer may opt for a mainly hardware and/or firmware approach; if flexibility is paramount, the implementer may opt for a mainly software implementation; or, yet again alternatively, the implementer may opt for some combination of hardware, software, and/or firmware.

Claims

1. A metadata-based streaming data processing system, comprising:

the metadata management module can define the structure of the meta-model according to the form and the storage mode of the data;

2. The metadata-based streaming data processing system of claim 1, wherein the meta-model comprises: an interface meta-model, a logical meta-model, a physical meta-model; defining a structure of corresponding metadata according to the meta-model, the metadata including: interface metadata, logical metadata, physical metadata.

3. The metadata-based streaming data processing system of claim 2, wherein the operators comprise: a convergence operator, a conversion operator, a distribution operator, an aggregation operator and a push operator.

4. The metadata-based streaming data processing system according to claim 3, wherein the visual flow arrangement comprises the following specific steps in sequence: 1. dragging and dropping the configuration information of the source data and the convergence operator onto the flow arrangement interactive interface, and associating the source data with interface metadata to obtain an interface model; 3. dragging and dropping the conversion operator, automatically taking the upstream model as the input model, taking the downstream model as the output model if one exists, or otherwise matching against the logical metadata to obtain a selected model, graphically configuring the relationship between the input model and the output model, and associating fields by drawing connecting lines to form a mapping rule; 4. dragging and dropping the distribution operator, and configuring the input model and output model of the aggregation operator to implement windowed aggregation; 5. dragging and dropping a push operator, configuring an output model, and reading the target physical model if target data exists; 6. dragging and dropping the configuration information of the target data, and configuring a physical model, a target data type and target data connection information; 7. connecting steps 1-6 through graphical operations to form a streaming processing flow.

5. The metadata-based streaming data processing system according to claim 4, wherein a task execution program is embedded inside the flow execution module, and the task execution program receives an execution request, starts executing a corresponding operator, and sends an execution parameter to the operator; after receiving the execution parameters, the operator communicates with the task execution program to obtain interface metadata, logic metadata, physical metadata and mapping rules; the operator starts to calculate after acquiring the metadata; and after the operator is executed, sending a completion or failure instruction to the task execution program, and transmitting a calculation result to an operator operated by the next node.

6. The metadata-based streaming data processing system according to claim 5, wherein the flow management module further comprises a monitoring alarm component, and the task execution program monitors the execution state of the operator at regular intervals and feeds the execution state back to the monitoring alarm component; and after receiving the completion or failure instruction, the task execution program reports the information to the monitoring alarm component in real time.

7. The streaming data processing method of the metadata based streaming data processing system according to claim 6, comprising the steps of:

S4, bringing the configured streaming processing flow online;