CN116401025A

CN116401025A - Data processing system and data processing method

Info

Publication number: CN116401025A
Application number: CN202310248860.6A
Authority: CN
Inventors: 赵浩霖; 杨伟康
Original assignee: Hangzhou Ezviz Software Co Ltd
Current assignee: Hangzhou Ezviz Software Co Ltd
Priority date: 2023-03-10
Filing date: 2023-03-10
Publication date: 2023-07-07

Abstract

The embodiments of the present application provide a data processing system and a data processing method, applied to the field of information technology. In the system, a management control unit generates a task to be executed according to data source information, output target information and a task configuration operation; a distributed task coordination unit selects a target master data computing unit for executing the task; and the task is executed by the target master data computing unit together with target slave data computing units, thereby processing the data. For different service requirements and data sources, the corresponding task to be executed can be generated from the corresponding data source information, output target information and task configuration operation, so no new data processing system needs to be developed. This can improve data processing efficiency and reduce data processing cost.

Description

Data processing system and data processing method

Technical Field

The present disclosure relates to the field of information technologies, and in particular, to a data processing system and a data processing method.

Background

With the advent of the Internet of Things, many smart devices capable of interacting over the Internet have appeared. When a user interacts with such devices, device time-series data is generated. Such data is usually processed by having professional big data engineers develop a dedicated data processing system for each service requirement, so that the data can be computed in real time and service results output in real time.

However, user service requirements are varied and change frequently, and data sources are diverse. Developing a dedicated data processing system for each service is inefficient and makes data processing costly.

Disclosure of Invention

An object of an embodiment of the present application is to provide a data processing system and a data processing method, so as to solve at least one of the above problems. The specific technical scheme is as follows:

In a first aspect of the embodiments of the present application, a data processing system is provided, the system comprising:

a management control unit, a distributed task coordination unit and a plurality of data calculation units, wherein the plurality of data calculation units comprise a plurality of master data calculation units and a plurality of slave data calculation units;

the management control unit is used for acquiring data source information, output target information and task configuration operation input by a user; generating an input source table and an output target table according to the data source information and the output target information; generating a task to be executed according to the task configuration operation, the input source table and the output target table;

the distributed task coordination unit is used for selecting, from the plurality of master data calculation units, a target master data calculation unit for executing the task to be executed;

the management control unit is further configured to issue the task to be executed to the target master data computing unit;

the target master data computing unit is used for selecting a plurality of target slave data computing units from the plurality of slave data computing units according to the task to be executed; generating sub-tasks to be executed of the target slave data computing units according to the tasks to be executed, and issuing corresponding sub-tasks to be executed to the target slave data computing units respectively;

the target slave data computing unit is used for acquiring data to be processed according to the sub-task to be executed received by the target slave data computing unit and processing the data to be processed to obtain a data processing sub-result;

the target master data computing unit is further configured to obtain the data processing sub-results sent by the target slave data computing units, and to aggregate the data processing sub-results to obtain a data processing result.

In one possible implementation, the task configuration operation includes a structured query language statement of a business;

the management control unit is specifically configured to verify the structured query language statement, the input source table, and the output target table; and under the condition that the verification passes, generating a task to be executed according to the structured query language statement, the input source table and the output target table.

In one possible implementation, the task configuration operation includes a drag operation and a parameter input operation;

the management control unit is specifically configured to display the input source table, the output target table and a preset aggregation calculation function; in response to the drag operation, defining parameters in a specified preset aggregate computing function as specified items in the input source table and specified items in the output target table respectively; responding to parameter input operation, and assigning a value to a specified parameter in the specified preset aggregation calculation function; converting the appointed preset aggregation calculation function into a structured query language statement; checking the structured query language statement, the input source table and the output target table; and under the condition that the verification passes, generating a task to be executed according to the structured query language statement, the input source table and the output target table.
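As a minimal sketch of the conversion step described above, the following Python snippet renders a configured preset aggregate function as a structured query language statement. The class `AggregateConfig`, the function `to_sql`, the set `PRESET_FUNCTIONS` and all table and field names are hypothetical and are not taken from the application; the sketch also assumes that the parameter entered by the user is a grouping field.

```python
from dataclasses import dataclass

# Hypothetical set of preset aggregate functions offered in the interface.
PRESET_FUNCTIONS = {"COUNT", "SUM", "AVG", "MIN", "MAX"}


@dataclass
class AggregateConfig:
    function: str      # preset aggregate function chosen by the user
    source_table: str  # input source table
    source_field: str  # item dragged from the input source table
    target_table: str  # output target table
    target_field: str  # item dragged from the output target table
    group_by: str      # value supplied through the parameter input operation


def to_sql(cfg: AggregateConfig) -> str:
    """Render the configured aggregate function as an INSERT ... SELECT statement."""
    func = cfg.function.upper()
    if func not in PRESET_FUNCTIONS:
        raise ValueError(f"unknown preset function: {cfg.function}")
    names = (cfg.source_table, cfg.source_field, cfg.target_table,
             cfg.target_field, cfg.group_by)
    for name in names:
        # Reject anything that is not a plain identifier before building SQL text.
        if not name.isidentifier():
            raise ValueError(f"invalid identifier: {name!r}")
    return (
        f"INSERT INTO {cfg.target_table} ({cfg.group_by}, {cfg.target_field}) "
        f"SELECT {cfg.group_by}, {func}({cfg.source_field}) "
        f"FROM {cfg.source_table} GROUP BY {cfg.group_by}"
    )


cfg = AggregateConfig("sum", "device_events", "duration",
                      "device_stats", "total_duration", "device_id")
sql = to_sql(cfg)
```

The generated statement would then go through the same verification as a statement typed in directly by the user.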

In a possible implementation manner, the management control unit is specifically configured to perform syntax and parameter verification on the input source table, and, when that verification passes, perform data pull verification according to the input source table; perform syntax and parameter verification on the output target table, and, when that verification passes, perform data output verification according to the output target table; and perform syntax and parameter verification on the structured query language statement.
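The two-stage verification above can be sketched as follows: a static syntax and parameter check runs first, and the external data pull or data output check runs only if the static check passes. All names here (`REQUIRED_PARAMS`, `check_table_definition`, `verify_table`, `verify_task`) and the dictionary layout of a table definition are assumptions made for illustration; the statement check is a placeholder for a real SQL parser.

```python
from typing import Callable, Dict, List

# Hypothetical connector parameters that every table definition must carry.
REQUIRED_PARAMS = ("connector", "topic")


def check_table_definition(table: Dict) -> List[str]:
    """Syntax and parameter check: return a list of problems (empty means pass)."""
    errors = []
    if not table.get("name"):
        errors.append("missing table name")
    if not table.get("fields"):
        errors.append("no fields defined")
    for key in REQUIRED_PARAMS:
        if key not in table.get("params", {}):
            errors.append(f"missing connector parameter: {key}")
    return errors


def verify_table(table: Dict, probe: Callable[[Dict], bool]) -> List[str]:
    """Run the static check first; contact the external system only if it passes."""
    errors = check_table_definition(table)
    if errors:
        return errors
    return [] if probe(table) else [f"connectivity check failed for {table['name']}"]


def verify_task(source: Dict, target: Dict, sql: str,
                pull_probe: Callable[[Dict], bool],
                output_probe: Callable[[Dict], bool]) -> List[str]:
    """Verify the input source table, the output target table and the statement."""
    errors = verify_table(source, pull_probe) + verify_table(target, output_probe)
    # Placeholder statement check; a real system would parse the statement.
    if not sql.strip().upper().startswith(("INSERT", "SELECT")):
        errors.append("statement must start with INSERT or SELECT")
    return errors
```

Passing the probes in as callables keeps the sketch independent of any particular data source; in practice `pull_probe` might read a sample record and `output_probe` might write a test record.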

In one possible embodiment, the task to be executed is a work operation diagram;

the management control unit is specifically configured to construct a plurality of subtasks according to the structured query language statement, the input source table and the output target table when the verification passes; generate a work operation diagram based on the subtasks; and send the work operation diagram to the target master data computing unit in the form of a byte stream.
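A minimal sketch of building a work operation diagram from subtasks and sending it as a byte stream is shown below. The application does not state the encoding, so JSON over UTF-8 is an assumption here, as are the names `SubTask`, `build_job_graph`, `to_bytes` and `from_bytes`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SubTask:
    task_id: str
    operation: str  # e.g. "source", "aggregate", "sink"
    depends_on: List[str] = field(default_factory=list)


def build_job_graph(subtasks: List[SubTask]) -> dict:
    """Assemble subtasks into a graph, rejecting references to unknown subtasks."""
    ids = {t.task_id for t in subtasks}
    for t in subtasks:
        missing = [d for d in t.depends_on if d not in ids]
        if missing:
            raise ValueError(f"{t.task_id} depends on unknown subtasks: {missing}")
    return {"nodes": [asdict(t) for t in subtasks]}


def to_bytes(graph: dict) -> bytes:
    """Serialize the graph so it can be sent to the target master unit."""
    return json.dumps(graph, sort_keys=True).encode("utf-8")


def from_bytes(payload: bytes) -> dict:
    """Rebuild the graph on the receiving side."""
    return json.loads(payload.decode("utf-8"))


graph = build_job_graph([
    SubTask("read", "source"),
    SubTask("sum", "aggregate", ["read"]),
    SubTask("write", "sink", ["sum"]),
])
payload = to_bytes(graph)
```

The receiving side calls `from_bytes(payload)` to recover the same structure before scheduling the subtasks.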

In a possible implementation manner, the management control unit is specifically configured to combine the subtasks that share the same data source into one combined task; and to arrange the combined task and the uncombined subtasks and process their special fields, so as to generate the work operation diagram.
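The merging step can be sketched as a grouping by data source: subtasks that read the same source become one combined task, and the rest stay as they are. The function name `merge_by_source` and the dictionary layout of a subtask are hypothetical; the arrangement and special-field processing mentioned above are not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def merge_by_source(subtasks: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Group subtasks by data source; sources with several subtasks are combined."""
    groups = defaultdict(list)
    for task in subtasks:
        groups[task["source"]].append(task)

    combined, standalone = [], []
    for source, tasks in groups.items():
        if len(tasks) > 1:
            # One combined task reads the shared source once for all its members.
            combined.append({"source": source, "subtasks": [t["id"] for t in tasks]})
        else:
            standalone.append(tasks[0])
    return combined, standalone


combined, standalone = merge_by_source([
    {"id": "t1", "source": "device_events"},
    {"id": "t2", "source": "device_events"},
    {"id": "t3", "source": "user_profiles"},
])
```

Combining subtasks this way means the shared source only has to be registered and read once, which matches the registry of data sources described next.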

In a possible implementation manner, the management control unit is further configured to generate a registry of data sources of the combined task;

when executing the combined task, the target master data computing unit is specifically configured to acquire the data to be processed from the registry of the data source of the combined task.

In one possible implementation, the data processing system further includes: a metadata storage unit;

the target master data calculation unit is further used for generating execution state information of the task to be executed and sending the execution state information to the management control unit;

the management control unit is further configured to store the input source table, the output target table, and the execution state information in the metadata storage unit.

In one possible embodiment, the system further comprises: a data storage unit and a data display unit;

the data storage unit is used for receiving and storing the data processing result sent by the target main data calculation unit;

the data display unit is used for acquiring the data processing result from the data storage unit and displaying the data processing result on a designated interface.

In a second aspect of the embodiments of the present application, a data processing method is provided and applied to a data processing system, where the data processing system includes a management control unit, a distributed task coordination unit, and a plurality of data calculation units, where the plurality of data calculation units includes a plurality of master data calculation units and a plurality of slave data calculation units; the method comprises the following steps:

the management control unit acquires data source information, output target information and task configuration operation input by a user; generating an input source table and an output target table according to the data source information and the output target information; generating a task to be executed according to the task configuration operation, the input source table and the output target table;

the distributed task coordination unit selects, from the plurality of master data calculation units, a target master data calculation unit for executing the task to be executed;

the management control unit issues the task to be executed to the target main data computing unit;

the target master data computing unit selects a plurality of target slave data computing units from the plurality of slave data computing units according to the task to be executed; generating sub-tasks to be executed of the target slave data computing units according to the tasks to be executed, and issuing corresponding sub-tasks to be executed to the target slave data computing units respectively;

the target slave data computing unit acquires the data to be processed according to the sub-task to be executed that it has received, and processes the data to be processed to obtain a data processing sub-result;

the target master data computing unit obtains the data processing sub-results sent by the target slave data computing units, and aggregates the data processing sub-results to obtain a data processing result.

In one possible implementation, the task configuration operation includes a structured query language statement of a business; generating a task to be executed according to the task configuration operation, the input source table and the output target table comprises:

the management control unit checks the structured query language statement, the input source table and the output target table;

and under the condition that the verification passes, generating a task to be executed according to the structured query language statement, the input source table and the output target table.

In one possible implementation, the task configuration operation includes a drag operation and a parameter input operation; generating a task to be executed according to the task configuration operation, the input source table and the output target table comprises:

the management control unit displays the input source table, the output target table and a preset aggregation calculation function;

in response to the drag operation, defining parameters in a specified preset aggregate computing function as specified items in the input source table and specified items in the output target table respectively;

responding to parameter input operation, and assigning a value to a specified parameter in the specified preset aggregation calculation function;

converting the appointed preset aggregation calculation function into a structured query language statement;

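checking the structured query language statement, the input source table and the output target table;

and under the condition that the verification passes, generating a task to be executed according to the structured query language statement, the input source table and the output target table.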

In a possible implementation manner, checking the structured query language statement, the input source table and the output target table comprises:

the management control unit performs grammar and parameter verification on the input source table, and performs data pulling verification according to the input source table under the condition that the grammar and the parameter verification of the input source table pass;

performing grammar and parameter verification on the output target table, and performing data output verification according to the output target table under the condition that the grammar and the parameter verification of the output target table pass;

and carrying out grammar and parameter verification on the structured query language statement.

In one possible embodiment, the task to be executed is a work operation diagram; the issuing the task to be executed to the master data computing unit and the slave data computing unit based on the result selected by the computing unit includes:

the management control unit establishes a plurality of subtasks according to the structured query language statement, the input source table and the output target table under the condition that the verification passes;

generating a work operation diagram based on the subtasks;

and sending the work operation diagram to the target master data computing unit in the form of a byte stream.

In a possible implementation manner, generating the work operation diagram based on the subtasks comprises:

the management control unit merges the subtasks that share the same data source into one combined task; and arranges the combined task and the uncombined subtasks and processes their special fields, so as to generate the work operation diagram.

In one possible embodiment, the method further comprises:

the management control unit generates a registry of the data sources of the combined task;

and when executing the combined task, the target master data calculation unit acquires the data to be processed from the registry of the data source of the combined task.

In one possible implementation, the data processing system further includes: a metadata storage unit; the method further comprises the steps of:

the target main data calculation unit generates execution state information of the task to be executed and sends the execution state information to the management control unit;

the management control unit stores the input source table, the output target table, and the execution state information in the metadata storage unit.

In one possible embodiment, the system further comprises: a data storage unit and a data display unit; the method further comprises the steps of:

the data storage unit receives and stores the data processing result sent by the target main data calculation unit;

the data display unit acquires the data processing result from the data storage unit and displays the data processing result on a designated interface.

The beneficial effects of the embodiment of the application are that:

the data processing system and the data processing method provided by the embodiment of the application can be used for acquiring the data source information, the output target information and the task configuration operation input by a user through the management control unit; generating an input source table and an output target table according to the data source information and the output target information; generating a task to be executed according to the task configuration operation, the input source table and the output target table; the distributed task coordination unit is used for selecting a target main data calculation unit for executing the task to be executed from the plurality of main data calculation units; the management control unit is further configured to issue the task to be executed to the target main data computing unit; the target master data computing unit is used for selecting a plurality of target slave data computing units from the plurality of slave data computing units according to the task to be executed; generating sub-tasks to be executed of the target slave data computing units according to the tasks to be executed, and issuing corresponding sub-tasks to be executed to the target slave data computing units respectively; the target slave data computing unit is used for acquiring data to be processed according to the sub-task to be executed received by the target slave data computing unit and processing the data to be processed to obtain a data processing sub-result; the target main data computing unit is further configured to obtain the data processing sub-results sent by the target slave data computing unit, and aggregate the data processing sub-results to obtain a data processing result. 
By applying the system of the embodiment of the application, the management control unit can generate the task to be executed according to the data source information, the output target information and the task configuration operation; and selecting a target master data computing unit for executing the task to be executed by the distributed task coordination unit, and executing the task to be executed by utilizing the target master and slave data computing units, thereby realizing the processing of the data. Aiming at different service demands and data sources, corresponding tasks to be executed can be generated through corresponding data source information, output target information and task configuration operation, a data processing system is not required to be developed again, the data processing efficiency can be improved, and the data processing cost is reduced.

Of course, not all of the above-described advantages need be achieved simultaneously in practicing any one of the products or methods of the present application.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art may derive other embodiments from these drawings.

FIG. 1 is a schematic diagram of a data processing system according to an embodiment of the present application;

FIG. 2 is a schematic diagram of input data source information according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a DDL statement that creates and defines a data source table, provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of output target table information according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a DDL statement defined by a table structure of an output target table according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a task configuration operation provided in an embodiment of the present application;

FIG. 7 is another schematic diagram of a task configuration operation provided in an embodiment of the present application;

FIG. 8 is a flowchart for checking an input source table or an output target table according to an embodiment of the present application;

FIG. 9 is a schematic flow chart of the logic for checking an input source table or an output target table according to the embodiment of the present application;

FIG. 10 is a flowchart of verifying a structured query language statement provided in an embodiment of the present application;

FIG. 11 is a flow chart of the logic for verifying a structured query language statement provided in an embodiment of the present application;

FIG. 12 is a flowchart of a data processing method according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein fall within the protection scope of the present application.

For a clearer explanation of the technical solutions of the present application, the following is an explanation of technical terms that may be used in the present application:

The internet, also called the international network, refers to a huge network formed by interconnecting networks. These networks are linked by a common set of protocols to form a logically single, worldwide network.

The Internet originated with ARPANET. In general usage, lowercase "internet" refers to any interconnected network of networks, while capitalized "Internet" refers specifically to the global Internet. This method of interconnecting computer networks is called "internetworking", and the worldwide interconnected network that developed on this basis is the Internet. The Internet is not the same as the World Wide Web: the World Wide Web is a global system based on hypertext links and is only one of the services the Internet provides.

The Internet of Things (IoT) refers to using devices and technologies such as information sensors, radio frequency identification, global positioning systems, infrared sensors and laser scanners to collect, in real time, information about any object or process that needs to be monitored, connected or interacted with, including sound, light, heat, electricity, mechanics, chemistry, biology and position. Through the various possible forms of network access, it realizes ubiquitous connection between things and between things and people, and enables intelligent sensing, identification and management of objects and processes. The Internet of Things is an information carrier based on the Internet, traditional telecommunication networks and the like, which lets all ordinary physical objects that can be independently addressed form an interconnected network.

Big data, also called massive data, refers to data sets so large that mainstream software tools cannot capture, manage, process and organize them within a reasonable time into information that supports more proactive business decisions.

Cloud computing is a type of distributed computing in which a huge data computing program is decomposed, through the network "cloud", into numerous small programs; a system of multiple servers then processes and analyzes these small programs and returns the results to the user. Early cloud computing was, simply put, simple distributed computing: it distributed tasks and merged the computed results. For this reason, cloud computing is also referred to as grid computing. With this technique, tens of thousands of data items can be processed within a short time (a few seconds), enabling powerful network services.

Cloud computing also refers to a system with very strong computing capacity formed by a computer network (mostly the Internet), which can store and aggregate related resources, configure them on demand, and provide personalized services to users. With the development of Internet technology, today's cloud services are no longer just distributed computing, but the result of the combined evolution and leap forward of computer technologies such as distributed computing, utility computing, load balancing, parallel computing, network storage, hot-backup redundancy and virtualization.

Stream computing: in a traditional data processing flow, data is first collected and placed in a database, and people later query the database to obtain answers or perform related processing. This seems reasonable, but it falls short in some cases, especially in real-time search applications, where offline processing in the style of MapReduce does not solve the problem well. This has led to a new data computing architecture, the stream computing model, which analyzes large-scale streaming data in real time while the data is continuously changing, captures potentially useful information, and sends the results to the next computing node.
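To make the contrast with collect-then-query processing concrete, the following sketch aggregates events as they arrive and emits a result as soon as each fixed-size time window closes. It is an illustration of the stream computing idea only, not part of the claimed system; the function name `tumbling_sum`, the event layout and the assumption that events arrive in timestamp order are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# An event is (timestamp in seconds, device id, value).
Event = Tuple[int, str, float]


def tumbling_sum(events: Iterable[Event],
                 window: int) -> Iterator[Tuple[int, Dict[str, float]]]:
    """Yield (window_start, per-device sums) as soon as each window closes.

    Assumes events arrive in timestamp order.
    """
    current_start = None
    sums: Dict[str, float] = {}
    for ts, device, value in events:
        start = ts - ts % window
        if current_start is None:
            current_start = start
        if start != current_start:
            # The previous window is complete: emit it without waiting for more data.
            yield current_start, sums
            current_start, sums = start, {}
        sums[device] = sums.get(device, 0.0) + value
    if current_start is not None:
        yield current_start, sums


events = [(0, "a", 1.0), (3, "b", 2.0), (4, "a", 1.5), (10, "a", 4.0), (12, "b", 1.0)]
results = list(tumbling_sum(events, window=10))
```

Because `tumbling_sum` is a generator, each window's result can be forwarded to the next computing node while later events are still arriving.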

Apache Flink is an open-source stream processing framework developed by the Apache Software Foundation. Its core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime system can execute both batch and stream processing programs. In addition, the Flink runtime natively supports the execution of iterative algorithms.

With the rapid development of computer technology, business data across industries is growing exponentially, and the volume of data to be stored and computed is very large. Storing and computing such massive data is a real test. Data types and structures are also becoming more complex and varied, including commonly used structured data, semi-structured data, and unstructured data such as audio and video data, streaming media data, and vector or graph data. This poses challenges for Internet storage and computing resources. At the same time, users of networked services now expect more from them, for example recommendations of goods or resources that better match their wishes; their real-time expectations are higher, and they are more sensitive to information and data.

Meanwhile, with the rapid development of the Internet of Things, including smart-device IoT and the Internet of Vehicles, enterprises serving user needs no longer interact with users only through web pages or handheld terminals, as in the Internet era; users also interact in real time with the IoT objects or devices around them. In this new IoT era, device time-series data is generated as users interact with devices, and its volume is growing explosively compared with the Internet era.

In the IoT era, time-series data from all kinds of devices is synchronized in real time (or near real time) to a local area network or the cloud (possibly the edge), where it is computed in real time as big data; after computation, service results are output in real time to respond to users' different service requirements. This scenario places new demands and challenges on big data computing, and correspondingly on big data practitioners. It raises the professional requirements and the entry threshold of the big data industry. In the long run this is detrimental to the industry's development, since an industry needs more talent to participate in order to grow, and its technologies and tools need to be verified in practice in more business scenarios.

In a first aspect of an embodiment of the present application, there is provided a data processing system, as shown in fig. 1, including:

a management control unit 101, a distributed task coordination unit 102, and a plurality of data calculation units 103, wherein the plurality of data calculation units include a plurality of master data calculation units 1031 and a plurality of slave data calculation units 1032;

a management control unit 101, configured to obtain data source information, output target information, and task configuration operation input by a user; generating an input source table and an output target table according to the data source information and the output target information; and generating a task to be executed according to the task configuration operation, the input source table and the output target table.

The distributed task coordination unit 102 is configured to select, from the plurality of master data calculation units 1031, a target master data calculation unit for executing the task to be executed.

The management control unit 101 is further configured to issue the task to be executed to the target master data calculation unit.

The target master data calculation unit is configured to select a plurality of target slave data calculation units from the plurality of slave data calculation units 1032 according to the task to be executed; to generate a sub-task to be executed for each target slave data calculation unit according to the task to be executed; and to issue the corresponding sub-task to be executed to each target slave data calculation unit.

Each target slave data calculation unit is configured to acquire the data to be processed according to the sub-task to be executed that it has received, and to process the data to be processed to obtain a data processing sub-result.

The target master data calculation unit is further configured to obtain the data processing sub-results sent by the target slave data calculation units, and to aggregate the data processing sub-results to obtain the data processing result.

By applying the system of the embodiment of the application, the management control unit can generate the task to be executed according to the data source information, the output target information and the task configuration operation; and selecting a target master data computing unit for executing the task to be executed by the distributed task coordination unit, and executing the task to be executed by utilizing the target master and slave data computing units, thereby realizing the processing of the data. Aiming at different service demands and data sources, corresponding tasks to be executed can be generated through corresponding data source information, output target information and task configuration operation, a data processing system is not required to be developed again, the data processing efficiency can be improved, and the data processing cost is reduced.
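The division of work between the units can be sketched as a scatter-gather pattern. In the sketch below, threads stand in for slave data calculation units, summation stands in for the actual data processing, and the coordination unit picks the least-loaded master; the application does not state a selection policy, so that policy and all function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def select_master(load_by_master: Dict[str, int]) -> str:
    """Coordination unit: pick the least-loaded master (hypothetical policy)."""
    return min(load_by_master, key=load_by_master.get)


def split_task(records: List[int], n_workers: int) -> List[List[int]]:
    """Master: derive one sub-task per selected slave unit."""
    return [records[i::n_workers] for i in range(n_workers)]


def run_subtask(chunk: List[int]) -> int:
    """Slave: process its share of the data and return a sub-result."""
    return sum(chunk)


def run_task(records: List[int], n_workers: int) -> int:
    """Master: issue the sub-tasks, collect the sub-results and aggregate them."""
    chunks = split_task(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        sub_results = list(pool.map(run_subtask, chunks))
    return sum(sub_results)


master = select_master({"master-1": 3, "master-2": 1, "master-3": 2})
total = run_task(list(range(1, 101)), n_workers=4)
```

Here `master` is `"master-2"` and `total` is 5050, the same value a single unit would compute over the whole input.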

The management control unit, the distributed task coordination unit, and the data calculation unit in the embodiments of the present application are described in detail below.

In one example, the management control unit may perform the construction configuration by installing JDK8 (Java SE Development Kit, JAVA programming language development), tomcat8 (Web application server with open source code), a system installation war package (compressed package), and the like, specifically, may perform the configuration of relevant system parameters and environment parameters after installing JDK8, tomcat8, and system installation war package, respectively start the service processes of each node, observe the start state, and confirm the successful installation through the Web interface (software interface) after confirming the successful start. The process of installing and constructing the management control unit can refer to related technologies, and is not repeated in the application.

The data source information is the configuration registration information of an input source table. The data source information input by a user may include a data structure definition of the source data, a table structure definition DDL (Data Definition Language) statement of the source data, the connector-related parameters defined for the source data, a JSON (JavaScript Object Notation) structure definition of the converted source data, a field definition of the converted source data, and the like. As shown in fig. 2, the user inputs the data source information in an interface by selection or by typing. The source data is the data in a data source. The data structure definition of the input source table in the data source information may refer to table 1 below. It is understood that the data structure definition in table 1 is only an example, and those skilled in the art may set the data structure according to actual situations, which is still within the protection scope of the present application.

TABLE 1

In one example, the DDL statement defining the table structure of the source data may be as shown in fig. 3. The DDL statement that creates the data source table definition is divided into two parts. The first part is the DDL statement that creates the data source table structure; it mainly defines the table name, the field names, and the field types of the source data table, and may also define the time field, the watermark generation mechanism, and so on. The second part configures the connector parameters. In fig. 3, the Kafka connector is taken as an example, and the relevant connector parameters are configured by specifying the Kafka version, the topic, the startup consumption mode, the connection address of the Kafka cluster brokers (message brokers), the consumer group name, the deserialization format, and the like.
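The two-part layout described above can be illustrated with a minimal sketch. Fig. 3 is not reproduced here, so the table name, field names, and connector values below are hypothetical, and the option keys merely follow the general style of a Flink Kafka SQL connector; they are assumptions, not the configuration used in the application.

```python
def build_table_ddl(table, fields, connector, watermark=None):
    """Render a CREATE TABLE statement: structure part first, connector part second."""
    lines = [f"  `{name}` {ftype}" for name, ftype in fields]
    if watermark:
        # Optional watermark clause defined on the time field.
        lines.append(f"  WATERMARK FOR `{watermark[0]}` AS {watermark[1]}")
    options = ",\n".join(f"  '{key}' = '{value}'" for key, value in connector.items())
    return f"CREATE TABLE `{table}` (\n" + ",\n".join(lines) + f"\n) WITH (\n{options}\n)"


# Hypothetical source table and connector settings, for illustration only.
ddl = build_table_ddl(
    "device_events",
    [("device_id", "STRING"), ("event_type", "STRING"), ("ts", "TIMESTAMP(3)")],
    {
        "connector": "kafka",
        "topic": "device-events",
        "properties.bootstrap.servers": "broker-1:9092",
        "properties.group.id": "demo-group",
        "scan.startup.mode": "latest-offset",
        "format": "json",
    },
    watermark=("ts", "`ts` - INTERVAL '5' SECOND"),
)
print(ddl)
```

Keeping the field list and the connector options as separate inputs mirrors the split between the structure part and the connector part of the statement.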

The output target information is the configuration registration information of the output target table and is input in a manner similar to the data source information, as shown in fig. 4. Table 1 above may also be referred to when configuring the data structure. The output target information must include the table structure definition DDL statement of the output target table and the connector parameter configuration of the output target table. Specifically, referring to fig. 5, the first part is the DDL statement defining the table structure of the output target table, and the second part configures the connector parameters of the output target table. Taking the Kafka connector as an example again, the relevant connector parameters are configured by specifying the Kafka version, the topic, the startup consumption mode, the connection address of the Kafka cluster brokers (message brokers), the consumer group name, the deserialization format, and the like.

The task configuration operation may configure the task with various kinds of information and in various ways, for example, the data structure of the task, the language structure of the task, the input mode of the task, and the like. Specifically, the data structure of the task may be as shown in table 2 below:

TABLE 2

The distributed coordination unit can select the target master data computing unit according to the task to be executed, or the description information of the task to be executed, in combination with the type, load, network conditions, and the like of the computing units. For example, for tasks with many parallel operations, such as image processing tasks, a GPU (Graphics Processing Unit) data computing unit may be selected, and for tasks with fewer parallel operations but complex computation, a CPU (Central Processing Unit) data computing unit may be selected. In addition, the distributed coordination unit can select the target master data computing unit according to the remaining computing resources of the data computing units, the network conditions, and the like, and monitor the work of each task node (data computing unit) through the target master data computing unit so as to ensure the data consistency of distributed operation. The specific manner of selecting the target master data computing unit may refer to the selection of master and slave nodes in a distributed service or cluster, and is not specifically limited in the present application. After the target master data computing unit is selected, it may select the target slave data computing units according to a predetermined allocation rule, which is generally a hash rule.
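As a rough illustration of the selection logic described above, the following sketch picks a master by unit type and remaining resources and then assigns subtasks to slaves by hashing. The `ComputeUnit` fields, the "most free slots" criterion, and the use of an MD5 digest are assumptions made for the example; the application does not specify them.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ComputeUnit:
    name: str
    kind: str        # "GPU" or "CPU"
    free_slots: int  # hypothetical measure of remaining computing resources


def select_master(units, preferred_kind):
    """Prefer units of the requested type, then pick the one with the most free resources."""
    candidates = [u for u in units if u.kind == preferred_kind] or list(units)
    return max(candidates, key=lambda u: u.free_slots)


def assign_slave(subtask_id, slaves):
    """Hash rule: a stable digest of the subtask id chooses the slave."""
    digest = hashlib.md5(subtask_id.encode("utf-8")).hexdigest()
    return slaves[int(digest, 16) % len(slaves)]


units = [ComputeUnit("m1", "CPU", 4), ComputeUnit("m2", "GPU", 2), ComputeUnit("m3", "GPU", 6)]
master = select_master(units, "GPU")  # a GPU unit for a highly parallel task
slave = assign_slave("subtask-1", ["s1", "s2", "s3"])
print(master.name, slave)
```

A stable digest is used instead of Python's built-in `hash`, because the built-in hash of a string varies between interpreter runs and would not give a repeatable allocation.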

The distributed coordination unit can be built by installing a distributed cluster, for example, an Apache Zookeeper cluster, a Hadoop cluster (Apache Hadoop, a distributed file management system), Redis (Remote Dictionary Server), and the like. For example, an Apache Zookeeper cluster is installed on servers to build the distributed coordination unit, and a three-node deployment mode is adopted. The deployed version is Apache Zookeeper 3.4.6, and Apache Zookeeper depends on JDK 1.8 or later. The installation, deployment, management, and use of the distributed coordination unit may refer to related technologies.

After the distributed coordination unit selects the target master data computing unit, the target master data computing unit selects the target slave data computing units that will execute the subtasks. The management control unit can locate the target master data computing unit and the target slave data computing units through their respective description information. It then sends the task to be executed to the target master data computing unit, which in turn sends the subtasks to be executed to the target slave data computing units. In practical applications, the management control unit needs to maintain a connection with the data computing units. For example, heartbeat detection can be used to monitor whether the connection between the management control unit and a data computing unit is maintained, so as to ensure that the data computing unit remains reachable.
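The heartbeat detection mentioned above can be sketched as follows. The `HeartbeatMonitor` class, the timeout value, and the unit names are hypothetical; the application only states that heartbeat detection may be used, not how it is implemented.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each data computing unit."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def beat(self, unit, now):
        # Record that a heartbeat from `unit` arrived at time `now`.
        self.last_seen[unit] = now

    def alive(self, unit, now):
        # A unit is considered connected if it reported within the timeout window.
        seen = self.last_seen.get(unit)
        return seen is not None and now - seen <= self.timeout


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.beat("master-1", now=100.0)
print(monitor.alive("master-1", now=105.0))  # within the timeout window
print(monitor.alive("master-1", now=111.0))  # heartbeat is overdue
```

Timestamps are passed in explicitly so that the check is easy to exercise; a real deployment would more likely read a monotonic clock.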

The target master data computing unit and the target slave data computing unit obtain the data to be processed from a data source, for example, the data source may be a hard disk, an online database, a message queue middleware, a file system, a relational and non-relational database, a search engine cluster, and the like. The target main data computing unit and the target slave data computing unit can extract data to be processed from a data source in the process of executing a task to be executed and a subtask to be executed, clean the extracted data to be processed according to task requirements, select source data participating in analysis and computation, and finally obtain a data processing result.

In one example, the data computing unit may be built by installing a data computing cluster. The data computing cluster may be of multiple types, for example, a Storm cluster (Apache Storm, a data stream processing system), Kafka, Hadoop, or an Apache Flink cluster (an open-source stream processing framework).

By the system, the data source information and the output target information can be configured in advance to generate the input source table and the output target table meeting the requirements, so that a user can generate related tasks to be executed only by inputting the information meeting the requirements, the technical requirements on the user are further reduced, and finally the personnel cost for data processing is reduced.

In one possible implementation, the task configuration operation includes a structured query language (Structured Query Language, SQL) statement of the business;

the management control unit is specifically used for verifying the structured query language statement, the input source table and the output target table; and under the condition that the verification passes, generating a task to be executed according to the structured query language statement, the input source table and the output target table.

The management control unit may receive a task configuration operation that the user performs by inputting an SQL statement. In practical application, the management control unit can perform syntax and parameter verification on the SQL statement, the input source table, and the output target table. When the syntax and parameter verification of the SQL statement, the input source table, and the output target table all pass, the addition of the SQL statement is completed, and a task to be executed is generated according to the SQL statement, the input source table, and the output target table. Specifically, as shown in fig. 6, after the parameters corresponding to the task configuration operation are added and the corresponding options are selected, the Flink SQL statement for the aggregate calculation is entered in the input box.

With this system, through the dynamic configuration operation, a user only needs to be familiar with SQL syntax; once the statement passes syntax verification, the system generates the task to be executed. This reduces the professional requirements on the user and the personnel cost of data processing.

the management control unit is specifically used for displaying an input source table, an output target table and a preset aggregation calculation function; in response to a drag operation, defining parameters in a specified preset aggregate calculation function as specified items in an input source table and specified items in an output target table, respectively; responding to the parameter input operation, and assigning a value to a specified parameter in a specified preset aggregation calculation function; converting the appointed preset aggregation calculation function into a structured query language statement; checking the structured query language statement, the input source table and the output target table; and under the condition that the verification passes, generating a task to be executed according to the structured query language statement, the input source table and the output target table.

In practical application, the data source information or the output target information can be input by dragging and selecting the input source table, the output target table and the preset aggregation calculation function by a mouse. As shown in fig. 7, the user only needs to select a designated preset aggregate calculation function for calculation from a plurality of preset aggregate calculation functions through mouse operation; selecting specified items from the input source table and the output target table, defining parameters in a specified preset aggregation calculation function as the specified items, for example, defining parameters a in the specified preset aggregation calculation function as age items in the input source table and the like; in addition, the user can assign a value to the specified parameter in the specified preset aggregation calculation function through the parameter input operation; and finally, converting the appointed preset aggregate computing function of the parameter definition and the parameter assignment into a corresponding SQL statement.
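The conversion from a drag-configured preset aggregation function to an SQL statement might look like the sketch below. The set of preset functions, the `aggregate_to_sql` helper, and the table and item names are all hypothetical, and the generated statement shape (a grouped `INSERT INTO ... SELECT`) is an assumption chosen for illustration.

```python
PRESET_FUNCTIONS = {"COUNT", "SUM", "AVG", "MIN", "MAX"}


def aggregate_to_sql(func, source_table, source_item, target_table, target_item, group_by):
    """Turn a preset aggregation function with bound items into one SQL statement."""
    func = func.upper()
    if func not in PRESET_FUNCTIONS:
        raise ValueError(f"unsupported preset function: {func}")
    return (
        f"INSERT INTO {target_table} "
        f"SELECT {group_by}, {func}({source_item}) AS {target_item} "
        f"FROM {source_table} GROUP BY {group_by}"
    )


# The user drags the "age" item onto the function parameter and picks an output item.
sql = aggregate_to_sql("avg", "user_events", "age", "age_stats", "avg_age", "city")
print(sql)
```

Restricting the function name to a fixed set keeps the generated statement limited to the preset functions that the interface offers.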

By applying the system of the embodiment of the application, a user can automatically complete the input of the SQL sentence only by a dragging mode, and does not need to know the grammar environment of the SQL sentence, so that the professional requirements on the user are further reduced, and the personnel cost of data processing is reduced.

In one possible implementation manner, the management control unit is specifically configured to perform syntax and parameter verification on the input source table, and perform data pull verification according to the input source table when the syntax and parameter verification of the input source table passes; carrying out grammar and parameter verification on the output target table, and carrying out data output verification according to the output target table under the condition that the grammar and the parameter verification of the output target table pass; and carrying out grammar and parameter verification on the structured query language statement.

In practice, the input source table and the output target table may be verified through the steps shown in fig. 8:

Step S801: receive the data source information (or output target information) input by the user, and configure the relevant parameters.

Step S802: check the validity of the parameters configured for the input source table (or the output target table).

Step S803: perform pull verification on the data of the input source table (or select one piece of data output to the output target table for verification).

Step S804: send the verification result to the user in real time.

In the data pull verification of the input source table and the data output verification of the output target table, the piece of input data or output data used for verification may be selected at random. It should be noted that the output verification of the output target table is performed after the syntax and parameter verification of the structured query language statement has passed.

Specifically, as shown in fig. 9, after the data source information or the output target information input by the user is obtained, the verification logic first performs syntax verification and parameter verification. If this verification fails, the flow returns so that the data source information or the output target information can be modified. If it passes, one piece of data is pulled from the data source for verification, or one piece of output data is acquired from the output target table for verification. If this second verification passes, the input source table and the output target table are generated according to the data source information and the output target information; if not, the flow returns so that the data source information and the output target information can be modified further or deleted.
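The two-stage check described for fig. 9 can be sketched as a small function. The `verify_table_config` name, the required keys, and the `probe` callback that pulls (or outputs) one record are hypothetical stand-ins; the real system would connect to the configured data source.

```python
def verify_table_config(config, required_keys, probe):
    """Return (passed, message): parameter check first, then a one-record data check."""
    missing = [key for key in required_keys if not config.get(key)]
    if missing:
        return False, "parameter check failed: missing " + ", ".join(missing)
    try:
        record = probe(config)  # pull one record from the source, or read one output record
    except Exception as exc:
        return False, f"data check failed: {exc}"
    if record is None:
        return False, "data check failed: no record returned"
    return True, "verification passed"


required = ["table", "topic", "brokers"]
good = {"table": "device_events", "topic": "device-events", "brokers": "broker-1:9092"}

# A fake probe stands in for the real connection to the data source.
print(verify_table_config(good, required, lambda cfg: {"device_id": "cam-1"}))
print(verify_table_config({"table": "device_events"}, required, lambda cfg: None))
```

The message returned on failure is what would be sent back to the user so that the configuration can be modified or deleted.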

In validating structured query language statements, one can proceed through the steps shown in FIG. 10:

Step S1001: acquire the SQL statement input by the user.

Step S1002: verify the SQL statement and send the verification result to the user.

Step S1003: if the verification passes, generate a task to be executed according to the structured query language statement, the input source table, and the output target table.

Specifically, as shown in fig. 11, the verification logic is as follows. First, the SQL statement input by the user is received and checked for validity; if it is not valid, the flow returns so that the SQL statement can be modified. If the verification passes, a task to be executed is generated according to the structured query language statement, the input source table, and the output target table. Whether the task starts successfully is then identified: if so, the output target information is verified; if not, the flow returns so that the SQL statement can be modified further or the task deleted.

By applying the system of the embodiment of the application, the input source table, the output target table and the structured query language statement can be checked to ensure that the task to be executed can be generated according to the input source table, the output target table and the structured query language statement, so that the accuracy of the task to be executed is improved, and the data processing efficiency is improved.

In one possible embodiment, the task to be executed is a work running chart;

The management control unit is specifically used for constructing a plurality of subtasks according to the structured query language statement, the input source table and the output target table under the condition that the verification passes; generating a work running chart based on each subtask; and sending the work running chart to the target master data computing unit in the form of a byte stream.

Specifically, the data computing unit may be built by installing an Apache Flink cluster. Before the stream computing cluster is installed and deployed, the conditions on which it depends need to be checked. The deployment dependency of Apache Flink is that an Apache Zookeeper cluster must be installed first, so the Apache Zookeeper cluster should be selected when the distributed coordination unit is built. After the environment parameters of Apache Flink are configured, the Apache Flink JobManager (the coordinator of the Flink system) node cluster is started, and then the Apache Flink TaskManager (the executor of the Flink system) working nodes are started one by one. Once the WEB interface of the Apache Flink JobManager confirms that the TaskManager nodes have started successfully and registered with the cluster, the construction of the data computing unit is complete.

In practical application, the work running chart (the Job graph) may be constructed from a Flink request. After the Job graph of the Flink SQL task is constructed, it is submitted through the Flink client pipeline. When the Job graph is constructed, the data source table may be constructed as a DataStreamSource, and the DataStreamSource is then registered as a temporary table.

In practical application, the Job graph is constructed and then converted into a byte stream. After the byte stream of the Job graph is submitted through the job pipeline of the Flink REST API, the Flink SQL task is started and run in the Flink cluster. In this way, the work running chart is sent to the target master data computing unit in the form of a byte stream.

By applying the system of the embodiment of the application, the traditional approach, in which a Flink program or SQL statement is built into a jar package at the client and the task is then submitted and started by having the client submit the jar package, can be avoided. Instead, data is stream-processed with the task submitted in byte stream form, so that data can be analyzed in real time, useful information can be captured, and the accuracy of data processing is improved. No environment for building Java and jar packages needs to be set up at the client, which reduces the cost and difficulty of construction and deployment. In addition, when the Flink cluster is deployed in a streaming mode, little task operation handling needs to be considered inside the cluster: the Flink cluster concentrates on calculation, while more of the operation and management work for the Flink cluster and its tasks is moved into the management control unit. The management control unit also integrates functions such as management and monitoring of the Flink cluster.

In one possible implementation, the management control unit is specifically configured to combine the sub-tasks of the same data source into one combined task; and arranging and processing special fields for the combined task and the non-combined subtasks to generate a work running chart.

In practical applications, multiple tasks may be executed simultaneously, and some of them may read from the same data source. In that case the data source has to be read again each time one of these tasks is executed, which lowers task running efficiency. Therefore, the data source of each subtask can be identified, the subtasks of the same data source are combined into one combined task, and a work running chart is generated for the combined task, so that when the combined task is executed, all of its subtasks can be executed while reading the data source only once.
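A minimal sketch of this grouping step is shown below. The `merge_subtasks` helper, the subtask names, and the source identifiers are hypothetical; the sketch only shows how subtasks sharing a data source could be collected into one combined task while the rest stay separate.

```python
def merge_subtasks(subtasks):
    """Group subtasks by data source; a source shared by several subtasks yields one combined task."""
    by_source = {}
    for name, source in subtasks:
        by_source.setdefault(source, []).append(name)
    plan = []
    for source, names in by_source.items():
        kind = "combined" if len(names) > 1 else "single"
        plan.append({"kind": kind, "source": source, "subtasks": names})
    return plan


plan = merge_subtasks([
    ("count_events", "kafka:device-events"),
    ("avg_battery", "kafka:device-events"),
    ("daily_report", "mysql:orders"),
])
for entry in plan:
    print(entry)
```

With this plan, the shared source is read once for the combined task instead of once per subtask.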

By applying the system provided by the embodiment of the application, the running mode of data processing can be optimized, a plurality of tasks are combined into one combined task, the waste of flow caused by repeated data pulling can be avoided, and the waste of computational power resources can be reduced.

In a possible implementation, the management control unit is further configured to generate a registry of data sources for the combined task;

The target main data computing unit is used for executing the combined task, and is particularly used for acquiring the data to be processed from the registry of the data source of the combined task.

In practical application, the data sources corresponding to the task running diagrams of the combined tasks can be generated into a registry and stored in the system, and the target main data calculation unit can directly acquire the data to be processed from the registry without accessing the external data sources to acquire the data to be processed, so that the data processing efficiency is improved.

By applying the system of the embodiment of the application, a plurality of tasks can share the same data source, and the repeated use of the stream is optimized, so that the flow waste behavior of repeated data pulling and consuming is avoided, and the waste of calculation resources can be reduced.

the target main data calculation unit is also used for generating the execution state information of the task to be executed and sending the execution state information to the management control unit;

and the management control unit is also used for storing the input source table, the output target table and the execution state information into the metadata storage unit.

In practical application, the metadata storage unit can be deployed and built by using open-source MySQL (relational database), and generally, in order to ensure high availability, a cluster with separate read and write and multiple masters and multiple slaves can be built, and meanwhile, the database needs to be ensured to only open permission in a local area network, and no service is provided for the internet.

After receiving the execution state information of the task to be executed, the management control unit may call an API of the relational database to write the execution state information into the metadata storage unit. The execution state information may cover a plurality of steps. For example, an abnormality or error encountered when generating the work running chart during task execution, a node communication abnormality or error, an abnormality or error encountered during task submission, an abnormality or error encountered while the task is running, and a system abnormality or error encountered during operation may all be recorded as execution state information. After the Job graph of the Flink SQL task is built, it is submitted through the Flink client pipeline. After the submission succeeds, the Job ID of the started task is returned through the Flink Client REST API and stored in the metadata storage unit; meanwhile, a state synchronization task may store the running state of the task in the metadata storage unit in real time. The generated input source table and output target table are also stored in the metadata storage unit, so that in the verification step, verification can be performed simply by reading the input source table and the output target table from the metadata storage unit.
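Writing and reading execution state by Job ID can be sketched as follows. An in-memory SQLite database stands in for the MySQL metadata storage unit so that the example is self-contained; the `task_state` table, its columns, and the status strings are assumptions made for illustration.

```python
import sqlite3


def open_store():
    """Create an in-memory store with one row per job (stand-in for the metadata storage unit)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_state (job_id TEXT PRIMARY KEY, status TEXT NOT NULL, detail TEXT NOT NULL)"
    )
    return conn


def save_state(conn, job_id, status, detail=""):
    # Insert the state, or overwrite the previous state of the same job.
    conn.execute(
        "INSERT OR REPLACE INTO task_state (job_id, status, detail) VALUES (?, ?, ?)",
        (job_id, status, detail),
    )
    conn.commit()


def load_state(conn, job_id):
    # Returns a (status, detail) tuple, or None if the job is unknown.
    return conn.execute(
        "SELECT status, detail FROM task_state WHERE job_id = ?", (job_id,)
    ).fetchone()


store = open_store()
save_state(store, "job-1", "RUNNING")
save_state(store, "job-1", "FAILED", "node communication error")
print(load_state(store, "job-1"))
```

Keying the table on the Job ID means that each state update replaces the previous one, so the interface always reads the latest running state.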

In practical application, the management control unit may provide a WEB interface for the user to perform input operation and display the running state, and after writing the execution state information into the metadata storage unit, the management control unit may further read the execution state information through the API interface, and display the execution state information on the WEB interface, so that the user may perform corresponding operations such as start, stop, edit, and delete.

With this system, the input source table, the output target table, and the execution state information can be stored in the metadata storage unit, so that the management control unit can read the execution state information from the metadata storage unit and display it on the interface. A user can thus obtain the task running state in real time, which facilitates observation and monitoring as well as troubleshooting when the system fails.

the data display unit is used for acquiring the data processing result from the data storage unit and displaying the data processing result on the designated interface.

In practical applications, the data storage unit is the distributed time series database storage service or system associated with the connector specified in the output target table. It generally refers to Elasticsearch (a search engine), ClickHouse (a column-oriented database), HBase (a distributed, column-oriented open-source database), OpenTSDB (a scalable time series database), InfluxDB (an open-source time series database), and other databases that support time series data storage. The results may also be output directly to message queue middleware such as Apache Kafka or Apache ActiveMQ (message-based communication middleware), to cache middleware such as Redis, or even directly to a file system.

When building the data storage unit, any one of the above may be used as the data storage engine. For example, an Elasticsearch distributed cluster, which depends on the JDK, may be installed on each node to build the data storage unit.

The data display unit can display the result data in the data storage unit on a designated interface through a tool or a system for the user to observe and analyze. For example, Grafana (an open-source program, developed in the Go language, for visualizing large-scale measurement data), backed by a MySQL database, may be used as the data display system. Various data sources, such as Elasticsearch, OpenTSDB, and ClickHouse, are added in Grafana, and the data processing result is then displayed on a designated interface by configuring Grafana's chart components.

In practical application, the corresponding calculated metrics can be configured in the data display unit to verify the accuracy of the SQL calculation logic and rules. To do so, the corresponding data source is configured under Grafana's data sources, a new panel is created in Grafana, and the metrics are configured in the panel according to the operation manual of the data source of the specified output table.

By applying the system, the data processing result can be displayed on the interface for the user to analyze, so that the friendliness of the data processing system is improved.

In a second aspect of the embodiments of the present application, a data processing method is provided, which is applied to a data processing system, where the data processing system includes a management control unit, a distributed task coordination unit, and a plurality of data calculation units, where the plurality of data calculation units includes a plurality of master data calculation units and a plurality of slave data calculation units; the method comprises the steps as shown in fig. 12:

Step S1201: the management control unit acquires the data source information, the output target information, and the task configuration operation input by a user; generates an input source table and an output target table according to the data source information and the output target information; and generates a task to be executed according to the task configuration operation, the input source table, and the output target table;

Step S1202: the distributed task coordination unit selects, from the plurality of data computing units, a target master data computing unit for executing the task to be executed;

Step S1203: the management control unit issues the task to be executed to the target master data computing unit;

Step S1204: the target master data computing unit selects a plurality of target slave data computing units from the plurality of slave data computing units according to the task to be executed; generates a subtask to be executed for each target slave data computing unit according to the task to be executed; and issues the corresponding subtask to be executed to each target slave data computing unit;

Step S1205: each target slave data computing unit acquires the data to be processed according to the received subtask to be executed, and processes the data to be processed to obtain a data processing sub-result;

Step S1206: the target master data computing unit acquires the data processing sub-results sent by the target slave data computing units, and summarizes the data processing sub-results to obtain the data processing result.
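The split, process, and summarize pattern of steps S1204 to S1206 can be sketched in a few lines. The slaves are simulated here by plain function calls in one process, and the round-robin split, the per-device counting task, and all names are hypothetical choices made for the example.

```python
def split_into_subtasks(records, slave_count):
    """Master side: deal the records out round-robin, one share per target slave."""
    return [records[i::slave_count] for i in range(slave_count)]


def slave_process(records):
    """Slave side: count events per device in its own share (the data processing sub-result)."""
    counts = {}
    for device_id in records:
        counts[device_id] = counts.get(device_id, 0) + 1
    return counts


def master_summarize(sub_results):
    """Master side: merge the sub-results into the final data processing result."""
    total = {}
    for sub in sub_results:
        for device_id, count in sub.items():
            total[device_id] = total.get(device_id, 0) + count
    return total


records = ["cam-1", "cam-2", "cam-1", "lock-7", "cam-1"]
sub_results = [slave_process(share) for share in split_into_subtasks(records, 2)]
result = master_summarize(sub_results)
print(result)
```

Counting is used because partial counts can be merged by simple addition; an aggregation that cannot be merged this way would need a different summarizing step.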

By the method, the data source information and the output target information can be configured in advance to generate the input source table and the output target table meeting the requirements, so that a user can generate related tasks to be executed only by inputting the information meeting the requirements, the technical requirements on the user are further reduced, and finally the personnel cost for data processing is reduced.

In one possible implementation, the task configuration operation includes a structured query language statement of the business; generating a task to be executed according to the task configuration operation, the input source table and the output target table; comprising the following steps:

By the method, through dynamic configuration operation, a user only needs to be familiar with SQL sentence grammar, and through grammar verification, tasks to be executed can be generated through the system, so that professional requirements on the user are reduced, and the personnel cost of data processing is reduced.

the management control unit displays an input source table, an output target table and a preset aggregation calculation function;

in response to a drag operation, defining parameters in a specified preset aggregate calculation function as specified items in an input source table and specified items in an output target table, respectively;

responding to the parameter input operation, and assigning a value to a specified parameter in a specified preset aggregation calculation function;

By applying the method of the embodiment of the application, a user can automatically complete the input of the SQL sentence only by a dragging mode, and does not need to know the grammar environment of the SQL sentence, so that the professional requirements on the user are further reduced, and the personnel cost of data processing is reduced.

In one possible implementation, the structured query language statement, the input source table, and the output target table are verified; comprising the following steps:

carrying out grammar and parameter verification on the output target table, and carrying out data output verification according to the output target table under the condition that the grammar and the parameter verification of the output target table pass;

By applying the method of the embodiment of the application, the input source table, the output target table, and the structured query language statement can be verified to ensure that the task to be executed can be generated from them, so that the accuracy of the task to be executed is improved and the data processing efficiency is improved.
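As a non-limiting illustration, the ordering of the verification, in which the data check on a table is attempted only after its syntax and parameter verification has passed, could be sketched as follows. The function names and the use of plain callables for the individual checks are assumptions made here; the application does not specify how each check is implemented.

```python
from typing import Callable, List, Optional

Check = Callable[[], bool]  # hypothetical: each check returns True on success


def verify_table(label: str, static_check: Check, data_check: Check) -> Optional[str]:
    """Run the syntax/parameter check first; run the data check only if it passes."""
    if not static_check():
        return f"{label}: syntax or parameter verification failed"
    if not data_check():
        return f"{label}: data verification failed"
    return None


def verify_task(source_static: Check, source_pull: Check,
                target_static: Check, target_output: Check,
                sql_static: Check) -> List[str]:
    """Return all verification errors; an empty list means the task may be generated."""
    errors = []
    for error in (
        verify_table("input source table", source_static, source_pull),
        verify_table("output target table", target_static, target_output),
    ):
        if error is not None:
            errors.append(error)
    if not sql_static():
        errors.append("SQL statement: syntax or parameter verification failed")
    return errors
```

Skipping the data pull or data output check when the static check fails avoids contacting an external data source with a configuration that is already known to be invalid.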

In one possible implementation, the task to be executed is a work running graph, and issuing the task to be executed to the master data computing unit and the slave data computing unit based on the selection result of the computing unit comprises:

if the verification passes, the management control unit constructs a plurality of subtasks according to the structured query language statement, the input source table, and the output target table;

generating a work running graph based on the subtasks;

sending the work running graph to the target master data computing unit in the form of a byte stream.

By applying the method of the embodiment of the application, the conventional approach, in which a Flink program or SQL statement is built into a jar package at the client and the task is then submitted and started by the client submitting that jar package, can be avoided, and stream computing is performed on the data in byte-stream form, so that the data can be analyzed in real time, useful information can be captured, and the accuracy of data processing is improved. By applying the system of the embodiment of the application, no environment for building Java and jar packages needs to be set up at the client, which reduces the cost and difficulty of construction and deployment. In addition, when the Flink cluster is deployed in streaming mode, little task operation handling needs to be considered inside the cluster: the Flink cluster concentrates on computation, and more of the operation and management work for the Flink cluster and its tasks is moved into the management control unit. The management control unit also integrates functions such as management and monitoring of the Flink cluster.
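As a non-limiting illustration, building a work running graph from subtasks and sending it as a byte stream rather than as a packaged program could be sketched as follows. This is only a stand-in under stated assumptions: the graph is modelled as a simple linear chain of subtasks, JSON is used as the wire format, and the function names are hypothetical. The actual graph structure and serialization used with the Flink cluster are not specified in this passage.

```python
import json
from typing import Dict, List


def build_work_running_graph(subtasks: List[str]) -> Dict:
    """Build a graph whose nodes are subtasks, chained in the given order (assumed layout)."""
    nodes = [{"id": i, "name": name} for i, name in enumerate(subtasks)]
    edges = [[i, i + 1] for i in range(len(subtasks) - 1)]
    return {"nodes": nodes, "edges": edges}


def to_byte_stream(graph: Dict) -> bytes:
    """Serialize the graph to bytes, as the management control unit would before sending."""
    return json.dumps(graph, sort_keys=True).encode("utf-8")


def from_byte_stream(payload: bytes) -> Dict:
    """Rebuild the graph on the receiving side (the target master data computing unit)."""
    return json.loads(payload.decode("utf-8"))


if __name__ == "__main__":
    graph = build_work_running_graph(["read_source", "aggregate", "write_target"])
    payload = to_byte_stream(graph)
    print(len(payload), from_byte_stream(payload) == graph)
```

The point of the sketch is that the client side only needs to produce bytes describing the graph, so no build environment for packaged programs is required there.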

In one possible implementation, generating the work running graph based on the subtasks comprises the following steps:

By applying the method of the embodiment of the application, the running mode of data processing can be optimized: a plurality of tasks are combined into one combined task, which avoids the traffic wasted by pulling the same data repeatedly and reduces the waste of computing resources.
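As a non-limiting illustration, combining the subtasks that read from the same data source into one combined task could be sketched as follows. The dictionary layout of a subtask and the function name are assumptions made here for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def combine_by_source(subtasks: List[Dict[str, str]]) -> Tuple[List[Dict], List[Dict]]:
    """Group subtasks by data source.

    Sources used by more than one subtask yield a combined task; the remaining
    subtasks are returned unchanged as non-combined subtasks.
    """
    groups = defaultdict(list)
    for task in subtasks:
        groups[task["source"]].append(task["name"])

    combined, standalone = [], []
    for source, names in groups.items():
        entry = {"source": source, "subtasks": names}
        (combined if len(names) > 1 else standalone).append(entry)
    return combined, standalone


if __name__ == "__main__":
    # Hypothetical subtasks: two read the same source, one reads another.
    tasks = [
        {"name": "count_events", "source": "device_events"},
        {"name": "max_temperature", "source": "device_events"},
        {"name": "count_logins", "source": "user_logins"},
    ]
    print(combine_by_source(tasks))
```

With this grouping, the data of a shared source needs to be pulled once per combined task rather than once per subtask.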

In one possible embodiment, the method further comprises:

the management control unit generates a registry of data sources of the combined task;

the target master data computing unit executes the combined task and acquires the data to be processed from the registry of the data source of the combined task.

By applying the method of the embodiment of the application, a plurality of tasks can share the same data source and the stream is reused, so that the traffic wasted by repeatedly pulling and consuming the same data is avoided and the waste of computing resources is reduced.
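As a non-limiting illustration, a registry of data sources that lets several subtasks of a combined task consume one pulled stream could be sketched as follows. The `SourceRegistry` class, its methods, and the use of in-memory callables as readers and handlers are all hypothetical; the application does not describe the registry at this level of detail.

```python
from typing import Callable, Dict, Iterable, List

Record = dict
Reader = Callable[[], Iterable[Record]]   # pulls records from a data source
Handler = Callable[[Record], None]        # one subtask's processing step


class SourceRegistry:
    """Maps each registered data source to one reader and its subscribed subtasks."""

    def __init__(self) -> None:
        self._readers: Dict[str, Reader] = {}
        self._handlers: Dict[str, List[Handler]] = {}

    def register(self, source: str, reader: Reader) -> None:
        self._readers[source] = reader
        self._handlers.setdefault(source, [])

    def subscribe(self, source: str, handler: Handler) -> None:
        self._handlers[source].append(handler)

    def run(self, source: str) -> None:
        """Pull the source once and hand every record to each subscribed subtask."""
        for record in self._readers[source]():
            for handler in self._handlers[source]:
                handler(record)
```

Because `run` invokes the reader once per source, adding further subtasks to the same source increases only the per-record fan-out, not the number of pulls.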

In one possible implementation, the data processing system further includes: a metadata storage unit; the data processing method further comprises the following steps:

the management control unit stores the input source table, the output target table, and the execution state information in the metadata storage unit.

In this way, the input source table, the output target table, and the execution state information are stored in the metadata storage unit, so that the management control unit can read the execution state information from the metadata storage unit and display it on the interface. A user can thus obtain the running state of a task in real time, which facilitates observation and monitoring as well as troubleshooting when the system fails.

the data display unit acquires the data processing result from the data storage unit and displays the data processing result on the designated interface.

By applying the method, the data processing result can be displayed on the interface for the user to analyze, which improves the user-friendliness of the data processing system.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In this specification, the embodiments are described in a related manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments. In particular, since the method embodiments are substantially similar to the system embodiments, their description is relatively brief, and for relevant details reference is made to the corresponding parts of the description of the system embodiments.

The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims

1. A data processing system, the system comprising:

2. The system of claim 1, wherein the task configuration operation comprises a structured query language statement of a business;

3. The system of claim 1, wherein the task configuration operation comprises a drag operation and a parameter input operation;

4. The system according to claim 2 or 3, wherein the management control unit is specifically configured to: perform syntax and parameter verification on the input source table and, if the syntax and parameter verification of the input source table passes, perform data pull verification according to the input source table; perform syntax and parameter verification on the output target table and, if the syntax and parameter verification of the output target table passes, perform data output verification according to the output target table; and perform syntax and parameter verification on the structured query language statement.

5. The system according to claim 2 or 3, wherein the task to be executed is a work running graph;

6. The system according to claim 5, wherein the management control unit is specifically configured to combine the subtasks of the same data source into one combined task, and to perform arrangement and special-field processing on the combined task and the non-combined subtasks to generate the work running graph.

7. The system of claim 6, wherein the management control unit is further configured to generate a registry of data sources for the combined task;

8. The system of claim 1, wherein the data processing system further comprises: a metadata storage unit;

9. The system of claim 1, wherein the system further comprises: a data storage unit and a data display unit;

10. A data processing method, applied to a data processing system, wherein the data processing system comprises a management control unit, a distributed task coordination unit, and a plurality of data computing units, the plurality of data computing units comprising a plurality of master data computing units and a plurality of slave data computing units; the method comprises the following steps: