CN117435596B - Streaming batch task integration method and device, storage medium and electronic equipment

Info

Publication number: CN117435596B
Application number: CN202311768754.7A
Authority: CN (China)
Prior art keywords: batch, task, data table, task configuration, stream
Legal status: Active (application granted)
Other languages: Chinese (zh)
Other versions: CN117435596A
Inventors: 赵荣生, 汪磊, 李垚周, 蒋文伟, 孙梓涵, 傅星楠, 詹万科, 朱一飞
Current Assignee: Hangzhou Netease Cloud Music Technology Co Ltd
Original Assignee: Hangzhou Netease Cloud Music Technology Co Ltd
Application filed by Hangzhou Netease Cloud Music Technology Co Ltd
Priority to CN202311768754.7A
Publication of application CN117435596A and granted patent CN117435596B

Classifications

    • G06F16/2282: Tablespace storage structures; management thereof
    • G06F16/2455: Query execution
    • G06F16/284: Relational databases
    • G06F8/31: Programming languages or programming paradigms
    • G06F9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/5005: Allocation of resources, e.g. of the central processing unit [CPU], to service a request
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of computer technology and discloses a stream-batch task integration method and apparatus, a storage medium, and an electronic device. When a task creation request is received, the task configuration is determined in response to the request; the task configuration comprises a stream task configuration, a batch task configuration, or a stream-batch integrated task configuration. Because the stream data table and the batch data table are uniformly defined by table metadata, the data required by the different task configurations can be obtained through a unified query, and the computing task corresponding to the task configuration, namely a stream task, a batch task, or a stream-batch integrated task, can then be performed directly. By uniformly defining the tables with table metadata, the configuration logic is simple, a unified query mode is exposed at the business layer, and maintenance and coordination are convenient; a downstream application only needs to initiate a task creation request to determine the required task configuration and obtain the computation result. Task processing logic and semantics are unified, operation is simple and efficient, the probability of error is low, and service quality is improved.

Description

Streaming batch task integration method and device, storage medium and electronic equipment
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and more particularly, to a stream-batch task integration method, a stream-batch task integration apparatus, a computer-readable storage medium, and an electronic device.
Background
This section is intended to provide a background or context for the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art merely by its inclusion in this section.
In data processing, two kinds of processing tasks can be supported: streaming tasks and batch tasks. A streaming task processes an unbounded data set that flows in in real time; generation of the unbounded data set has started but never ends, so the data must be processed immediately as it is acquired. A batch task processes a bounded data set stored offline; the bounded data set has a well-defined start and end, so processing can wait until all data has been acquired.
In certain business scenarios, unified processing of stream and batch tasks may be required. This generally means that, upstream of the data, the streaming task must be configured against a stream data table and the batch task against a batch data table separately; when the data is used downstream, the corresponding stream data table or batch data table is queried according to the task processing requirement.
In this scheme, however, the stream task and the batch task must be developed and maintained independently to produce their data tables, which is costly; their processing logic and semantics may become inconsistent and are difficult to coordinate; and the resulting stream data table and batch data table must be queried and invoked separately according to the business scenario. Downstream application is therefore complex and error-prone, which affects service quality.
Disclosure of Invention
In this context, embodiments of the present disclosure provide a stream-batch task integration method, a stream-batch task integration apparatus, a computer-readable storage medium, and an electronic device.
According to a first aspect of embodiments of the present disclosure, there is provided a stream-batch task integration method, which may include: receiving a task creation request; in response to the task creation request, determining a task configuration, the task configuration comprising any one of a stream task configuration, a batch task configuration, and a stream-batch integrated task configuration; determining corresponding table metadata according to the task configuration, the table metadata uniformly defining a stream data table and a batch data table; and performing, based on the table metadata, a computing task corresponding to the task configuration, the computing task comprising any one of a stream task, a batch task, and a stream-batch integrated task.
Optionally, in the task configuration, the stream task configuration and the batch task configuration are generated in the course of adding the stream-batch integrated task configuration through a low-code platform.
Optionally, in the task configuration, the stream task configuration, the batch task configuration, and the stream-batch integrated task configuration are each added through corresponding SQL code.
Optionally, the table metadata is generated by: receiving a table mapping operation; and, in response to the table mapping operation, mapping the corresponding stream data table and batch data table to generate the table metadata that uniformly defines the stream data table and the batch data table.
Optionally, the table metadata is generated by: determining the service to which the stream data table and the batch data table respectively belong; and, where they belong to the same service, mapping the stream data table and the batch data table to generate the table metadata that uniformly defines them.
Optionally, the task configuration is a batch task configuration, and performing the computing task corresponding to the task configuration based on the table metadata includes: obtaining the corresponding batch data table according to the table metadata; and performing the batch task corresponding to the batch task configuration based on the batch data in the batch data table. Alternatively, the task configuration is a stream task configuration, and performing the computing task corresponding to the task configuration based on the table metadata includes: obtaining the corresponding stream data table according to the table metadata; and performing the stream task corresponding to the stream task configuration based on the stream data in the stream data table.
Optionally, the task configuration is a stream-batch integrated task configuration that includes a specified switching time, and performing the computing task corresponding to the task configuration based on the table metadata includes: obtaining the corresponding stream data table and batch data table according to the table metadata; and performing the stream-batch integrated task corresponding to the stream-batch integrated task configuration based on the batch data before the switching time in the batch data table and the stream data after the switching time in the stream data table.
According to a second aspect of embodiments of the present disclosure, there is provided a stream-batch task integration apparatus, which may include: a request receiving module, configured to receive a task creation request; a configuration determining module, configured to determine a task configuration in response to the task creation request, the task configuration comprising any one of a stream task configuration, a batch task configuration, and a stream-batch integrated task configuration; a data determining module, configured to determine corresponding table metadata according to the task configuration, the table metadata uniformly defining a stream data table and a batch data table; and a data calculation module, configured to perform, based on the table metadata, a computing task corresponding to the task configuration, the computing task comprising any one of a stream task, a batch task, and a stream-batch integrated task.
Optionally, in the task configuration, the stream task configuration and the batch task configuration are generated in the course of adding the stream-batch integrated task configuration through a low-code platform.
Optionally, in the task configuration, the stream task configuration, the batch task configuration, and the stream-batch integrated task configuration are each added through corresponding SQL code.
Optionally, the apparatus further includes a data mapping module, which may include: a mapping operation receiving unit, configured to receive a table mapping operation; and a mapping operation response unit, configured to map the corresponding stream data table and batch data table in response to the table mapping operation and to generate the table metadata that uniformly defines the stream data table and the batch data table.
Optionally, the apparatus further includes a data mapping module, which may include: a service relation determining unit, configured to determine the service to which the stream data table and the batch data table respectively belong; and a service relation mapping unit, configured to map the stream data table and the batch data table where they belong to the same service, so as to generate the table metadata that uniformly defines them.
Optionally, the task configuration is a batch task configuration, and the data calculation module is specifically configured to obtain the corresponding batch data table according to the table metadata and to perform the batch task corresponding to the batch task configuration based on the batch data in the batch data table; or the task configuration is a stream task configuration, and the data calculation module is specifically configured to obtain the corresponding stream data table according to the table metadata and to perform the stream task corresponding to the stream task configuration based on the stream data in the stream data table.
Optionally, the task configuration is a stream-batch integrated task configuration that includes a specified switching time, and the data calculation module is specifically configured to obtain the corresponding stream data table and batch data table according to the table metadata and to perform the stream-batch integrated task corresponding to the stream-batch integrated task configuration based on the batch data before the switching time in the batch data table and the stream data after the switching time in the stream data table.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the above stream-batch task integration methods.
According to a fourth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the above stream-batch task integration methods via execution of the executable instructions.
According to the stream-batch task integration method of the embodiments of the present disclosure, when a task creation request is received, the task configuration, comprising a stream task configuration, a batch task configuration, or a stream-batch integrated task configuration, can be determined in response to the request. Because the stream data table and the batch data table are uniformly defined by table metadata, the data required by the different task configurations can be obtained through a unified query, and the computing task corresponding to the task configuration, namely a stream task, a batch task, or a stream-batch integrated task, can be performed directly based on the queried table metadata. By uniformly defining the stream data table and the batch data table with table metadata, the configuration logic is simple, a unified query mode is exposed at the business layer, and unified maintenance and coordination are convenient; a downstream application only needs to initiate a task creation request to determine the required task configuration and obtain the computation result. Task processing logic and semantics are unified, operation is simple and efficient, the probability of error is low, and service quality is improved.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
FIG. 1 illustrates a flow chart of the steps of a stream-batch task integration method of an embodiment of the present disclosure.
FIG. 2 illustrates a first task configuration user interface schematic of an embodiment of the present disclosure.
FIG. 3 illustrates a second task configuration user interface schematic of an embodiment of the present disclosure.
FIG. 4 shows a flow chart of steps of a method of generating table metadata in an embodiment of the present disclosure.
FIG. 5 illustrates a flowchart of steps for performing a computing task in an embodiment of the present disclosure.
FIG. 6 illustrates an architecture flow diagram of a stream-batch task integration method of an embodiment of the present disclosure.
FIG. 7 illustrates a schematic diagram of a stream-batch task integration apparatus of an embodiment of the present disclosure.
Fig. 8 shows a schematic diagram of a computer-readable storage medium of an embodiment of the present disclosure.
Fig. 9 shows a schematic diagram of an electronic device of an embodiment of the present disclosure.
Detailed Description
The principles and spirit of the present disclosure will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present disclosure and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present disclosure may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms, namely: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to embodiments of the disclosure, a stream-batch task integration method, a stream-batch task integration apparatus, a computer-readable storage medium, and an electronic device are provided.
Any number of elements in the figures are for illustration and not limitation, and any naming is used for distinction only, and not for any limiting sense.
The principles and spirit of the present disclosure are described in detail below with reference to several representative embodiments thereof.
The integrated processing of streaming tasks, batch tasks, and stream-batch tasks is the basis of business data processing. Currently, in different business scenarios, users are often required to configure streaming tasks against stream data tables and batch tasks against batch data tables separately. In practical application, depending on the business requirement, the corresponding stream data table is queried when a streaming task is executed, the corresponding batch data table is queried when a batch task is executed, and both the stream data table for the streaming task and the batch data table for the batch task are queried when a stream-batch integrated task is executed.
In this scheme, the stream task and the batch task must be developed and maintained independently, which is costly, and their processing logic and semantics may become inconsistent. The created stream data table and batch data table must be queried and invoked separately as needed; in particular, in the stream-batch integrated task, the batch data table must first be used for initialization and the stream data table must then be read in real time for incremental updates. Operation is therefore difficult for downstream consumers and errors occur easily, making service quality unstable.
These problems can currently be mitigated at different layers, but the improvement is limited, operational complexity increases further, usability decreases, and the cost of adoption is high, making wide application difficult. For example, at the computing layer, a hybrid table containing a bounded data set and an unbounded data set can be created through Flink (a framework and distributed processing engine), defining the switching conditions and policies between the bounded and unbounded data sets and the state conversion policy between stream data and batch data in the hybrid table. This approach requires the user to reconfigure all the data at the storage layer and rewrite the maintenance logic, may suffer from syntax incompatibility, has complex configuration logic and high cost, and only supports processing input streams in the data table without associating the maintained table with a data source, which is unfavorable for business analysis. At the storage layer, data can be stored in layers and slices in a "streaming" manner through Pulsar (a cloud-native distributed messaging and streaming platform), simultaneously meeting the requirements of real-time reading of stream data and concurrent multi-slice reading of batch data; alternatively, an abstraction of table metadata can be created based on Iceberg (a table format), covering the partition information, storage path, storage format, and statistics of each data table, with indexes such as file lists and file snapshots created for the table metadata, on the basis of which writing and reading of streaming and batch data can be supported. However, these storage schemes must be further adapted to the business layer, require the development and design of a task model, are difficult to make compatible with existing task configurations, and are costly to use.
In the embodiments of the present disclosure, usability of stream-batch integration is achieved at the model layer based on the business scenario: unified table metadata is established between the stream data table and the batch data table, so that stream tasks and batch tasks no longer need to be configured and maintained separately, and unified configuration, invocation, and maintenance improve processing efficiency and reduce the probability of errors. On this basis, the stream data table and the batch data table can further be uniformly defined through either manual or automatic mapping, and task configuration can be performed through SQL code or a low-code platform, so that an appropriate configuration mode is selected according to business requirements and configuration conditions, balancing personalization against complexity. Task configuration and data table definition remain highly autonomous while their underlying code can be shielded, and a single simple configuration can execute different computing tasks, significantly reducing maintenance cost and improving the usability and consistency of stream-batch task integration.
Having described the basic principles of the present invention, various non-limiting embodiments of the invention are described in detail below.
Exemplary application scenarios
It should be noted that the following application scenarios are only shown for facilitating understanding of the spirit and principles of the present invention, and embodiments of the present disclosure are not limited in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
The stream-batch task integration method of the embodiments of the present disclosure can be applied to various application scenarios involving business data processing.
One application scenario may involve a music platform. Typically, in such a scenario, a user interacts with different services, where a service may be a specific song, album, MV (Music Video), or music station, and the interactions may include searching, browsing, playing, commenting, collecting, and so on. On the music platform, these interactions may be counted in order to analyze the popularity of specific services, build profiles of the related users, or make service recommendations based on popularity analysis, user profiles, and the like. In such interaction counting, a computation that reflects the real-time state of a service is a streaming task; the streaming task processes stream data generated in real time, in order. For example, if user 1 likes song A at 14:00, the streaming task increments the like count of song A by 1; if user 2 likes song A at 14:01, the like count of song A is incremented by 1 again. A computation that reflects the state of a service over a period of time is a batch task; the batch task can process, in parallel, the batch data generated within that period, for example counting the number of likes of song A in one day or the total number of songs liked by user 1 in one day. A stream-batch integrated task generally reads historical batch data and then processes real-time stream data in time, for example initializing from batch data when synchronizing user data and then performing incremental updates based on the user's real-time interactions with the services. Business analysis with high real-time requirements, such as real-time popularity statistics and instantly refreshed audio and video recommendations, can use streaming tasks, while analysis over large data volumes that tolerates delay, such as an annual listening report or daily refreshed song recommendations, can use batch tasks. In this application scenario, the stream-batch task integration method of the embodiments of the present disclosure may be used: unified table metadata defines the stream data tables and batch data tables of different services, the corresponding task configuration is determined based on a task creation request, and the corresponding computing task is executed to obtain the result of the stream task, batch task, or stream-batch integrated task.
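To make the counting example concrete, the same like-count aggregation could be written once against a unified stream-batch table and executed either as a streaming task or as a batch task. The following is a minimal sketch only; the table and field names (user_action_hybrid, song_id, action_type) are illustrative assumptions rather than part of this embodiment.
```
-- Minimal sketch: like counts per song over a unified stream-batch table.
-- Table and field names are illustrative assumptions.
SELECT
  song_id,
  COUNT(*) AS like_cnt
FROM catalog.db.user_action_hybrid
WHERE action_type = 'like'
GROUP BY song_id;
-- Executed as a streaming task, this maintains a continuously updated like count per song;
-- executed as a batch task over one day of data, it yields the daily like totals.
```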
Another application scenario may involve commodity transactions, such as processing real-time transactions and analyzing total orders, or map navigation, such as obtaining the state and position data reported by devices in real time together with historical travel tracks. The scenarios above are only partial examples; those skilled in the art may extend the application based on the requirements of streaming tasks, batch tasks, and stream-batch integrated task processing, and the embodiments of the present disclosure do not specifically limit this.
Exemplary method
A stream-batch task integration method according to an exemplary embodiment of the present disclosure will be described below with reference to fig. 1 in conjunction with the above application scenarios.
As illustrated in fig. 1, the stream-batch task integration method according to the exemplary embodiment of the present disclosure may include the following steps 101 to 104:
step 101, receiving a task creation request.
In the embodiments of the present disclosure, the task creation request may indicate a service, may indicate a specific task configuration, such as the type and version of task configuration to use, or may carry the information required to create a new task configuration. The present disclosure does not specifically limit this; those skilled in the art may define the information contained in the task creation request according to actual business requirements and configuration conditions.
Step 102, in response to the task creation request, determining a task configuration, wherein the task configuration comprises any one of a stream task configuration, a batch task configuration and a stream batch integrated task configuration.
In the embodiments of the present disclosure, when a task creation request is received, the task configuration required for executing the computing task can be determined in response to the request. The task configuration can be selected among already-created task configurations based on an indication in the task creation request, or newly created based on the information contained in the request. The task configuration may include a stream task configuration, a batch task configuration, a stream-batch integrated task configuration, and the like.
Step 103, determining corresponding table metadata according to the task configuration, wherein the table metadata are used for uniformly defining a stream data table and a batch data table.
In the embodiments of the present disclosure, the stream data table and the batch data table can be mapped to each other, so that a stream data table and a batch data table having a mapping relationship are uniformly defined by the table metadata. The mapping relationship may be custom-defined, derived from a service relationship, or derived from statistical requirements; the embodiments of the present disclosure do not specifically limit the manner in which the mapping between the stream data table and the batch data table is established.
After the stream data table and the batch data table are uniformly defined by the table metadata, the resulting unified stream-batch table can point to the stream data table required by a stream task, to the batch data table required by a batch task, or to both the stream data table and the batch data table required by a stream-batch integrated task. Task configurations can be built on the table metadata, and on the basis of the unified stream-batch table defined by the table metadata, different task configurations can point to one data table, or switch seamlessly between the stream data table and the batch data table according to the computing task indicated by the task creation request. On this basis, determining table metadata according to the task configuration means determining the uniformly defined stream-batch table and selecting the corresponding data table according to the computing task to be executed.
Step 104, performing a calculation task corresponding to the task configuration based on the table metadata, wherein the calculation task comprises any one of a streaming task, a batch task and a streaming batch integrated task.
In the embodiments of the present disclosure, once the data table required by the task configuration has been determined from the table metadata, the corresponding computing task can be executed according to the task configuration determined by the task creation request: a streaming task computes over the stream data in real time; a batch task computes over the batch data offline; and a stream-batch integrated task first performs offline initialization statistics based on the batch data and then switches to real-time computation based on the stream data. Stream-batch data integration thus enables flexible execution of, and switching between, stream tasks and batch tasks at the business layer, unifies processing logic, and reduces development and maintenance cost.
Further, the task configuration can be created in different ways depending on business requirements, configuration conditions, and the like. For example, manual configuration can be used when flexibility and personalization requirements are high, while automated configuration can be used when understanding of the underlying technology is limited or when configuration and usage cost must be reduced. Those skilled in the art may choose among these approaches and the specific configuration means, such as the language and format used for manual configuration or the tools used for automated configuration; the embodiments of the present disclosure do not specifically limit this.
In an alternative method embodiment of the present disclosure, in the task configuration, the stream task configuration and the batch task configuration are generated during the addition of the stream batch integrated task configuration by the low code platform.
In the embodiments of the present disclosure, an automated means of task configuration can be provided through a low-code platform (Low-Code Development Platform), a visual development tool that allows development with zero or little code, in which task configuration can be completed conveniently by dragging components and filling in forms in a graphical interface. In the low-code platform, the stream task configuration and the batch task configuration can be generated automatically while the stream-batch integrated task configuration is being configured, which further simplifies task configuration on top of lowering the configuration and usage threshold, avoids configuring each type of computing task separately every time, and improves efficiency.
FIG. 2 shows a first task configuration user interface of an exemplary embodiment of the present disclosure. As shown in fig. 2, the "task information" option includes the environment options "development environment", "test environment", and "online environment". The user interface of the "development environment" option provides a component selection area a and a task configuration area b. Component selection area a includes a component search box ("search component") and a component example area. The component example area provides examples of different types of components, such as "input node", "model node", and "output node": the "input node" examples include components such as "data stream", "unified stream-batch table", "batch data table", and "stream data table"; the "model node" examples include a "relational model" component; and the "output node" examples include "node type 1", "node type 2", "node type 3", and so on.
Task configuration can then be performed by dragging a component shown in component selection area a into task configuration area b. As shown in fig. 2, dragging the "unified stream-batch table" component into task configuration area b generates a blank "unified stream-batch table" task configuration; as an input node, the "unified stream-batch table" component configures the data input of the stream-batch integrated task when it executes.
Furthermore, stages such as data computation and result output can be configured through other components, so as to create a complete stream-batch integrated task flow.
FIG. 3 shows a second task configuration user interface of an exemplary embodiment of the present disclosure. As shown in fig. 3, selecting the blank "unified stream-batch table" task configuration component in task configuration area b expands a corresponding configuration input page c, on which the configuration of the "unified stream-batch table" can be entered: the table metadata can be indicated and the configuration type generated, the configuration type including an offline configuration corresponding to the batch data table and a real-time configuration corresponding to the stream data table. The page may also include "field extension" and "field mapping" items for personalized data extension and association with data sources, such as adding a dimension table and custom output fields under "field extension" and adding mappings between source table fields and model fields under "field mapping"; and window time configuration, including the setting of the event time field, time unit, maximum tolerated delay, and so on.
The above component examples and configuration page items are merely examples; those skilled in the art may adopt other components and configuration page items based on business requirements, configuration conditions, and the like, which are not specifically limited in the embodiments of the present disclosure.
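In SQL terms, the window time configuration described above roughly corresponds to declaring an event-time field with a watermark that tolerates a maximum delay. The following is a minimal sketch under that assumption; the table and field names are illustrative, and the connector options are abbreviated in the style used elsewhere in this description.
```
-- Minimal sketch of an event-time / maximum-tolerated-delay setting in (Flink-style) SQL.
-- Table and field names are illustrative assumptions.
CREATE TABLE catalog.db.play_event (
  `song_id` BIGINT,
  `user_id` BIGINT,
  `create_time` TIMESTAMP(3),
  -- create_time is the event time field; up to 30 seconds of lateness is tolerated
  WATERMARK FOR `create_time` AS `create_time` - INTERVAL '30' SECOND
) WITH (
  'connector.type'='kafka'   -- remaining connector options omitted for brevity
);
```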
In an alternative method embodiment of the present disclosure, in the task configuration, the stream task configuration, the batch task configuration, and the stream batch integrated task configuration are added through corresponding SQL codes, respectively.
In the embodiments of the present disclosure, SQL (Structured Query Language) code can be used to operate on and define data. With a deep understanding of the underlying technology, SQL code allows the configurations of the different computing tasks for different services to be written more flexibly and in a more personalized way.
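For instance, a stream task, a batch task, and a stream-batch integrated task could each be added as its own SQL statement. The following is a minimal sketch only, reusing the order tables defined later in this description; the sink table names (order_realtime, order_offline, order_all) are illustrative assumptions.
```
-- Minimal sketch of task configuration via SQL; sink table names are illustrative assumptions.

-- Stream task configuration: real-time computation over the stream data table.
INSERT INTO test_catalog.db.order_realtime
SELECT `order_id`, `total_price` FROM catalog.db.order_stream;

-- Batch task configuration: offline computation over the batch data table.
INSERT INTO test_catalog.db.order_offline
SELECT `order_id`, `total_price` FROM catalog.db.order_batch;

-- Stream-batch integrated task configuration: a single statement over the unified table.
INSERT INTO test_catalog.db.order_all
SELECT `order_id`, `total_price` FROM catalog.db.order_hybrid;
```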
Furthermore, the table metadata can be generated before the task configuration is created, and different approaches can be chosen based on business requirements, configuration conditions, and the like. For example, when the service relationship logic is complex, the batch data table, the stream data table, and their field mappings can be mapped manually, as for table data that has passed through multiple layers of data warehouse processing; when the service relationship is simpler, the batch data table and the stream data table, such as data tables imported from a database or loaded from raw logs, can be mapped automatically to generate uniformly defined table metadata. Those skilled in the art may choose the specific mapping manner and defining means; the embodiments of the present disclosure do not limit this.
FIG. 4 shows the step flow of the table metadata generation method of an exemplary embodiment of the present disclosure.
In an alternative method embodiment of the present disclosure, as shown in fig. 4, the foregoing table metadata may be generated by the following steps 401 to 402:
step 401, receiving a table mapping operation.
Step 402, mapping the corresponding stream data table and the batch data table in response to the table mapping operation, so as to generate the table metadata for uniformly defining the stream data table and the batch data table.
In the embodiments of the present disclosure, a manual table mapping operation may be received, in which the batch data table, the stream data table, and the mapping relationship of the fields between the two tables are specified, so that, in response to the table mapping operation, the table metadata that uniformly defines the stream data table and the batch data table is generated.
In the embodiments of the present disclosure, storage based on a relational database generally involves two data synchronization modes: full synchronization and incremental synchronization. As shown in fig. 5, full synchronization of the relational database synchronizes the business relational data, such as the music platform's song library data and member data, to HDFS (a distributed file system) at a predetermined period; such business relational data has the characteristics of batch data and can support batch task computation. Incremental synchronization subscribes to incremental change data through a Kafka (message system) message queue; the incremental change data has the characteristics of stream data and can support stream task computation. In the relational database context, the generation of table metadata that uniformly defines a stream data table and a batch data table can therefore be supported.
For example, when Kafka is used in the storage layer to store stream data tables and HDFS is used to store batch data tables, a specified stream data table 1 can be determined from Kafka and a specified batch data table 1 from HDFS in response to the table mapping operation, and field mapping can be performed between stream data table 1 and batch data table 1 to generate the table metadata of unified stream-batch table 1.
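The table metadata produced by such a manual mapping can be expressed in the same form as the unified table definitions given later in this description. The following is a minimal sketch only; stream_table_1, batch_table_1, hybrid_table_1 and the field names are placeholders, not part of this embodiment.
```
-- Minimal sketch of table metadata resulting from a manual table mapping operation.
-- Table and field names are placeholders.
CREATE TABLE catalog.db.hybrid_table_1 (
  `id` BIGINT,
  `amount` DOUBLE
) WITH (
  'connector.type'='hybrid',
  'connector.hybrid.stream'='catalog.db.stream_table_1',
  'connector.hybrid.batch'='catalog.db.batch_table_1',
  'connector.hybrid.schema.mapping'='id:id,amount:amount'
);
```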
In an alternative method embodiment of the present disclosure, as shown in fig. 4, the foregoing table metadata may be generated by the following steps 403 to 404:
step 403, determining the services to which the stream data table and the batch data table respectively belong.
Step 404, where they belong to the same service, mapping the stream data table and the batch data table to generate the table metadata that uniformly defines the stream data table and the batch data table.
In the embodiments of the present disclosure, the service relationship can also serve as the basis for automatic mapping: when a stream data table and a batch data table are newly added, the stream data table and batch data table of the same service can be uniformly defined according to the service they belong to, yielding the table metadata of a unified stream-batch table that supports stream tasks, batch tasks, and stream-batch integrated tasks.
For example, when Kafka is used in the storage layer to store stream data tables and HDFS is used to store batch data tables, and stream data table 2 is newly added in Kafka and batch data table 2 is newly added in HDFS, with stream data table 2 and batch data table 2 coming from the same library, stream data table 2 and batch data table 2 are automatically mapped, generating the table metadata of unified stream-batch table 2.
Taking the simplified library table as an example, the following is shown:
```
CREATE TABLE song_db.song_info(
song_id bigint(32),
name varchar(128),
artist_id bigint(32),
ext_info varchar(256)
);
```
When the stream data table song_info_incr is newly added in Kafka and the batch data table song_info_all is newly produced in HDFS for this simplified library table, they are automatically mapped, generating metadata such as the following:
```
SELECT song_id, name, artist_id FROM song_db.song_info_incr
```
It should be noted that the above storage manner is only an example; in the embodiments of the present disclosure, the underlying storage may keep the stream data table and the batch data table separate or may store them in a unified manner. The key point is that the stream data table and the batch data table in the storage layer are uniformly defined as table metadata for the computing engine to use, which shields the details of the computing layer, requires no adaptation of the underlying structure of the storage layer, presents the user with a table view from a unified perspective, and allows automatic, flexible switching between stream tasks and batch tasks.
In the embodiments of the present disclosure, in response to a task creation request, the request can be parsed to query the corresponding table metadata, and the corresponding table data can then be selected for the computing task according to the task configuration: the batch data table is selected when the task configuration is a batch task configuration, the stream data table is selected when it is a stream task configuration, and both the batch data table and the stream data table are selected when it is a stream-batch integrated configuration. Batch tasks, stream tasks, and stream-batch integrated tasks can all be supported by referencing the table metadata, and thus computing engines such as Flink and Spark (an open-source cluster computing framework) can be supported.
For example, based on the foregoing example, a batch data table in a relational database may be defined as follows:
CREATE TABLE catalog.db.order_batch (
`order_id` BIGINT,
`total_price` DOUBLE,
`create_time` TIMESTAMP,
`order_owner_id` BIGINT
) WITH (
'connector.type'='hdfs',
'connector.hdfs.location'='hdfs://user/hive/warehouse/order'
);
The stream data table may be defined as follows:
CREATE TABLE catalog.db.order_stream (
`order_id` BIGINT,
`total_price` DOUBLE,
`create_time` TIMESTAMP,
`order_owner_id` BIGINT
) WITH (
'connector.type'='kafka',
'connector.kafka.topic'='order_topic',
'connector.kafka.bootstrap.server'='kafka_cluster1'
);
The unified stream-batch table is defined as follows:
CREATE TABLE catalog.db.order_hybrid (
`order_id` BIGINT,
`total_price` DOUBLE,
`create_time` TIMESTAMP,
`order_owner_id` BIGINT
) WITH (
'connector.type'='hybrid',
'connector.hybrid.stream'='catalog.db.order_stream',
'connector.hybrid.batch'='catalog.db.order_batch',
'connector.hybrid.schema.mapping'='order_id:order_id,total_price:total_price,create_time:create_time,order_owner_id:order_owner_id'
);
the flow of steps for performing a computing task in an exemplary embodiment of the present disclosure as illustrated in fig. 5.
In an alternative method embodiment of the present disclosure, when the task is configured as a batch task configuration, performing, based on the table metadata, a computing task corresponding to the task configuration may include the following steps 501 to 502:
step 501, a corresponding batch data table is obtained according to the table metadata.
Step 502, performing batch tasks corresponding to batch task configuration based on batch data in a batch data table.
In the embodiments of the present disclosure, when the task configuration is a batch task configuration, the batch data table can be selected automatically based on the table metadata that uniformly defines the batch data table and the stream data table, and the batch task corresponding to the batch task configuration is performed by computing over the batch data in the batch data table.
For example, the compute engine references the table metadata as follows:
INSERT INTO test_catalog.db.table_test
SELECT
`order_id`,
`total_price`
FROM catalog.db.order_hybrid
WHERE `order_id` is not null;
in the case where the task is configured as a batch task, then the batch data table is referenced as follows:
catalog.db.order_batch
so that the corresponding batch task is performed based on the batch data in the batch data table.
In an optional method embodiment of the present disclosure, when the task is configured as a streaming task, performing the computing task corresponding to the task configuration based on the table metadata may include the following steps 503 to 504:
step 503, obtaining a corresponding stream data table according to the table metadata.
Step 504, based on the stream data in the stream data table, performing the stream task corresponding to the stream task configuration.
In the embodiments of the present disclosure, when the task configuration is a stream task configuration, the stream data table can be selected automatically based on the table metadata that uniformly defines the batch data table and the stream data table, and the stream task corresponding to the stream task configuration is performed by computing over the stream data in the stream data table.
For example, the compute engine references the table metadata as follows:
INSERT INTO test_catalog.db.table_test
SELECT
`order_id`,
`total_price`
FROM catalog.db.order_hybrid
WHERE `order_id` is not null;
in the case where the task is configured as a streaming task, then the reference streaming data table is as follows:
catalog.db.order_stream
thus, based on the stream data in the stream data table, the corresponding stream task is performed.
When the task is configured as the stream task configuration, the computing engine can acquire stream data from the stream data table based on the table metadata to perform computation, and execute the stream task corresponding to the stream task configuration.
In an alternative method embodiment of the present disclosure, the task configuration is a stream-batch integrated task configuration that includes a specified switching time, and performing the computing task corresponding to the task configuration based on the table metadata may include the following steps 505 to 506:
Step 505, obtaining the corresponding stream data table and batch data table according to the table metadata.
Step 506, performing the stream-batch integrated task corresponding to the stream-batch integrated task configuration based on the batch data before the switching time in the batch data table and the stream data after the switching time in the stream data table.
In the embodiments of the present disclosure, when the task configuration is a stream-batch integrated task configuration, seamless switching between the stream task and the batch task is supported. Specifically, based on the table metadata that uniformly defines the batch data table and the stream data table, the stream data table and the batch data table can be selected automatically; the stream-batch integrated task corresponding to the stream-batch integrated task configuration is then performed based on the specified switching time, computing over the batch data before the switching time in the batch data table and the stream data after the switching time in the stream data table.
For example, if the specified switching time is 00:00 on 6 June 2023, the computing engine references the table metadata with that switching time.
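A minimal sketch of such a reference is given below; the switching-time option shown ('connector.hybrid.switch.time') is an assumed extension of the hybrid connector used in this description, not a standard connector option.
```
-- Minimal sketch only: 'connector.hybrid.switch.time' is an assumed option, not a standard one.
CREATE TABLE catalog.db.order_hybrid_switched (
  `order_id` BIGINT,
  `total_price` DOUBLE,
  `create_time` TIMESTAMP,
  `order_owner_id` BIGINT
) WITH (
  'connector.type'='hybrid',
  'connector.hybrid.stream'='catalog.db.order_stream',
  'connector.hybrid.batch'='catalog.db.order_batch',
  'connector.hybrid.switch.time'='2023-06-06 00:00:00'
);
-- Records before the switching time come from the batch data table,
-- and records after it come from the stream data table.
```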
As described above, seamless switching between the batch task and the stream task within the stream-batch integrated task can be supported: the corresponding stream data table and batch data table are obtained based on the table metadata, and the stream-batch integrated task corresponding to the stream-batch integrated task configuration is performed based on the batch data before 00:00 on 6 June 2023 in the batch data table and the stream data after 00:00 on 6 June 2023 in the stream data table.
FIG. 6 further illustrates an architecture flow diagram of a stream-batch task integration method according to an exemplary embodiment of the present disclosure. As shown in fig. 6, the method involves a storage layer 601, a metadata center 602, a computing layer 603, and a model layer 604.
Model layer 604 may employ a low-code platform for automated task configuration and for determining, in response to a task creation request, the task configuration to be executed.
The computing layer 603 determines the table metadata at the metadata center 602 based on the determined task configuration, and obtains the data table that the metadata center 602 automatically selects based on the task configuration and the table metadata. The computing layer 603 may be implemented with a computing engine such as Flink or Spark.
The computing layer 603 also reads the corresponding data table from the storage layer 601 and executes the computing task corresponding to the task configuration based on it: for a batch task configuration, it obtains the determined batch data table and executes the batch task; for a stream task configuration, it obtains the determined stream data table and executes the stream task; and for a stream-batch integrated task configuration, it obtains the determined stream data table and batch data table and executes the stream-batch integrated task.
The metadata center 602 determines the batch data tables and stream data tables from the storage layer 601, for example determining the column names of the different stream data tables and of the different batch data tables, and then performs automatic mapping based on the service relationship to obtain the table metadata that uniformly defines the stream data tables and batch data tables. As shown in fig. 6, "batch table column name 2" and "stream table column name 1" belong to the same song library and are automatically mapped to obtain "unified stream-batch table metadata 1"; "batch table column name 1" and "stream table column name 3" belong to another song library and are automatically mapped to obtain "unified stream-batch table metadata 2"; and so on. The metadata center also automatically returns the corresponding data table to the computing layer 603 based on the task configuration provided by the computing layer 603.
The storage layer 601 may store the stream data tables and batch data tables separately or in a unified manner, and may be implemented with Kafka and HDFS, or with Pulsar, Iceberg, and the like.
According to the stream-batch task integration method of the embodiments of the present disclosure, when a task creation request is received, the task configuration, comprising a stream task configuration, a batch task configuration, or a stream-batch integrated task configuration, can be determined in response to the request. Because the stream data table and the batch data table are uniformly defined by table metadata, the data required by the different task configurations can be obtained through a unified query, and the computing task corresponding to the task configuration, namely a stream task, a batch task, or a stream-batch integrated task, can be performed directly based on the queried table metadata. By uniformly defining the stream data table and the batch data table with table metadata, the configuration logic is simple, a unified query mode is exposed at the business layer, and unified maintenance and coordination are convenient; a downstream application only needs to initiate a task creation request to determine the required task configuration and obtain the computation result. Task processing logic and semantics are unified, operation is simple and efficient, the probability of error is low, and service quality is improved.
Exemplary apparatus
Having described the streaming batch task integration method of the exemplary embodiment of the present disclosure, a streaming batch task integration apparatus of the exemplary embodiment of the present disclosure will next be described with reference to fig. 7.
It should be noted that other specific details of each functional module of the streaming batch task integration apparatus in the embodiment of the present disclosure have already been described in the above embodiment of the streaming batch task integration method, and are not repeated here.
Fig. 7 illustrates a streaming batch task integration apparatus 700 of an exemplary embodiment of the present disclosure, including:
the request receiving module 701 is configured to receive a task creation request.
The configuration determining module 702 is configured to determine, in response to the task creation request, a task configuration including any one of a stream task configuration, a batch task configuration, and a stream-batch integrated task configuration.
The data determining module 703 is configured to determine corresponding table metadata according to the task configuration, where the table metadata is used to uniformly define the stream data table and the batch data table.
The data calculation module 704 is configured to carry out the computing task corresponding to the task configuration based on the table metadata, where the computing task includes any one of a stream task, a batch task, and a stream-batch integrated task.
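For illustration only, the cooperation of the request receiving module 701, the configuration determining module 702, the data determining module 703, and the data calculation module 704 may be sketched in Python as follows; the method names and collaborator interfaces are assumptions for exposition rather than the actual apparatus implementation:

```python
class StreamBatchTaskIntegrationApparatus:
    """Sketch of the module pipeline: request -> task configuration ->
    table metadata -> computation result."""

    def __init__(self, request_receiver, config_determiner, data_determiner, data_calculator):
        self.request_receiver = request_receiver      # module 701
        self.config_determiner = config_determiner    # module 702
        self.data_determiner = data_determiner        # module 703
        self.data_calculator = data_calculator        # module 704

    def handle(self, raw_request):
        request = self.request_receiver.receive(raw_request)
        task_config = self.config_determiner.determine(request)
        table_metadata = self.data_determiner.determine(task_config)
        return self.data_calculator.calculate(task_config, table_metadata)
```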
In an optional apparatus embodiment of the present disclosure, in the task configuration, the stream task configuration and the batch task configuration are generated during the addition of the stream-batch integrated task configuration through a low-code platform.
In an optional apparatus embodiment of the present disclosure, in the task configuration, the stream task configuration, the batch task configuration, and the stream-batch integrated task configuration are each added through corresponding SQL code.
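For illustration only, the three configurations added through SQL code might resemble the following hypothetical snippets, held here as Python string constants; the table and column names are invented for exposition and do not correspond to any table of the disclosure:

```python
# Hypothetical stream task configuration: continuous aggregation over a stream table.
STREAM_TASK_SQL = """
INSERT INTO play_count_rt
SELECT song_id, COUNT(*) AS plays
FROM play_events_stream
GROUP BY song_id;
"""

# Hypothetical batch task configuration: the same aggregation over a batch table.
BATCH_TASK_SQL = """
INSERT INTO play_count_daily
SELECT song_id, COUNT(*) AS plays
FROM play_events_batch
GROUP BY song_id;
"""

# Hypothetical stream-batch integrated task configuration: the query is written once
# against the unified table defined by the table metadata, and the platform decides
# whether to read the stream table, the batch table, or both.
INTEGRATED_TASK_SQL = """
INSERT INTO play_count_unified
SELECT song_id, COUNT(*) AS plays
FROM play_events
GROUP BY song_id;
"""
```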
In an optional apparatus embodiment of the present disclosure, the apparatus further comprises a data mapping module, and the data mapping module may comprise:
a mapping operation receiving unit for receiving a table mapping operation;
and the mapping operation response unit is used for mapping the corresponding stream data table and the batch data table in response to the table mapping operation and generating the table metadata for uniformly defining the stream data table and the batch data table.
In an optional apparatus embodiment of the present disclosure, the apparatus further comprises a data mapping module, and the data mapping module may comprise:
a service relation determining unit for determining the service to which the stream data table and the batch data table respectively belong;
and the service relation mapping unit is used for mapping the stream data table and the batch data table under the condition of belonging to the same service to generate the table metadata for uniformly defining the stream data table and the batch data table.
In an optional apparatus embodiment of the present disclosure, the task configuration is a batch task configuration, and the data calculation module 704 is specifically configured to obtain a corresponding batch data table according to the table metadata, and to carry out the batch task corresponding to the batch task configuration based on the batch data in the batch data table;
or, the data calculation module 704 is specifically configured to obtain a corresponding stream data table according to the table metadata, and to carry out the stream task corresponding to the stream task configuration based on the stream data in the stream data table.
In an optional apparatus embodiment of the present disclosure, the task configuration is a stream-batch integrated task configuration, where the stream-batch integrated task configuration includes a specified switching time, and the data calculation module 704 is specifically configured to obtain a corresponding stream data table and batch data table according to the table metadata, and to carry out the stream-batch integrated task corresponding to the stream-batch integrated task configuration based on the batch data in the batch data table before the switching time and the stream data in the stream data table after the switching time.
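For illustration only, the switching behaviour may be sketched in Python as follows, assuming that each record carries a hypothetical event_time field; the data values are invented:

```python
from datetime import datetime

def integrated_records(batch_rows, stream_rows, switch_time: datetime):
    """Yield batch data whose event time is before the switching time, then
    stream data whose event time is at or after the switching time (sketch only)."""
    for row in batch_rows:
        if row["event_time"] < switch_time:
            yield row
    for row in stream_rows:
        if row["event_time"] >= switch_time:
            yield row

# Hypothetical usage: batch data covers history, stream data covers the latest events.
switch = datetime(2023, 12, 20, 0, 0)
batch = [{"song_id": 1, "event_time": datetime(2023, 12, 19, 23, 59)}]
stream = [{"song_id": 1, "event_time": datetime(2023, 12, 20, 0, 5)}]
print(list(integrated_records(batch, stream, switch)))
```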
According to the streaming batch task integration apparatus of the embodiment of the present disclosure, when a task creation request is received, the task configuration can be determined in response to the request, including the stream task configuration, the batch task configuration, and the stream-batch integrated task configuration. Because the stream data table and the batch data table are uniformly defined by the table metadata, the data required by different task configurations can be obtained through a unified query, and the computing task corresponding to the task configuration, including the stream task, the batch task, or the stream-batch integrated task, can be carried out directly based on the queried table metadata. The apparatus uses table metadata to uniformly define the stream data table and the batch data table, so the configuration logic is simple; a unified query mode is provided externally at the service level, which facilitates unified maintenance and coordination; a downstream application only needs to initiate a task creation request to determine the required task configuration and obtain the computing result; the task processing logic and criteria are unified; the operation is simple, efficient, and less error-prone; and the quality of service is improved.
It should be noted that although several modules or units of the streaming batch task integration apparatus are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Exemplary computer-readable storage Medium
Computer-readable storage media of exemplary embodiments of the present disclosure are described below.
In the present exemplary embodiment, with reference to fig. 8, a program product 800 for implementing the above-described method according to an exemplary embodiment of the present disclosure is described, which may employ a portable compact disc read-only memory (CD-ROM) including program code, and may be run on a device such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product 800 may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Exemplary electronic device
An electronic device of an exemplary embodiment of the present disclosure is described with reference to fig. 9.
The electronic device 900 shown in fig. 9 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of the electronic device 900 may include, but are not limited to: at least one processing unit 910, at least one storage unit 920, a bus 930 connecting the different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.
Wherein the storage unit stores program code that is executable by the processing unit 910 such that the processing unit 910 performs steps according to various exemplary embodiments of the present disclosure described in the above-described "exemplary methods" section of the present specification. For example, the processing unit 910 may perform method steps as shown in fig. 1, etc.
The storage unit 920 may include volatile storage units such as a random access storage unit (RAM) 921 and/or a cache storage unit 922, and may further include a read only storage unit (ROM) 923.
The storage unit 920 may also include a program/utility 924 having a set (at least one) of program modules 925, such program modules 925 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 930 may include a data bus, an address bus, and a control bus.
The electronic device 900 may also communicate with one or more external devices 1000 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.) via an input/output (I/O) interface 950. The electronic device 900 also includes the display unit 940, which is connected to the input/output (I/O) interface 950 for display. Also, the electronic device 900 may communicate with one or more networks such as a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet, through a network adapter 960. As shown, the network adapter 960 communicates with other modules of the electronic device 900 over the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 900, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It should be noted that while several modules or sub-modules of the apparatus are mentioned in the detailed description above, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided into a plurality of units/modules to be embodied.
Furthermore, although the operations of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects imply that features in these aspects cannot be combined to advantage; this division is made for convenience of description only. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (16)

1. A streaming batch task integration method, the method comprising:
receiving a task creation request;
determining a task configuration in response to the task creation request, wherein the task configuration comprises any one of a stream task configuration, a batch task configuration, and a stream-batch integrated task configuration; the task configuration is constructed based on table metadata, so that different task configurations are based on a stream-batch integrated table uniformly defined by the table metadata, and a corresponding stream data table or batch data table is selected, or the stream data table and the batch data table are switched, according to the computing task indicated by the task creation request;
determining corresponding table metadata according to the task configuration, wherein the table metadata is used to uniformly define a stream data table and a batch data table having a mapping relation, and the table metadata is constructed when the stream data table and the batch data table are newly added;
and carrying out a computing task corresponding to the task configuration based on the table metadata, wherein the computing task comprises any one of a stream task, a batch task, and a stream-batch integrated task.
2. The method of claim 1, wherein in the task configuration, the stream task configuration and the batch task configuration are generated during addition of the stream-batch integrated task configuration through a low-code platform.
3. The method of claim 1, wherein in the task configuration, the stream task configuration, the batch task configuration, and the stream-batch integrated task configuration are each added through corresponding SQL code.
4. The method of claim 1, wherein the table metadata is generated by:
receiving a table mapping operation;
and mapping the corresponding stream data table and the batch data table in response to the table mapping operation, and generating the table metadata which uniformly defines the stream data table and the batch data table.
5. The method of claim 1, wherein the table metadata is generated by:
determining services to which the stream data table and the batch data table respectively belong;
and under the condition of belonging to the same service, mapping the stream data table and the batch data table to generate the table metadata which uniformly define the stream data table and the batch data table.
6. The method of claim 1, wherein the task configuration is a batch task configuration, and the carrying out a computing task corresponding to the task configuration based on the table metadata comprises:
acquiring a corresponding batch data table according to the table metadata;
and carrying out the batch task corresponding to the batch task configuration based on the batch data in the batch data table;
or, the task configuration is a stream task configuration, and the carrying out a computing task corresponding to the task configuration based on the table metadata comprises:
acquiring a corresponding stream data table according to the table metadata;
and carrying out the stream task corresponding to the stream task configuration based on the stream data in the stream data table.
7. The method of claim 1, wherein the task configuration is a stream-batch integrated task configuration, the stream-batch integrated task configuration comprising a specified switching time, and the carrying out a computing task corresponding to the task configuration based on the table metadata comprises:
acquiring a corresponding stream data table and a corresponding batch data table according to the table metadata;
and carrying out the stream-batch integrated task corresponding to the stream-batch integrated task configuration based on the batch data in the batch data table before the switching time and the stream data in the stream data table after the switching time.
8. A streaming batch task integration apparatus, the apparatus comprising:
a request receiving module, configured to receive a task creation request;
a configuration determining module, configured to determine a task configuration in response to the task creation request, wherein the task configuration comprises any one of a stream task configuration, a batch task configuration, and a stream-batch integrated task configuration; the task configuration is constructed based on table metadata, so that different task configurations are based on a stream-batch integrated table uniformly defined by the table metadata, and a corresponding stream data table or batch data table is selected, or the stream data table and the batch data table are switched, according to the computing task indicated by the task creation request;
a data determining module, configured to determine corresponding table metadata according to the task configuration, wherein the table metadata is used to uniformly define a stream data table and a batch data table having a mapping relation, and the table metadata is constructed when the stream data table and the batch data table are newly added;
and a data calculation module, configured to carry out a computing task corresponding to the task configuration based on the table metadata, wherein the computing task comprises any one of a stream task, a batch task, and a stream-batch integrated task.
9. The apparatus of claim 8, wherein in the task configuration, the stream task configuration and the batch task configuration are generated during addition of the stream-batch integrated task configuration through a low-code platform.
10. The apparatus of claim 8, wherein in the task configuration, the stream task configuration, the batch task configuration, and the stream-batch integrated task configuration are each added through corresponding SQL code.
11. The apparatus of claim 8, further comprising a data mapping module, the data mapping module comprising:
a mapping operation receiving unit for receiving a table mapping operation;
and the mapping operation response unit is used for mapping the corresponding stream data table and the batch data table in response to the table mapping operation, and generating the table metadata which uniformly define the stream data table and the batch data table.
12. The apparatus of claim 8, further comprising a data mapping module, the data mapping module comprising:
A service relation determining unit, configured to determine services to which the stream data table and the batch data table respectively belong;
and the service relation mapping unit is used for mapping the stream data table and the batch data table under the condition of belonging to the same service to generate the table metadata which uniformly defines the stream data table and the batch data table.
13. The apparatus according to claim 8, wherein the task configuration is a batch task configuration, and the data calculation module is specifically configured to obtain a corresponding batch data table according to the table metadata, and to carry out the batch task corresponding to the batch task configuration based on the batch data in the batch data table;
or, the data calculation module is specifically configured to obtain a corresponding stream data table according to the table metadata, and to carry out the stream task corresponding to the stream task configuration based on the stream data in the stream data table.
14. The apparatus according to claim 8, wherein the task configuration is a stream-batch integrated task configuration, the stream-batch integrated task configuration comprising a specified switching time, and the data calculation module is specifically configured to obtain a corresponding stream data table and batch data table according to the table metadata, and to carry out the stream-batch integrated task corresponding to the stream-batch integrated task configuration based on the batch data in the batch data table before the switching time and the stream data in the stream data table after the switching time.
15. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the streaming batch task integration method of any one of claims 1 to 7.
16. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the streaming batch task integration method of any one of claims 1 to 7 via execution of the executable instructions.
CN202311768754.7A 2023-12-20 2023-12-20 Streaming batch task integration method and device, storage medium and electronic equipment Active CN117435596B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311768754.7A CN117435596B (en) 2023-12-20 2023-12-20 Streaming batch task integration method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311768754.7A CN117435596B (en) 2023-12-20 2023-12-20 Streaming batch task integration method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN117435596A CN117435596A (en) 2024-01-23
CN117435596B true CN117435596B (en) 2024-04-02

Family

ID=89546558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311768754.7A Active CN117435596B (en) 2023-12-20 2023-12-20 Streaming batch task integration method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117435596B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446395A (en) * 2018-09-29 2019-03-08 上海派博软件有限公司 A kind of method and system of the raising based on Hadoop big data comprehensive inquiry engine efficiency
US10338958B1 (en) * 2014-01-27 2019-07-02 Amazon Technologies, Inc. Stream adapter for batch-oriented processing frameworks
CN112800091A (en) * 2021-01-26 2021-05-14 北京明略软件系统有限公司 Flow-batch integrated calculation control system and method
CN114064142A (en) * 2021-10-27 2022-02-18 浪潮软件科技有限公司 Batch-flow integrated data processing system and processing method
CN115495221A (en) * 2022-10-27 2022-12-20 中国建设银行股份有限公司 Data processing system and method
CN115617834A (en) * 2022-10-10 2023-01-17 杭州网易云音乐科技有限公司 Data processing method, device, equipment and storage medium
CN116841753A (en) * 2023-08-31 2023-10-03 杭州迅杭科技有限公司 Stream processing and batch processing switching method and switching device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220414105A1 (en) * 2021-06-23 2022-12-29 Microsoft Technology Licensing, Llc Heterogeneous data platform
CN113918771A (en) * 2021-09-08 2022-01-11 上海跬智信息技术有限公司 Batch flow fusion information processing method and device and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10338958B1 (en) * 2014-01-27 2019-07-02 Amazon Technologies, Inc. Stream adapter for batch-oriented processing frameworks
CN109446395A (en) * 2018-09-29 2019-03-08 上海派博软件有限公司 A kind of method and system of the raising based on Hadoop big data comprehensive inquiry engine efficiency
CN112800091A (en) * 2021-01-26 2021-05-14 北京明略软件系统有限公司 Flow-batch integrated calculation control system and method
CN114064142A (en) * 2021-10-27 2022-02-18 浪潮软件科技有限公司 Batch-flow integrated data processing system and processing method
CN115617834A (en) * 2022-10-10 2023-01-17 杭州网易云音乐科技有限公司 Data processing method, device, equipment and storage medium
CN115495221A (en) * 2022-10-27 2022-12-20 中国建设银行股份有限公司 Data processing system and method
CN116841753A (en) * 2023-08-31 2023-10-03 杭州迅杭科技有限公司 Stream processing and batch processing switching method and switching device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A metadata-navigation-based assembly model for service workflows; Wang Yuelong; Wang Wenjun; Luo Yingwei; Wang Xiaolin; Xu Zhuoqun; Chinese Journal of Computers; 2006-07-31 (Issue 07); full text *

Also Published As

Publication number Publication date
CN117435596A (en) 2024-01-23

Similar Documents

Publication Publication Date Title
CN110674228B (en) Data warehouse model construction and data query method, device and equipment
US10372492B2 (en) Job-processing systems and methods with inferred dependencies between jobs
JP6736173B2 (en) Method, system, recording medium and computer program for natural language interface to a database
US7937410B2 (en) Generic archiving of enterprise service oriented architecture data
US9870202B2 (en) Business object model layer interface
CA2906669C (en) Efficiently performing operations on distinct data values
US7991731B2 (en) Data access service queries
US9507838B2 (en) Use of projector and selector component types for ETL map design
US9201700B2 (en) Provisioning computer resources on a network
US11176125B2 (en) Blended retrieval of data in transformed, normalized data models
CN113448562B (en) Automatic logic code generation method and device and electronic equipment
US10417234B2 (en) Data flow modeling and execution
CN116450890A (en) Graph data processing method, device and system, electronic equipment and storage medium
CN114416868A (en) Data synchronization method, device, equipment and storage medium
CN111666166B (en) Service providing method, device, equipment and storage medium
CN117435596B (en) Streaming batch task integration method and device, storage medium and electronic equipment
US20210264312A1 (en) Facilitating machine learning using remote data
CN115292313A (en) Pseudo-column implementation method and device, electronic equipment and storage medium
US20230004574A1 (en) A system and method for etl pipeline processing
CN107066330B (en) Modular data distribution plan generation method and system
CN117076515B (en) Metadata tracing method and device in medical management system, server and storage medium
CN115630117B (en) Data analysis method, materialized view generation method and related equipment
EP3893110A1 (en) A system and method for etl pipeline processing
US11514007B1 (en) Dynamic data processing for a semantic data storage architecture
Virolainen Migrating Microservices to Graph Database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant