CN115904651A

CN115904651A - Task processing method and system of big data platform and big data platform

Info

Publication number: CN115904651A
Application number: CN202211336155.3A
Authority: CN
Inventors: 邓晓; 韩志华; 赵孔明
Original assignee: Shanghai Bilibili Technology Co Ltd
Current assignee: Shanghai Bilibili Technology Co Ltd
Priority date: 2022-10-28
Filing date: 2022-10-28
Publication date: 2023-04-04

Abstract

The embodiment of the invention discloses a task processing method and system for a big data platform, and a big data platform. The method comprises the following steps: a first system creates a first task and sends attribute information of the first task to a second system; the second system creates a virtual task of the first task, identifies, from the second tasks in the second system and according to the attribute information of the first task, a target second task that depends on the first task, and establishes a dependency relationship between the target second task and the virtual task; the first system sends task state information of the first task to the second system; and the second system updates the task state of the virtual task according to the task state information of the first task and determines whether to execute the target second task according to the task state of the virtual task. With this scheme, a cross-system task dependency is converted into a same-system task dependency, which reduces the system overhead caused by frequent cross-system polling, saves system resources, and improves task processing efficiency.

Description

Task processing method and system of big data platform and big data platform

Technical Field

The embodiment of the invention relates to the technical field of data processing, in particular to a task processing method and system for a big data platform and the big data platform.

Background

A big data platform is a network platform that provides services through content sharing, resource sharing, channel co-construction and/or data sharing. Big data platforms typically contain multiple different systems, and cross-system task dependencies may exist among them, i.e., a task in one system is a downstream task of a task in another system.

However, in the course of implementing the invention, the inventor found the following defect in the prior art: the downstream task polls the upstream task, so when the downstream task and the upstream task belong to different systems, polling must be carried out frequently across systems, which results in low task execution efficiency and high system overhead.

Disclosure of Invention

In view of the technical problems of low task execution efficiency and high system overhead in the prior art, embodiments of the present invention provide a task processing method, system, big data platform, computing device and computer storage medium that overcome or at least partially solve the above problems.

According to a first aspect of an embodiment of the present invention, a task processing method for a big data platform is provided, where the big data platform includes a first system and a second system, and the method includes:

the first system creates a first task and sends attribute information of the first task to the second system;

the second system creates a virtual task of the first task, identifies a target second task depending on the first task from second tasks in the second system according to the attribute information of the first task, and establishes a dependency relationship between the target second task and the virtual task;

the first system sends the task state information of the first task to the second system;

and the second system updates the task state of the virtual task according to the task state information of the first task and determines, according to the task state of the virtual task, whether to execute the target second task that depends on the virtual task.

In an optional implementation, the attribute information of the first task includes: output data information of the first task;

the identifying of a target second task dependent on the first task from the second tasks in the second system according to the attribute information of the first task further comprises: acquiring input data information of the second tasks in the second system, comparing the input data information of each second task with the output data information of the first task, and determining a second task whose input data information matches the output data information of the first task as the target second task.

In an optional embodiment, the determining of a second task whose input data information matches the output data information of the first task as the target second task further comprises:

and if the input data information of the second task contains the output data information of the first task, determining that the second task is a target second task.

In an optional embodiment, the output data information includes an output data table identifier, and the input data information includes an input data table identifier;

and/or, the output data information includes an output view identifier, and the input data information includes an input view identifier;

and/or, the output data information includes output data storage location information, and the input data information includes input data storage location information.
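The matching rule described above can be illustrated with a minimal sketch. It assumes that "contains" means set containment over identifiers; the class `DataInfo`, the function `find_target_second_tasks` and the table names are hypothetical and do not come from the source. A looser reading, in which any shared identifier is enough, would replace the subset tests with intersection tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataInfo:
    """Hypothetical container for a task's input or output data information."""
    tables: frozenset = frozenset()     # data table identifiers
    views: frozenset = frozenset()      # view identifiers
    locations: frozenset = frozenset()  # data storage locations

    def is_empty(self) -> bool:
        return not (self.tables or self.views or self.locations)

    def contains(self, other: "DataInfo") -> bool:
        """True if every identifier in `other` also appears in this object."""
        return (other.tables <= self.tables
                and other.views <= self.views
                and other.locations <= self.locations)


def find_target_second_tasks(first_output, second_task_inputs):
    """Return ids of second tasks whose inputs contain the first task's outputs."""
    if first_output.is_empty():
        # Assumption: a first task with no declared output has no downstream task.
        return []
    return [task_id for task_id, inputs in second_task_inputs.items()
            if inputs.contains(first_output)]


first_output = DataInfo(tables=frozenset({"dwd.live_danmaku"}))
second_task_inputs = {
    "task_2": DataInfo(tables=frozenset({"dwd.live_danmaku", "dim.room"})),
    "task_3": DataInfo(tables=frozenset({"dim.room"})),
}
targets = find_target_second_tasks(first_output, second_task_inputs)
```

In this example only `task_2` reads the table produced by the first task, so only `task_2` is selected as a target second task.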

In an optional implementation manner, the second system creating a virtual task of the first task and identifying, according to the attribute information of the first task, a target second task dependent on the first task from the second tasks in the second system further includes:

the second system judges whether a target second task dependent on the first task exists in the second system according to the attribute information of the first task;

and if so, creating a virtual task of the first task.
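A minimal sketch of this conditional creation follows. It assumes that output and input data information are plain sets of identifiers and that the second system keeps virtual tasks and dependencies in dictionaries; the function name, the `virtual::` prefix and the `"PENDING"` state label are illustrative assumptions.

```python
def maybe_create_virtual_task(first_task_id, first_outputs, second_task_inputs,
                              virtual_tasks, dependencies):
    """Create a virtual task only if some second task depends on the first task.

    first_outputs:      set of output identifiers of the first task
    second_task_inputs: dict mapping second task id -> set of input identifiers
    virtual_tasks:      dict mapping virtual task id -> state (mutated in place)
    dependencies:       dict mapping second task id -> set of virtual task ids
    """
    targets = [task_id for task_id, inputs in second_task_inputs.items()
               if first_outputs and first_outputs <= inputs]
    if not targets:
        # No dependent second task: skip creation and save second-system resources.
        return None
    virtual_id = f"virtual::{first_task_id}"  # hypothetical naming scheme
    virtual_tasks[virtual_id] = "PENDING"
    for task_id in targets:
        dependencies.setdefault(task_id, set()).add(virtual_id)
    return virtual_id


virtual_tasks, dependencies = {}, {}
inputs = {"task_2": {"t_out", "dim.room"}}
created = maybe_create_virtual_task("task_1", {"t_out"}, inputs, virtual_tasks, dependencies)
skipped = maybe_create_virtual_task("task_9", {"unused"}, inputs, virtual_tasks, dependencies)
```

The second call returns `None` and leaves `virtual_tasks` unchanged, because no second task reads the output of `task_9`.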

In an alternative embodiment, the task state includes at least one of the following states: task pending execution, task in execution, data output completed, task dormant, and task deleted.
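As an illustration, the task states can be modeled as an enumeration. The sketch below reads the list above as five states (pending execution, in execution, data output completed, dormant, deleted); the member names and the rule in `downstream_may_run` are assumptions, since the source only says that execution of the target second task is decided according to the task state of the virtual task.

```python
from enum import Enum


class TaskState(Enum):
    PENDING = "task pending execution"
    RUNNING = "task in execution"
    OUTPUT_DONE = "data output completed"
    DORMANT = "task dormant"
    DELETED = "task deleted"


def downstream_may_run(upstream_states):
    """Assumed rule: run only when every upstream virtual task has completed output.

    Note that an empty list of upstream states yields True (no dependency to wait for).
    """
    return all(state is TaskState.OUTPUT_DONE for state in upstream_states)


ready = downstream_may_run([TaskState.OUTPUT_DONE])
blocked = downstream_may_run([TaskState.OUTPUT_DONE, TaskState.RUNNING])
```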

In an optional implementation, the attribute information of the first task includes: a data output period of the first task;

after the second system creates a virtual task of the first task, the method further includes: and the second system configures a task scheduling period of the virtual task according to the data output period of the first task and schedules the virtual task according to the task scheduling period.

In an optional implementation manner, the scheduling the virtual task according to the task scheduling period further includes: determining a plurality of period units according to the task scheduling period, and generating a virtual task instance corresponding to each period unit;

the task state information of the first task comprises: task state information of the first task in any period unit;

the second system updating the task state of the virtual task according to the task state information of the first task further comprises: and the second system updates the instance state of the virtual task instance corresponding to the corresponding period unit according to the task state information of the first task in any period unit.
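The per-period bookkeeping can be sketched as follows. The sketch assumes that a period unit is identified by its start time and that the scheduling period is a fixed `timedelta`; the class `VirtualTask`, the helper `period_units` and the state labels are hypothetical.

```python
from datetime import datetime, timedelta


def period_units(start, end, period):
    """Split [start, end) into consecutive (unit_start, unit_end) period units."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    units = []
    cursor = start
    while cursor < end:
        units.append((cursor, cursor + period))
        cursor += period
    return units


class VirtualTask:
    """Holds one virtual task instance per period unit, keyed by unit start time."""

    def __init__(self, first_task_id, start, end, period):
        self.first_task_id = first_task_id
        self.instances = {
            unit_start: "PENDING"
            for unit_start, _unit_end in period_units(start, end, period)
        }

    def update_instance(self, unit_start, state):
        """Apply the first task's state for one period unit to the matching instance."""
        if unit_start not in self.instances:
            raise KeyError(f"no virtual task instance for period unit {unit_start}")
        self.instances[unit_start] = state


virtual_task = VirtualTask("task_1", datetime(2022, 10, 28), datetime(2022, 10, 31),
                           timedelta(days=1))
virtual_task.update_instance(datetime(2022, 10, 29), "OUTPUT_DONE")
```

With a daily period over three days, three instances are generated, and only the instance for 2022-10-29 changes state.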

In an optional implementation manner, the task state of the first task is that the first task has completed data output in a preset period unit;

the first system sending the task state information of the first task to the second system further comprises: the first system sends the information that the first task is in a state of finishing data output in a preset period unit to a second system;

the method further comprises: the second system determines the receiving time of the information that the first task is in a state of finishing data output in the preset period unit, and compares the receiving time with the termination time of the preset period unit; and if the receiving time does not match the termination time, the second system generates alarm information.
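The source does not define what counts as a mismatch between the receiving time and the termination time. The sketch below assumes one plausible reading: the completion message is late if it arrives more than a tolerance after the end of the period unit. The function name, the 30-minute default and the message format are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional


def check_output_arrival(receive_time: datetime, unit_end: datetime,
                         tolerance: timedelta = timedelta(minutes=30)) -> Optional[str]:
    """Return alarm text if the completion message arrived too late, else None.

    Assumption: "not matched" means receive_time > unit_end + tolerance.
    """
    if receive_time > unit_end + tolerance:
        delay = receive_time - unit_end
        return (f"ALARM: output for period ending {unit_end.isoformat()} "
                f"arrived {delay} after the period end")
    return None


unit_end = datetime(2022, 10, 29, 0, 0)
on_time = check_output_arrival(datetime(2022, 10, 29, 0, 10), unit_end)
late = check_output_arrival(datetime(2022, 10, 29, 2, 0), unit_end)
```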

In an optional embodiment, the method further comprises: and if the task state of the virtual task is the completion of data output, the second system performs data quality detection on the output data of the first task.

According to a second aspect of the present invention, there is provided a task processing method for a big data platform, the method being performed by a first system in the big data platform, the method comprising:

creating a first task;

sending the attribute information of the first task to a second system, so that the second system creates a virtual task of the first task, identifies, according to the attribute information of the first task, a target second task depending on the first task from the second tasks in the second system, and establishes a dependency relationship between the target second task and the virtual task;

and sending the task state information of the first task to a second system so that the second system updates the task state of the virtual task according to the task state information of the first task, and determines whether to execute a target second task having a dependency relationship with the virtual task according to the task state of the virtual task.

According to a third aspect of the present invention, there is provided a task processing method for a big data platform, the method being performed by a second system in the big data platform, the method comprising:

receiving attribute information of a first task sent by a first system, and identifying a target second task depending on the first task from second tasks according to the attribute information of the first task;

creating a virtual task of the first task, and establishing a dependency relationship between the target second task and the virtual task;

receiving task state information of a first task sent by a first system, updating a task state of the virtual task according to the task state information of the first task, and determining whether to execute a target second task having a dependency relationship with the virtual task according to the task state of the virtual task.

In an optional embodiment, the attribute information of the first task includes output data information of the first task, and the identifying of a target second task dependent on the first task from the second tasks according to the attribute information of the first task further comprises: acquiring input data information of the second tasks, comparing the input data information of each second task with the output data information of the first task, and determining a second task whose input data information matches the output data information of the first task as the target second task.

In an optional embodiment, the creating the virtual task of the first task further comprises:

judging whether a target second task dependent on the first task exists in the second system or not according to the attribute information of the first task;

and if so, creating a virtual task of the first task.

In an optional embodiment, the attribute information of the first task includes a data output period of the first task, and after the virtual task of the first task is created, the method further includes: configuring a task scheduling period of the virtual task according to the data output period of the first task, and scheduling the virtual task according to the task scheduling period.

In an optional embodiment, said scheduling the virtual task according to the task scheduling cycle further includes: determining a plurality of period units according to the task scheduling period, and generating a virtual task instance corresponding to each period unit;

the updating the task state of the virtual task according to the task state information of the first task further comprises: and updating the instance state of the virtual task instance corresponding to the corresponding period unit according to the task state information of the first task in any period unit.

In an optional embodiment, the method further comprises: determining the receiving time of the information that the first task is in a state of finishing data output in a preset period unit, and comparing the receiving time with the termination time of the preset period unit; and if the receiving time does not match the termination time, generating alarm information.

In an optional embodiment, the method further comprises: and if the task state of the virtual task is the completion of data output, performing data quality detection on the output data of the first task.

According to a fourth aspect of embodiments of the present invention, there is provided a first system, the system comprising:

a creation module for creating a first task;

a sending module, configured to send the attribute information of the first task to a second system, so that the second system creates a virtual task of the first task, identifies, according to the attribute information of the first task, a target second task depending on the first task from the second tasks in the second system, and establishes a dependency relationship between the target second task and the virtual task; and to send the task state information of the first task to the second system, so that the second system updates the task state of the virtual task according to the task state information of the first task and determines, according to the task state of the virtual task, whether to execute the target second task having a dependency relationship with the virtual task.

According to a fifth aspect of embodiments of the present invention, there is provided a second system, the system including:

the receiving module is used for receiving attribute information of a first task sent by a first system and receiving task state information of the first task sent by the first system;

the identification module is used for identifying a target second task dependent on the first task from second tasks according to the attribute information of the first task;

a creating module for creating a virtual task of the first task;

the establishing module is used for establishing a dependency relationship between the target second task and the virtual task;

and the updating module is used for updating the task state of the virtual task according to the task state information of the first task and determining whether to execute a target second task which is dependent on the virtual task according to the task state of the virtual task.

In an alternative embodiment, the identification module is configured to: acquire input data information of the second tasks, compare the input data information of each second task with the output data information of the first task, and determine a second task whose input data information matches the output data information of the first task as the target second task.

In an alternative embodiment, the identification module is configured to: and if the input data information of the second task contains the output data information of the first task, determining that the second task is a target second task.

In an alternative embodiment, the creation module is configured to: judging whether a target second task dependent on the first task exists in the second system or not according to the attribute information of the first task;

and if so, creating a virtual task of the first task.

In an alternative embodiment, the creation module is configured to: configure a task scheduling period of the virtual task according to the data output period of the first task, and schedule the virtual task according to the task scheduling period.

In an alternative embodiment, the creation module is configured to: determining a plurality of period units according to the task scheduling period, and generating a virtual task instance corresponding to each period unit;

the update module is to: and updating the instance state of the virtual task instance corresponding to the corresponding period unit according to the task state information of the first task in any period unit.

In an alternative embodiment, the system further comprises: an alarm module, configured to, if the task state of the first task is that the first task has completed data output in a preset period unit, determine the receiving time of the information that the first task is in a state of finishing data output in the preset period unit, compare the receiving time with the termination time of the preset period unit, and generate alarm information if the receiving time does not match the termination time.

In an alternative embodiment, the system further comprises: and the quality detection module is used for detecting the data quality of the output data of the first task if the task state of the virtual task is the completion of data output.

According to a sixth aspect of the embodiments of the present invention, there is provided a big data platform, including: the first system and the second system.

According to a seventh aspect of embodiments of the present invention, there is provided a computing device, including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the scheduling method.

According to an eighth aspect of the embodiments of the present invention, a computer storage medium is provided, where at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to execute operations corresponding to the task processing method of the big data platform.

The embodiment of the invention creates, in the second system, a virtual task of the first task of the first system, and establishes in the second system a dependency relationship between the downstream task of the first task and the virtual task, thereby converting the cross-system task dependency into a same-system task dependency, reducing the system overhead caused by frequent cross-system polling and saving system resources. Moreover, the first system synchronizes the task state of the first task to the second system in a timely manner, so that the task state of the virtual task in the second system stays consistent with the task state of the first task. Because the downstream task has a dependency relationship with the virtual task, dependency detection for the downstream task can be triggered promptly after the task state of the first task changes, so that the downstream task can be executed quickly and task processing efficiency is improved.

According to the embodiment of the invention, the input data information of the second task is compared with the output data information of the first task, so that whether the second task has a dependency relationship with the first task can be accurately determined, and the identification precision of the target second task is improved.

According to the embodiment of the invention, the target second task can be quickly identified according to the inclusion relation between the input data information of the second task and the output data information of the first task.

According to the embodiment of the invention, the target second task can be identified according to the data table identification, the output view identification, the output data storage position information and the like, so that the flexibility of the identification mode of the target second task is improved.

The second system of the embodiment of the invention establishes the virtual task of the first task after determining that the target second task depending on the first task exists, thereby saving the second system resource.

The task states in the embodiment of the invention include task pending execution, data output completed, task dormant, and task deleted, so that the execution state of a task can be accurately reflected and tasks can be executed accurately.

The attribute information of the first task comprises a data output period of the first task, and the second system configures a task scheduling period of the virtual task according to the data output period of the first task and schedules the virtual task according to the task scheduling period, so that the data output period of the first task is matched with the task scheduling period of the corresponding virtual task, and the virtual task can accurately reflect the state of the first task.

The embodiment of the invention determines a plurality of period units according to the task scheduling period, generates the virtual task instance corresponding to each period unit, and updates the instance state of the virtual task instance corresponding to the corresponding period unit according to the task state information of the first task in any period unit. By generating the virtual task instance of each period unit, the execution state of the first task can be embodied in a fine-grained and precise manner.

In the embodiment of the invention, the second system determines the receiving time of the information that the first task is in the state of finishing data output in the preset period unit, and compares the receiving time with the termination time of the preset period unit; and if the receiving time is not matched with the termination time, generating alarm information, thereby realizing the positioning and the alarm of the task execution abnormity.

In the embodiment of the invention, the second system performs data quality detection on the output data of the first task, and ensures accurate execution of the downstream task.

The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the description, and so that the foregoing and other objects, features, and advantages of the embodiments are more readily apparent, specific embodiments of the present invention are described below.

Drawings

Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a diagram illustrating inter-system task dependency processing according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating another inter-system task dependency processing provided by an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a task processing method of a big data platform according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a task processing method for a big data platform according to another embodiment of the present invention;

FIG. 5 is a flowchart illustrating a task processing method for a big data platform according to another embodiment of the present invention;

FIG. 6 is a flowchart illustrating a task processing method for a big data platform according to another embodiment of the present invention;

FIG. 7 is a schematic diagram of a first system according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a second system according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a big data platform according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a computing device according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the embodiments to those skilled in the art.

The inventor of the present application found that having a downstream task poll an upstream task leads to low task execution efficiency and high system overhead. For example, system A and system B are two different systems in a big data platform, and task 2 and task 3 in system B both depend on the output data of task 1 in system A, so task 2 and task 3 are downstream tasks of task 1 and have a dependency relationship with task 1. With the approach shown in FIG. 1, task 2 and task 3 in system B poll the task state of task 1 in system A across systems. Frequent polling interaction is then required between system A and system B, which increases the system overhead of both systems; moreover, cross-system polling cannot obtain the task state of task 1 in a timely manner, which reduces the task processing efficiency of task 2 and task 3.

To solve this technical problem, as shown in FIG. 2, the embodiment of the present invention creates a virtual task of task 1 in system B, and task 1 synchronizes its task state to the virtual task, so that the virtual task reflects the task state of task 1. In addition, dependency relationships are established between the virtual task and each of task 2 and task 3 in system B, so that the original cross-system task dependency is converted into a task dependency within system B. This reduces the system overhead of system A and system B, allows task 2 and task 3 to obtain the task state of task 1 in a timely manner, and improves the task processing efficiency of task 2 and task 3.
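The push-based arrangement can be sketched end to end. This is a minimal in-process illustration, not the patented implementation: direct method calls stand in for whatever transport connects the two systems, and the class names, method names, state labels and table name are assumptions.

```python
class SystemB:
    """Second system: owns task 2 and task 3 plus virtual tasks mirroring system A."""

    def __init__(self, task_inputs):
        self.task_inputs = task_inputs  # second task id -> set of input tables
        self.virtual_state = {}         # first task id -> mirrored state
        self.depends_on = {}            # second task id -> set of first task ids
        self.executed = []              # second tasks triggered so far, in order

    def on_task_created(self, task_id, output_tables):
        # Create the virtual task and link every second task that reads its output.
        self.virtual_state[task_id] = "PENDING"
        for second_id, inputs in self.task_inputs.items():
            if output_tables and output_tables <= inputs:
                self.depends_on.setdefault(second_id, set()).add(task_id)

    def on_state_update(self, task_id, state):
        # Mirror the pushed state, then run a local (same-system) dependency check.
        self.virtual_state[task_id] = state
        for second_id, upstream in self.depends_on.items():
            ready = all(self.virtual_state[u] == "OUTPUT_DONE" for u in upstream)
            if ready and second_id not in self.executed:
                self.executed.append(second_id)


class SystemA:
    """First system: owns task 1 and pushes attributes and state changes to system B."""

    def __init__(self, system_b):
        self.system_b = system_b
        self.cross_system_calls = 0

    def create_task(self, task_id, output_tables):
        self.cross_system_calls += 1
        self.system_b.on_task_created(task_id, output_tables)

    def report_state(self, task_id, state):
        self.cross_system_calls += 1
        self.system_b.on_state_update(task_id, state)


system_b = SystemB({"task_2": {"t1_out"}, "task_3": {"t1_out", "dim.room"}})
system_a = SystemA(system_b)
system_a.create_task("task_1", {"t1_out"})
system_a.report_state("task_1", "RUNNING")
executed_while_running = list(system_b.executed)
system_a.report_state("task_1", "OUTPUT_DONE")
```

In this run, system A makes three cross-system calls in total (one creation message and two state changes), and task 2 and task 3 are triggered only after the completion state arrives. The number of calls depends on the number of state changes of task 1, not on how often the downstream tasks check their dependency.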

The following describes the embodiments of the present invention in detail with reference to various embodiments.

Fig. 3 is a flowchart illustrating a task processing method of a big data platform according to an embodiment of the present invention. The task processing method for the big data platform provided by the embodiment of the invention can be executed by a preset big data platform, and the big data platform comprises a first system and a second system. The flowcharts in the embodiments of the present invention are not used to limit the order of executing the steps. Some steps in the flowchart may be added or deleted as desired.

Specifically, as shown in fig. 3, the method includes the steps of:

Step S310, the first system creates a first task.

The first task is a task that is created in the first system and can be executed in the first system. A task in the embodiment of the present invention consists of one or more data processing steps. Taking a live broadcast data acquisition system as an example of the first system, the first task may be, for example, to obtain the bullet screen information of a live broadcast room in real time.

Step S320, the first system sends the attribute information of the first task to the second system.

The first system and the second system are two different systems, but the execution of at least one task in the second system depends on the task in the first system, i.e. at least one task in the second system is a downstream task of the task in the first system. Thus, the first system transmits the attribute information of the first task to the second system after creating the first task.

In an alternative embodiment, the attribute information of the first task includes, but is not limited to, one or more of the following: output data information of the first task, a data output period of the first task, a task identification and a system identification of the first task, and the like. The output data information of the first task specifically refers to relevant information of data produced by the first task, for example, the output data information may be an output table identifier, a view identifier, output data storage location information, and the like; the data output period of the first task is specifically the period corresponding to the output data of the first task; the task identifier of the first task is a unique identifier of the first task in the first system, and the system identifier of the first task is specifically a first system identifier.
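For illustration, the attribute information could be carried as a small structured message. The field names, the JSON encoding and the `"1d"` period notation below are assumptions; the source lists the kinds of information but does not specify a format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class FirstTaskAttributes:
    """Hypothetical payload the first system sends after creating the first task."""
    task_id: str                 # unique identifier of the first task in the first system
    system_id: str               # identifier of the first system
    output_tables: List[str] = field(default_factory=list)
    output_views: List[str] = field(default_factory=list)
    output_locations: List[str] = field(default_factory=list)
    output_period: str = "1d"    # data output period, in an assumed notation


def to_message(attrs: FirstTaskAttributes) -> str:
    """Serialize the attributes for transmission to the second system."""
    return json.dumps(asdict(attrs), sort_keys=True)


def from_message(raw: str) -> FirstTaskAttributes:
    """Rebuild the attributes on the second system's side."""
    return FirstTaskAttributes(**json.loads(raw))


attrs = FirstTaskAttributes(task_id="task_1", system_id="system_a",
                            output_tables=["dwd.live_danmaku"])
received = from_message(to_message(attrs))
```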

Step S330, the second system creates a virtual task of the first task, identifies a target second task depending on the first task from the second tasks in the second system according to the attribute information of the first task, and establishes a dependency relationship between the target second task and the virtual task.

The virtual task of the first task is a task that is created in the second system and can be executed in the second system. The virtual task is a mapping of the first task in the second system: it does not execute the data processing flow of the first task in the second system, but only mirrors the relevant state of the first task there. The virtual task of the first task therefore requires very little running time and few system resources.

The second task is a task which is created in the second system and can run in the second system, and the target second task which is dependent on the first task is identified from the second tasks in the second system according to the attribute information of the first task. That is, the target second task is screened from the second tasks in the second system, and the target second task depends on the first task, in other words, the target second task is a downstream task of the first task, and the execution of the target second task depends on the output data of the first task.

In an alternative embodiment, a target second task of the first task may be identified using one or a combination of the following identification modes:

Identification mode one: if the attribute information of the first task includes the output data information of the first task, the input data information of the second tasks in the second system is acquired and compared with the output data information of the first task, and a second task whose input data information matches the output data information of the first task is determined as a target second task. In this identification mode, the input information of the target second task matches the output information of the first task. Because the target second task is identified from the output data information of the first task and the input data information of the second task, whether a data lineage relationship exists between the first task and the second task can be accurately determined, and the target second task can be accurately identified.

Further, the determining, as the target second task, the second task whose input data information matches the output data information of the first task specifically includes: and if the input data information of the second task contains the output data information of the first task, determining that the second task is the target second task. Specifically, if the input data information of the second task includes the output data information of the first task, it indicates that the data input during execution of the second task includes the data output by the first task, so as to determine that the second task needs to be executed after the data output by the first task, and the second task is a downstream task of the first task, so as to determine that the second task is the target second task.

The output data information in the embodiment of the present invention includes one or more of the following types of information:

Type one: the output data information includes an output data table identifier, and the input data information includes an input data table identifier. Specifically, the output data table identifier is the identifier of the data table in which the data output by a task is located, and the input data table identifier is the identifier of the data table in which the data input by a task is located. In a specific implementation process, if an input data table identifier of the second task is consistent with an output data table identifier of the first task, the input data information of the second task contains the output data information of the first task, and the second task is determined to be the target second task. In this manner, the target second task is identified through the data table identifier, so that a data table dependency between the target second task and the virtual task can be realized subsequently, and the accuracy of data processing is guaranteed.

Type two: the output data information comprises an output view identification and the input data information comprises an input view identification. Specifically, the output view identifier is specifically an identifier of a view in which data output by the task is located, and the input view identifier is specifically an identifier of a view in which data input by the task is located. A view is a result set of one or more data tables combined according to some condition. In a specific implementation process, if a certain input view identifier of the second task is consistent with a certain output view identifier of the first task, which indicates that the input data information of the second task contains the output data information of the first task, the second task is determined to be a target second task. In the method, the target second task is identified through the view identifier, so that view dependence of the target second task and the virtual task can be realized subsequently, complexity of the data table under the view can be hidden, when the output table of the first task changes, only the output table corresponding to the view needs to be changed, and dependence of the target second task does not need to be modified again, so that expandability of the method is improved, and change efficiency is improved.

Type three: the output data information includes output data storage location information and the input data information includes input data storage location information. Specifically, the output data storage location information is a storage location of data output by the task, and the input data storage location information is a storage location of data input by the task. And if the information of a certain input data storage position of the second task is consistent with the information of a certain output data storage position of the first task, the input data information of the second task is indicated to contain the output data information of the first task, and the second task is determined to be the target second task. According to the method, the target second task is identified through the storage position, so that the storage position dependence of the target second task and the virtual task can be realized subsequently, and when the output table of the first task changes, the dependence of the target second task does not need to be modified again, so that the expandability of the method is improved, and the change efficiency is improved.
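As a minimal sketch of identification mode one, the matching can be expressed as a set intersection between the input data information of each second task and the output data information of the first task. The names `DataRef`, `SecondTask`, and `find_target_second_tasks`, as well as the "table", "view", and "location" labels, are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRef:
    """One piece of data information (hypothetical representation)."""
    kind: str        # assumed labels: "table", "view", or "location"
    identifier: str  # e.g. a table identifier or a storage location


@dataclass
class SecondTask:
    task_id: str
    inputs: frozenset = frozenset()  # input data information (DataRef items)


def find_target_second_tasks(first_task_outputs, second_tasks):
    """Return the second tasks whose input data information contains at
    least one item of the first task's output data information."""
    outputs = set(first_task_outputs)
    return [task for task in second_tasks if task.inputs & outputs]


outputs = [DataRef("table", "danmaku_hourly")]
candidates = [
    SecondTask("analyze", frozenset({DataRef("table", "danmaku_hourly")})),
    SecondTask("other", frozenset({DataRef("table", "users")})),
]
targets = find_target_second_tasks(outputs, candidates)
```

Because the kind is part of each `DataRef`, a table identifier never matches a view identifier that happens to have the same name.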

Identification mode two: a task dependency table of the second tasks is generated in the second system in advance, and the task dependency table can be configured by a task configuration person. The task dependency table of a second task includes the system identifier and the task identifier of each task on which the second task depends. In this identification mode, the attribute information of the first task includes the system identifier and the task identifier of the first task. When the target second task is identified, the task dependency table is searched, and a second task for which a system identifier and a task identifier recorded in the task dependency table are consistent with the system identifier and the task identifier of the first task is determined as the target second task. That is, in the task dependency table, the system identifier of a task on which the target second task depends is the system identifier of the first system, and the task identifier of that task is the task identifier of the first task in the first system. In this manner, the target second task can be identified rapidly, and the overall task processing efficiency is improved.
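The lookup in the task dependency table can be sketched as follows. The table layout (a mapping from a second task identifier to a set of system identifier and task identifier pairs) and all identifiers in it are assumptions made for illustration only.

```python
# Hypothetical task dependency table kept in the second system:
# second task id -> set of (system identifier, task identifier) it depends on.
DEPENDENCY_TABLE = {
    "hourly_report": {("collector", "danmaku_ingest")},
    "daily_rollup": {("scheduler", "hourly_report")},
    "user_stats": {("collector", "user_ingest"), ("collector", "danmaku_ingest")},
}


def find_targets_by_dependency_table(table, system_id, task_id):
    """Return the ids of second tasks whose dependency entries contain the
    (system identifier, task identifier) pair of the first task."""
    key = (system_id, task_id)
    return sorted(name for name, deps in table.items() if key in deps)
```

Matching on the pair rather than on the task identifier alone keeps tasks with the same identifier in different systems apart.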

In another optional implementation manner, in order to avoid wasting resources of the second system, after receiving the attribute information of the first task, the second system first determines, according to the attribute information of the first task, whether a target second task dependent on the first task exists in the second system. If yes, the second system creates a virtual task of the first task; if not, the second system does not create the virtual task of the first task. Therefore, when no downstream task of the first task exists in the second system, no virtual task of the first task is created, so that excessive invalid virtual tasks are avoided and the resources of the second system are saved.

In yet another alternative implementation, after the virtual task of the first task is created by the second system, the attribute information of the virtual task is generated according to the attribute information of the first task. For example, the output data information of the first task may be used as the output data information of the virtual task. Since the output data information of the virtual task is the same as the output data information of the first task, the second system may also identify the target second task corresponding to the first task according to the output data information of the virtual task.

Further, after the target second task of the first task is identified and the virtual task of the first task is created, the dependency relationship between the target second task and the virtual task is established. Establishing the dependency relationship between the target second task and the virtual task specifically includes configuring the target second task as a downstream task of the virtual task. After the dependency relationship is established, the target second task can be executed quickly once it is determined that the virtual task has completed data output. Therefore, the original cross-system task dependency between the target second task and the first task is converted into a same-system task dependency between the target second task and the virtual task, the system overhead caused by frequent cross-system polling is avoided, and the task processing efficiency is improved.
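The creation of the virtual task and the establishment of the dependency relationship can be sketched as below, including the optional check that skips creation when no target second task exists. The `Task` class, the `virtual::` naming scheme, and `register_first_task` are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    upstream: list = field(default_factory=list)  # tasks this task depends on
    is_virtual: bool = False


def register_first_task(first_task_id, target_second_tasks):
    """Create a virtual task for the first task and configure every target
    second task as a downstream task of it. Return None when there is no
    target second task, so that no invalid virtual task is created."""
    if not target_second_tasks:
        return None
    virtual = Task(task_id=f"virtual::{first_task_id}", is_virtual=True)
    for task in target_second_tasks:
        task.upstream.append(virtual)
    return virtual
```

After this step the target second task only refers to a task inside the second system, which is what removes the need for cross-system polling.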

Step S340, the first system sends the task state information of the first task to the second system.

Task state information is information that describes the state a task is currently in. In the embodiment of the present invention, the task state includes at least one of the following states: task pending execution, task executing, data output completed, task dormant, task deleted, and the like. Task pending execution is the state of waiting for execution after the task is created, and indicates that the task is neither dormant nor deleted; task executing means that the task is processing data; data output completed means that the task has output the corresponding data; task dormant means that the task is in a dormant state and is not executed; task deleted means that the task has been deleted.

In an optional implementation manner, in order to save information transmission overhead between the first system and the second system, after the task state of the first task is changed, the first system sends the changed corresponding task state information to the second system.
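A minimal sketch of sending state information only when the state changes is given below. The `StateReporter` class and the `send` callback, which stands in for whatever transport connects the first system to the second system, are assumptions.

```python
class StateReporter:
    """Forward a task's state to the second system only when it changes."""

    def __init__(self, send):
        self._send = send   # hypothetical transport to the second system
        self._last = {}     # task id -> last state that was sent

    def report(self, task_id, state):
        """Send the state if it differs from the last one sent; return
        True when something was sent and False otherwise."""
        if self._last.get(task_id) == state:
            return False    # unchanged state: nothing is transmitted
        self._last[task_id] = state
        self._send(task_id, state)
        return True
```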

Step S350, the second system updates the task state of the virtual task according to the task state information of the first task, and determines whether to execute a target second task having a dependency relationship with the virtual task according to the task state of the virtual task.

The second system updates the task state of the virtual task of the first task according to the task state information of the first task, so that the task state of the virtual task is consistent with the task state of the first task and the state of the first task is mapped accurately onto the virtual task. For example, the first task and the virtual task are in the task pending execution state by default after creation. After the first task in the first system starts executing, the task state of the first task changes from task pending execution to task executing, task state information indicating that the first task is in the task executing state is generated, and the first system sends this task state information to the second system. After receiving the task state information, the second system changes the task state of the virtual task of the first task from task pending execution to task executing. For another example, after the first task completes data output, the task state of the first task changes from task executing to data output completed, task state information indicating that the first task is in the data output completed state is generated, and the first system sends this task state information to the second system. After receiving the task state information, the second system changes the task state of the virtual task of the first task from task executing to data output completed. For another example, after the first task is deleted by the first system, task state information indicating that the first task is in the task deleted state is generated, and the first system sends this task state information to the second system. After receiving the task state information, the second system deletes the virtual task of the first task, so that the task state of the virtual task becomes task deleted.

Because the dependency relationship between the target second task and the virtual task of the first task has been established, whether to execute the target second task can be determined according to the task state of the virtual task. Specifically, after the task state of the virtual task is updated, if the task state of the virtual task has changed to data output completed, the dependency detection of the target second task is triggered to judge whether the execution condition of the target second task is currently met, and if yes, the target second task is executed.
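The state update and the triggering of dependency detection can be sketched as follows. The `TaskState` enumeration, the `SecondSystem` class, and the rule that a second task runs once all of its upstream tasks have completed data output are assumptions; the embodiment does not specify the execution condition in this detail.

```python
from enum import Enum


class TaskState(Enum):
    PENDING = "task pending execution"
    RUNNING = "task executing"
    OUTPUT_DONE = "data output completed"
    DORMANT = "task dormant"
    DELETED = "task deleted"


class SecondSystem:
    def __init__(self):
        self.states = {}      # task id (virtual or local) -> TaskState
        self.upstreams = {}   # second task id -> list of upstream task ids
        self.executed = []    # second tasks that have been started

    def on_state_info(self, virtual_id, state):
        """Mirror the first task's state onto its virtual task, then run
        dependency detection if data output has completed."""
        self.states[virtual_id] = state
        if state is not TaskState.OUTPUT_DONE:
            return
        for task_id, ups in self.upstreams.items():
            # Assumed execution condition: every upstream has completed output.
            if virtual_id in ups and all(
                self.states.get(u) is TaskState.OUTPUT_DONE for u in ups
            ):
                self.executed.append(task_id)
```

Dependency detection runs only in reaction to a received state change, so the second system never has to poll the first system.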

Therefore, the virtual task of the first task in the first system is created in the second system, and the dependency relationship between the downstream task (namely the target second task) of the first task in the second system and the virtual task is established, so that the cross-system task dependency is changed into the same-system task dependency, the system overhead caused by frequent cross-system polling is reduced, and the system resources are saved. And the first system synchronizes the task state of the first task to the second system in time, so that the task state of the virtual task in the second system is consistent with the task state of the first task, and the target second task and the virtual task establish a dependency relationship, so that the dependency detection of the target second task can be triggered in time after the task state of the first task is changed, the target second task can be executed quickly, and the task processing efficiency is improved.

Fig. 4 is a schematic flowchart illustrating a task processing method of another big data platform according to an embodiment of the present invention. The task processing method of the big data platform provided by the embodiment of the invention can be executed by a preset big data platform, and the big data platform comprises a first system and a second system. The flowcharts in the embodiments of the present invention are not used to limit the order of executing the steps. Some steps in the flowchart may be added or deleted as desired.

Specifically, as shown in fig. 4, the method includes the steps of:

Step S410, the first system creates a first task.

Step S420, the first system sends the attribute information of the first task to the second system; the attribute information includes output data information and a data output period.

In actual implementation, tasks that output data periodically are common. Taking a task that collects the bullet-screen comment information of a live broadcast room in real time as an example, the task may output data by the hour, that is, each time the bullet-screen comment information posted by users within 1 hour has been collected, the collected information is output for storage. Accordingly, there may be tasks that are executed periodically, such as a task that periodically analyzes the bullet-screen comment information generated in each hour.

The first system in this embodiment may be a data acquisition system, which contains many tasks that output data periodically. The second system in this embodiment may be a job scheduling system in a big data platform. The job scheduling system is mainly used to start the correct task at the correct time point, so as to ensure that tasks are executed timely and accurately according to the correct dependency relationships, and it typically contains many tasks that are executed periodically. Therefore, the embodiment of the invention can be applied to a data acquisition system and a job scheduling system in a big data platform. The present embodiment mainly processes a first task that outputs data periodically and a target second task that is executed periodically.

Specifically, the attribute information in this embodiment further includes a data output period in addition to the output data information. The data output period may be weekly, daily, hourly, minute, etc.

Step S430, the second system creates a virtual task of the first task, configures a task scheduling period of the virtual task according to the data output period of the first task, and schedules the virtual task according to the task scheduling period; and identifying a target second task depending on the first task from second tasks in a second system according to the output data information of the first task, and establishing a dependency relationship between the target second task and the virtual task.

In this embodiment, besides establishing the dependency relationship between the target second task and the virtual task, the task scheduling period of the virtual task is further configured according to the data output period of the first task, so that the task scheduling period of the virtual task is consistent with the data output period of the first task. For example, if the data output period of the first task is hourly, the task scheduling period of the virtual task is also hourly. Keeping the task scheduling period of the virtual task consistent with the data output period of the first task guarantees accurate execution of the task.

In an optional implementation manner, scheduling the virtual task according to the task scheduling period specifically includes: determining a plurality of cycle units according to the task scheduling period, and generating a virtual task instance corresponding to each cycle unit. In the actual implementation process, a scheduling cron expression of the virtual task is generated according to the task scheduling period, and the virtual task is scheduled based on the scheduling cron expression. Specifically, a plurality of cycle units may be determined according to the task scheduling period, each cycle unit corresponding to a time span whose length equals the task scheduling period. For example, if the task scheduling period is hourly, the cycle units 1:00-2:00, 2:00-3:00, and so on are determined, each cycle unit having a time length of 1 hour. Each cycle unit corresponds to a virtual task instance, where a virtual task instance is the result of instantiating the virtual task and is the minimum execution unit of the virtual task. For example, the cycle unit 1:00-2:00 corresponds to virtual task instance 1, the cycle unit 2:00-3:00 corresponds to virtual task instance 2, and so on.
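The division into cycle units and the generation of one virtual task instance per cycle unit can be sketched as below. The sketch uses plain `datetime` arithmetic in place of the scheduling cron expression mentioned in the text, and the function names and the dictionary layout of an instance are assumptions.

```python
from datetime import datetime, timedelta


def cycle_units(start, period, count):
    """Return `count` consecutive (begin, end) cycle units of length `period`."""
    return [(start + i * period, start + (i + 1) * period) for i in range(count)]


def make_instances(virtual_task_id, units):
    """Create one virtual task instance per cycle unit, each initially in
    the generated state."""
    return {
        unit: {"task": virtual_task_id, "state": "generated"}
        for unit in units
    }


units = cycle_units(datetime(2022, 10, 28, 1, 0), timedelta(hours=1), 2)
instances = make_instances("virtual::danmaku_ingest", units)
```

Here `units` holds the cycle units 1:00-2:00 and 2:00-3:00 of the example, and `instances` maps each of them to its own virtual task instance.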

Step S440, the first system sends task state information of the first task to the second system; the task state information of the first task comprises the task state information of the first task in any period unit.

Specifically, since the first task outputs data periodically, the first task may also be divided into a plurality of cycle units according to its data output period, and these cycle units coincide with the cycle units of the task scheduling period of the virtual task. The task state information of the first task includes the task state information of the first task in any cycle unit. For example, if the first task is currently collecting the bullet-screen data of cycle unit 1:00-2:00, the task state of the first task in cycle unit 1:00-2:00 is task executing; if the first task has currently output the bullet-screen data of cycle unit 1:00-2:00, the task state of the first task in cycle unit 1:00-2:00 is data output completed.

Step S450, the second system updates the task state of the virtual task according to the task state information of the first task.

Specifically, the second system updates the instance state of the virtual task instance corresponding to a cycle unit according to the task state information of the first task in that cycle unit. Instance states include: generated, instance executing, and instance completed. Generated indicates that the virtual task instance has been generated and has not yet been executed; instance executing indicates that the virtual task instance is currently being executed; instance completed indicates that the virtual task instance has finished. For example, if the task state of the first task in cycle unit 1:00-2:00 is data output completed, the second system, on receiving this task state information of the first task, changes the virtual task instance corresponding to cycle unit 1:00-2:00 to instance completed.
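A per-cycle-unit update can be sketched as follows. The mapping from task states to instance states is an assumption inferred from the example above, and cycle units are represented as plain strings for brevity.

```python
# Assumed mapping from first-task states to virtual task instance states.
STATE_TO_INSTANCE_STATE = {
    "task pending execution": "generated",
    "task executing": "instance executing",
    "data output completed": "instance completed",
}


def update_instance(instances, cycle_unit, first_task_state):
    """Update the instance state for `cycle_unit` only; the instances of
    all other cycle units are left untouched."""
    if cycle_unit not in instances:
        raise KeyError(f"no virtual task instance for cycle unit {cycle_unit!r}")
    instances[cycle_unit] = STATE_TO_INSTANCE_STATE[first_task_state]
    return instances
```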

In an optional implementation manner, if the task state of the first task is that the first task is in a data output completion state in a preset period unit, the first system sends information that the first task is in the data output completion state in the preset period unit to the second system; the second system determines the receiving time of the information that the first task is in the state of finishing data output in the preset period unit, and compares the receiving time with the termination time of the preset period unit; and if the receiving time is not matched with the termination time, generating alarm information. Specifically, if the receiving time is earlier than the terminating time, it indicates that the data output by the first task in the preset period unit are incomplete, so as to generate corresponding alarm information. Or, if the receiving time is later than the terminating time and the interval between the receiving time and the terminating time exceeds the preset threshold, it indicates that there is a large delay in the data output of the first task, thereby triggering an alarm.
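The comparison of the receiving time with the termination time of the cycle unit can be sketched as below. The function name, the alarm texts, and the `max_delay` parameter (standing for the preset threshold) are assumptions.

```python
from datetime import datetime, timedelta


def check_receive_time(received_at, unit_end, max_delay):
    """Return an alarm message, or None when the receiving time is acceptable."""
    if received_at < unit_end:
        # Completion was reported before the cycle unit ended.
        return "incomplete: data output reported before the cycle unit ended"
    if received_at - unit_end > max_delay:
        # Completion arrived later than the preset threshold allows.
        return "delayed: data output reported too long after the cycle unit ended"
    return None
```

For the cycle unit 1:00-2:00 with a threshold of 10 minutes, a completion message received at 1:50 yields the first alarm, one received at 2:05 yields no alarm, and one received at 2:30 yields the second alarm.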

In yet another optional implementation manner, if the task state of the virtual task is data output completed, the second system performs data quality detection on the output data of the first task, thereby ensuring accurate execution of the downstream task. The data quality detection includes but is not limited to: data integrity checks (e.g., checking whether there is a large amount of null data), data specification checks (e.g., checking whether the data format conforms to the specification), and so on. If the output data of the first task fails the data quality detection, the dependency detection of the target second task is not triggered, and corresponding alarm information can be generated; if the output data of the first task passes the data quality detection, the dependency detection of the target second task is triggered.
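A minimal sketch of gating the dependency detection on data quality detection is shown below. The row format (a list of dictionaries), the `max_null_ratio` threshold, and both function names are assumptions; a real DQC module would apply its own rules.

```python
def passes_quality_checks(rows, required_fields, max_null_ratio=0.1):
    """Specification check (every row has exactly the required fields)
    followed by an integrity check (share of rows with null values)."""
    if not rows:
        return False
    for row in rows:
        if set(row) != set(required_fields):
            return False  # the data format violates the specification
    nulls = sum(1 for row in rows if any(v is None for v in row.values()))
    return nulls / len(rows) <= max_null_ratio


def on_output_done(rows, required_fields, trigger_dependency_detection, alarm):
    """Run quality detection first; trigger the dependency detection of
    the target second task only when the output data passes."""
    if passes_quality_checks(rows, required_fields):
        trigger_dependency_detection()
        return True
    alarm("output data of the first task failed data quality detection")
    return False
```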

Further optionally, when the second system is a job scheduling system for big data, because the job scheduling system usually includes a DQC (data quality detection) module and/or an SLA (baseline alarm) module, the second system may invoke the DQC module to implement the data quality detection on the output data of the first task, and the data quality detection may further include a process of comparing the receiving time with the termination time of the preset period unit; and/or calling an SLA module to realize the generation of the alarm information. Therefore, the reuse of the existing functions in the job scheduling system is realized, and the development cost is saved.

Therefore, the embodiment of the invention configures the task scheduling period of the virtual task according to the data output period of the first task, and schedules the virtual task according to the task scheduling period, thereby ensuring the accurate execution of the task; and the second system updates the instance state of the virtual task instance corresponding to the corresponding period unit according to the task state information of the first task in any period unit, so that the execution precision of the task is further improved.

Fig. 5 is a flowchart illustrating a task processing method for a big data platform according to another embodiment of the present invention. The task processing method of the big data platform provided by the embodiment of the invention can be executed by a first system in a preset big data platform. The flowcharts in the embodiments of the present invention are not used to limit the order of executing the steps. Some steps in the flowchart may be added or deleted as desired.

Specifically, as shown in fig. 5, the method includes the steps of:

Step S510, a first task is created.

Step S520, the attribute information of the first task is sent to a second system, so that the second system creates a virtual task of the first task, identifies a target second task depending on the first task from the second tasks in the second system according to the attribute information of the first task, and establishes a dependency relationship between the target second task and the virtual task.

Step S530, sending the task state information of the first task to the second system, so that the second system updates the task state of the virtual task according to the task state information of the first task, and determines whether to execute a target second task having a dependency relationship with the virtual task according to the task state of the virtual task.

The specific implementation process of the embodiment of the present invention may refer to descriptions in other method embodiments, which are not described herein again.

Therefore, the embodiment of the invention changes the cross-system task dependence into the same-system task dependence, reduces the system overhead caused by frequent cross-system polling and saves the system resources. And the task state of the first task is timely synchronized to the second system, so that the task state of the virtual task in the second system is consistent with the task state of the first task, the target second task is conveniently and quickly executed, and the task processing efficiency is improved.

Fig. 6 is a flowchart illustrating a task processing method for a big data platform according to another embodiment of the present invention. The task processing method of the big data platform provided by the embodiment of the invention can be executed by a second system in a preset big data platform. The flowcharts in the embodiments of the present invention are not used to limit the order of executing the steps. Some steps in the flowchart may be added or deleted as desired.

Specifically, as shown in fig. 6, the method includes the steps of:

Step S610, receiving attribute information of the first task sent by the first system, and identifying a target second task dependent on the first task from the second tasks according to the attribute information of the first task.

Step S620, a virtual task of the first task is created, and a dependency relationship between the target second task and the virtual task is established.

Step S630, receiving task state information of the first task sent by the first system, updating a task state of the virtual task according to the task state information of the first task, and determining whether to execute a target second task having a dependency relationship with the virtual task according to the task state of the virtual task.

In an optional implementation manner, creating the virtual task of the first task specifically includes: judging whether a target second task depending on the first task exists in the second system according to the attribute information of the first task; and if so, creating a virtual task of the first task.

Therefore, the embodiment of the invention changes the cross-system task dependence into the same-system task dependence, reduces the system overhead caused by frequent cross-system polling and saves the system resources. And the task state of the first task can be synchronized to the virtual task in time, so that the target second task can be executed quickly, and the task processing efficiency is improved.

Fig. 7 shows a schematic structural diagram of a first system according to an embodiment of the present invention. As shown in fig. 7, the first system 700 includes: a creation module 710, and a sending module 720.

A creating module 710 for creating a first task;

a sending module 720, configured to send the attribute information of the first task to a second system, so that the second system creates a virtual task of the first task, identifies a target second task depending on the first task from second tasks in the second system according to the attribute information of the first task, and establishes a dependency relationship between the target second task and the virtual task; and to send the task state information of the first task to the second system, so that the second system updates the task state of the virtual task according to the task state information of the first task and determines whether to execute the target second task according to the task state of the virtual task.

Fig. 8 is a schematic structural diagram of a second system according to an embodiment of the present invention. As shown in fig. 8, the second system 800 includes: a receiving module 810, an identifying module 820, a creating module 830, an establishing module 840, and an updating module 850.

A receiving module 810, configured to receive attribute information of a first task sent by a first system, and receive task state information of the first task sent by the first system;

an identifying module 820, configured to identify a target second task dependent on the first task from second tasks according to the attribute information of the first task;

a creating module 830, configured to create a virtual task of the first task;

an establishing module 840, configured to establish a dependency relationship between the target second task and the virtual task;

an updating module 850, configured to update the task state of the virtual task according to the task state information of the first task, and determine whether to execute the target second task according to the task state of the virtual task.

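In an optional implementation manner, the identifying module 820 is configured to: acquire input data information of a second task, compare the input data information of the second task with the output data information of the first task, and determine a second task whose input data information matches the output data information of the first task as a target second task.

In an optional implementation manner, the creating module 830 is configured to: judge whether a target second task depending on the first task exists in the second system according to the attribute information of the first task; and if so, create a virtual task of the first task.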

Fig. 9 shows a schematic structural diagram of a big data platform according to an embodiment of the present invention. As shown in fig. 9, a big data platform 900 includes a first system 700 and a second system 800.

Fig. 10 is a schematic structural diagram of a computing device according to an embodiment of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the computing device.

As shown in fig. 10, the computing device may include: a processor 1002, a communication interface 1004, a memory 1006, and a communication bus 1008.

Wherein: the processor 1002, communication interface 1004, and memory 1006 communicate with each other via a communication bus 1008. A communication interface 1004 for communicating with network elements of other devices, such as clients or other servers. The processor 1002 is configured to execute the program 1010, and may specifically perform relevant steps in the above embodiment of the task processing method for a big data platform.

In particular, the program 1010 may include program code that includes computer operating instructions.

The processor 1002 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The computing device includes one or more processors, which may be the same type of processor, such as one or more CPUs, or different types of processors, such as one or more CPUs and one or more ASICs.

The memory 1006 is used for storing the program 1010. The memory 1006 may comprise high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory. The program 1010 may be specifically adapted to cause the processor 1002 to execute the method in any of the above-described method embodiments.

The embodiment of the invention provides a nonvolatile computer storage medium, wherein at least one executable instruction is stored in the computer storage medium, and the executable instruction can be used to execute the task processing method of the big data platform in any of the above method embodiments.

The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best modes of embodiments of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the devices in an embodiment may be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that although some embodiments herein include some features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components according to embodiments of the present invention. Embodiments of the invention may also be implemented as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing embodiments of the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit embodiments of the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.
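As a non-limiting illustration, the following Python sketch shows one way the second system's handling of a virtual task could be organized: a virtual task is created for the first task, second tasks whose input data table identifiers match the first task's output data table identifiers are linked to it, and a target second task becomes ready once the virtual task reaches the data output completed state. All class, method, and table names (`SecondSystem`, `register_first_task`, `dw.orders`, and so on) are hypothetical and do not appear in the embodiments; the state names are paraphrased, and any precondition checked before the virtual task is created is omitted.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    # Paraphrased state names; the wording is an assumption of this sketch.
    WAITING = "waiting for task execution"
    RUNNING = "task executing"
    OUTPUT_DONE = "data output completed"
    DORMANT = "task dormant"
    DELETED = "task deleted"


@dataclass
class VirtualTask:
    """Stand-in, inside the second system, for a task owned by the first system."""
    first_task_id: str
    output_tables: set
    state: TaskState = TaskState.WAITING


@dataclass
class SecondTask:
    name: str
    input_tables: set
    upstream: list = field(default_factory=list)  # virtual tasks this task depends on


class SecondSystem:
    def __init__(self, second_tasks):
        self.second_tasks = list(second_tasks)
        self.virtual_tasks = {}

    def register_first_task(self, first_task_id, output_tables):
        """Create the virtual task and link every second task that reads its output."""
        virtual = VirtualTask(first_task_id, set(output_tables))
        self.virtual_tasks[first_task_id] = virtual
        targets = [t for t in self.second_tasks
                   if t.input_tables & virtual.output_tables]
        for task in targets:
            task.upstream.append(virtual)
        return targets

    def update_state(self, first_task_id, state):
        """Apply task state information pushed by the first system."""
        self.virtual_tasks[first_task_id].state = state

    def ready_to_run(self):
        """Names of target second tasks whose virtual upstream tasks have all finished."""
        return [t.name for t in self.second_tasks
                if t.upstream
                and all(v.state is TaskState.OUTPUT_DONE for v in t.upstream)]
```

In this sketch the dependency check in `ready_to_run` reads only state held inside the second system, which reflects the stated aim of turning a cross-system dependency into a same-system dependency without polling the first system.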

Claims

1. A task processing method of a big data platform is characterized in that the big data platform comprises a first system and a second system, and the method comprises the following steps:

2. The method of claim 1, wherein the attribute information of the first task comprises: output data information of the first task;

3. The method of claim 2, wherein determining the second task that matches the input data information with the output data information of the first task as a target second task further comprises:

4. The method of claim 3,

if the output data information comprises an output data table identifier, the input data information comprises an input data table identifier;

5. The method of any of claims 1-4, wherein the second system creating a virtual task of the first task and identifying, from second tasks in the second system, a target second task that depends on the first task according to the attribute information of the first task further comprises:

and if so, creating a virtual task of the first task.

6. The method according to any of claims 1-5, wherein the task state comprises at least one of: waiting for task execution, task executing, data output completed, task dormant, and task deleted.

7. The method of any of claims 1-6, wherein the attribute information of the first task comprises: a data output period of the first task;

8. The method of claim 7, wherein said scheduling the virtual task according to the task scheduling period further comprises: determining a plurality of period units according to the task scheduling period, and generating a virtual task instance corresponding to each period unit;

9. The method of claim 8, wherein if the task status of the first task is that the first task is in a data output completion status in a preset period unit;

10. The method according to any one of claims 1-9, further comprising: if the task state of the virtual task is data output completed, the second system performs data quality detection on the output data of the first task.

11. A method for task processing of a big data platform, the method being performed by a first system in the big data platform, the method comprising:

creating a first task;

12. A method for task processing of a big data platform, the method being performed by a second system in the big data platform, the method comprising:

13. The method of claim 12, wherein the attribute information of the first task comprises: output data information of the first task;

14. The method of claim 13, wherein determining a second task that matches input data information with output data information of the first task as a target second task further comprises:

15. The method according to any of claims 12-14, wherein the creating the virtual task of the first task further comprises:

and if so, creating a virtual task of the first task.

16. The method of any of claims 12-15, wherein the attribute information of the first task comprises: a data output period of the first task;

17. The method of claim 16, wherein said scheduling the virtual task according to the task scheduling period further comprises: determining a plurality of period units according to the task scheduling period, and generating a virtual task instance corresponding to each period unit;

18. The method according to claim 17, wherein if the task status of the first task is that the first task is in a data output completion status in a preset period unit;

19. The method according to any one of claims 12-18, further comprising: if the task state of the virtual task is data output completed, performing data quality detection on the output data of the first task.

20. A first system, characterized in that the system comprises:

a creation module for creating a first task;

21. A second system, characterized in that the system comprises:

a creating module for creating a virtual task of the first task;

22. A big data platform, comprising: a first system as claimed in claim 20 and a second system as claimed in claim 21.

23. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with each other through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the task processing method of the big data platform according to any one of claims 11-19.

24. A computer storage medium, wherein at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to execute the operation corresponding to the task processing method of the big data platform according to any one of claims 11 to 19.
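As a non-limiting illustration of the period units recited in claims 8 and 17, the following Python sketch generates one virtual task instance per period unit. It assumes, purely for illustration, that the task scheduling period is a fixed interval (one day by default) and that a period unit is identified by its start date; `VirtualTaskInstance` and `generate_instances` are hypothetical names that do not appear in the claims.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class VirtualTaskInstance:
    first_task_id: str
    period_unit: date        # start date identifying the period unit (assumption)
    state: str = "waiting"   # initial state label is an assumption of this sketch


def generate_instances(first_task_id, start, end, period=timedelta(days=1)):
    """Return one virtual task instance for each period unit from start to end inclusive."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    instances = []
    current = start
    while current <= end:
        instances.append(VirtualTaskInstance(first_task_id, current))
        current += period
    return instances
```

For example, `generate_instances("t1", date(2022, 10, 28), date(2022, 10, 30))` yields three instances, one for each day in the range.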