CN113791902A - Data processing method and device, electronic equipment and storage medium - Google Patents


Publication number
CN113791902A
CN113791902A
Authority
CN
China
Prior art keywords
data processing
node
nodes
overloaded
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111013855.4A
Other languages
Chinese (zh)
Inventor
金峙廷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd

Classifications

    • G06F9/505 — Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, considering the load
    • G06F9/5088 — Techniques for rebalancing the load in a distributed system involving task migration
    • G06F2209/5017 — Indexing scheme relating to G06F9/50: task decomposition

Abstract

The present disclosure relates to a data processing method, an apparatus, an electronic device, and a storage medium. The method comprises: determining a plurality of nodes to be confirmed based on operation state parameters of data processing nodes; determining overloaded nodes from the plurality of nodes to be confirmed; and determining a processing method for the data processing tasks on the overloaded nodes according to the ratio of the number of overloaded nodes to the number of data processing nodes in the current data processing cluster. The method and device can solve the problem in the related art that relieving operator computation pressure through hardware expansion consumes excessive hardware resources.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
At present, with the explosive growth of data, data processing and computation increasingly depend on big data platforms. Big data platforms are divided into real-time and offline data processing. Offline data processing does not impose high performance requirements because of some time buffering, whereas a real-time big data platform bears very heavy data pressure: a large amount of data needs to be received, computed, and output. Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams, and can run in all common cluster environments. In real-time big data processing, Flink performs excellently: it can ingest high-volume real-time data for high-throughput, low-latency processing, and provides rich API interfaces. However, when the computing capability of a Flink operator hits a bottleneck in the face of a sudden surge in data traffic, computation delay results.
In the related art, when the computation pressure on an operator is found to be high, operation and maintenance personnel must expand capacity manually, relieving the current operator's computation pressure by adding hardware resources. This approach, however, requires provisioning new machines, which causes excessive hardware resource consumption.
Disclosure of Invention
The present disclosure provides a data processing method, an apparatus, an electronic device, and a storage medium, to at least solve a problem of excessive consumption of hardware resources caused by relieving operator computation pressure through hardware expansion in the related art. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a data processing method, including:
determining a plurality of nodes to be confirmed based on the operation state parameters of the data processing nodes;
determining an overloaded node from the plurality of nodes to be confirmed;
and determining a processing method for the data processing tasks on the overloaded nodes based on the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster.
In an exemplary embodiment, the operation state parameters include an input-end network cache utilization rate and an output-end network cache utilization rate;
the determining a plurality of nodes to be confirmed based on the operating state parameters of the data processing nodes comprises:
determining a data processing cluster corresponding to the current traversal cycle;
determining the node to be confirmed in the current traversal period from the data processing cluster corresponding to the current traversal period; wherein the input end network cache utilization rate of the node to be confirmed is greater than the output end network cache utilization rate.
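For illustration only, the selection rule above can be sketched as follows. The function name and the dict-based node representation are hypothetical, not part of the claimed method; in a real deployment the two utilization rates would come from the cluster's monitoring metrics. A node whose input-end cache utilization exceeds its output-end cache utilization is consuming data more slowly than it receives it, so it becomes a node to be confirmed.

```python
def find_nodes_to_confirm(cluster):
    """Select the to-be-confirmed nodes for the current traversal cycle.

    `cluster` is a hypothetical list of dicts with `name`, `in_usage`,
    and `out_usage` keys (input-end and output-end network cache
    utilization rates).
    """
    return [node["name"] for node in cluster
            if node["in_usage"] > node["out_usage"]]


cluster = [
    {"name": "op-a", "in_usage": 0.9, "out_usage": 0.3},  # backlogged
    {"name": "op-b", "in_usage": 0.2, "out_usage": 0.6},  # draining fine
]
print(find_nodes_to_confirm(cluster))  # ['op-a']
```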
In an exemplary embodiment, the determining an overloaded node from the plurality of nodes to be confirmed includes:
and determining the data processing node which is determined as the node to be confirmed in a continuous preset number of traversal cycles as the overload node.
In an exemplary embodiment, the determining, based on the ratio of the number of overloaded nodes to the number of data processing nodes in the current data processing cluster, a processing method for the data processing tasks on the overloaded nodes includes:
and when the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster is less than or equal to a preset value, determining that the data processing task is processed based on the data processing nodes in the current data processing cluster.
In an exemplary embodiment, the determining, based on the ratio of the number of overloaded nodes to the number of data processing nodes in the current data processing cluster, a processing method for the data processing tasks on the overloaded nodes includes:
and when the ratio of the number of the overload nodes to the number of the data processing nodes in the current data processing cluster is larger than a preset value, determining to create a new node for the current data processing cluster based on preset hardware resources, and processing the data processing task based on the data processing nodes in the data processing cluster after the new node is added.
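The two branches above can be sketched as a single decision function. The function name and the default threshold are hypothetical (the disclosure only speaks of "a preset value"); the point is that at or below the threshold the work is redistributed inside the existing cluster, and above it new nodes are first created from the preset hardware resources.

```python
def choose_processing_strategy(num_overloaded, num_nodes, threshold=0.5):
    """Decide how to process the tasks on overloaded nodes.

    `threshold` stands in for the preset value from the disclosure;
    0.5 is an arbitrary illustrative default.
    """
    ratio = num_overloaded / num_nodes
    if ratio <= threshold:
        # Few nodes are overloaded: spare capacity exists in the cluster.
        return "rebalance-within-cluster"
    # Most nodes are overloaded: create new nodes from the preset
    # hardware resources, then rebalance over the enlarged cluster.
    return "create-new-nodes-then-rebalance"
```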
In an exemplary embodiment, the processing the data processing task based on the data processing node in the current data processing cluster includes:
traversing the overloaded nodes, performing the following operations based on each of the overloaded nodes:
performing task splitting on a data processing task at a current overload node to obtain a plurality of subtasks;
processing the plurality of subtasks based on the current overloaded node and at least one candidate node; the at least one candidate node is a data processing node in the current data processing cluster except the current overloaded node.
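A minimal sketch of the task-splitting step, assuming the overloaded node's task can be modeled as a batch of records (the disclosure does not fix a data model, so this representation is hypothetical):

```python
def split_task(records, num_subtasks):
    """Split one data processing task (a batch of records) into
    `num_subtasks` roughly equal subtasks by dealing records out
    round-robin."""
    chunks = [[] for _ in range(num_subtasks)]
    for i, rec in enumerate(records):
        chunks[i % num_subtasks].append(rec)
    return chunks


# Split a 5-record task between the overloaded node and one candidate.
print(split_task([1, 2, 3, 4, 5], 2))  # [[1, 3, 5], [2, 4]]
```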
In an exemplary embodiment, the processing the data processing task based on the data processing node in the data processing cluster after the node is newly added includes:
determining idle resources in the preset hardware resources, and creating the new node in the idle resources;
determining the data processing cluster containing the newly added node as a current data processing cluster;
traversing the overloaded nodes, performing the following operations based on each of the overloaded nodes:
performing task splitting on a data processing task at a current overload node to obtain a plurality of subtasks;
processing the plurality of subtasks based on the current overloaded node and at least one candidate node; the at least one candidate node is a data processing node in the current data processing cluster except the current overloaded node.
In an exemplary embodiment, the processing the plurality of subtasks based on the currently overloaded node and at least one candidate node comprises:
assigning the plurality of subtasks to the current overloaded node and at least one candidate node; wherein the subtasks allocated to the current overloaded node are different from those allocated to the at least one candidate node;
sending a first processing result obtained by processing the distributed subtasks by the current overload node and a second processing result obtained by processing the distributed subtasks by at least one candidate node to an aggregation node, so that the aggregation node obtains a data processing result corresponding to the data processing task based on the first processing result and the second processing result;
and the aggregation node is a data processing node except the current overload node in the current data processing cluster.
In an exemplary embodiment, the assigning the plurality of subtasks to the currently overloaded node, and at least one candidate node includes:
generating an additional identifier corresponding to each of the subtasks;
determining a subtask corresponding to the current overload node and a subtask corresponding to at least one candidate node based on the additional identifier corresponding to each subtask;
and distributing the subtask corresponding to the current overload node, and distributing the subtask corresponding to at least one candidate node.
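The identifier-based assignment can be sketched as follows. The `sub-<n>` identifier scheme and the modulo mapping are illustrative assumptions; the disclosure only requires that each subtask carry an additional identifier and that the identifier determine which node the subtask goes to.

```python
def assign_by_identifier(subtasks, overloaded, candidates):
    """Tag each subtask with an additional identifier, then use the
    identifier to map subtasks onto the overloaded node and its
    candidate nodes deterministically."""
    workers = [overloaded] + candidates
    tagged = [(f"sub-{i}", st) for i, st in enumerate(subtasks)]
    mapping = {}
    for ident, st in tagged:
        # Hypothetical rule: the numeric part of the identifier, modulo
        # the number of workers, picks the target node.
        idx = int(ident.split("-")[1]) % len(workers)
        mapping.setdefault(workers[idx], []).append((ident, st))
    return mapping
```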
In an exemplary embodiment, the processing result of each sub-task carries the task identifier of the data processing task;
the method further comprises the following steps:
extracting, at the current overloaded node, the processing result carrying the task identifier, and determining it as the first processing result;
and extracting, at the at least one candidate node, the processing result carrying the task identifier, and determining it as the second processing result.
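The result-collection step at the aggregation node can be sketched as filtering by the parent task's identifier and merging what remains. The `(task_id, value)` pair representation and the summation merge are hypothetical; the actual merge depends on the operator.

```python
def collect_results(all_results, task_id):
    """Keep only processing results that carry the given task
    identifier, then merge them into one data processing result.

    `all_results` is a hypothetical list of (task_id, value) pairs
    gathered from the overloaded node and the candidate nodes.
    """
    matching = [value for tid, value in all_results if tid == task_id]
    # Illustrative merge step; real aggregation is operator-specific.
    return sum(matching)


# Results from two subtasks of task "t1", plus one unrelated result.
print(collect_results([("t1", 2), ("t2", 9), ("t1", 3)], "t1"))  # 5
```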
According to a second aspect of the embodiments of the present disclosure, there is provided a data processing apparatus including:
a node to be confirmed determination unit configured to perform determination of a plurality of nodes to be confirmed based on the operation state parameter of the data processing node;
an overloaded node determination unit configured to perform determination of an overloaded node from the plurality of nodes to be confirmed;
and the processing method determining unit is configured to execute a processing method for determining the data processing task on the overloaded node based on the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster.
In an exemplary embodiment, the operation state parameters include an input-end network cache utilization rate and an output-end network cache utilization rate;
the to-be-confirmed node determination unit includes:
a first determination unit configured to perform determination of a data processing cluster corresponding to a current traversal cycle;
the second determining unit is configured to determine the node to be confirmed in the current traversal cycle from the data processing cluster corresponding to the current traversal cycle; wherein the input end network cache utilization rate of the node to be confirmed is greater than the output end network cache utilization rate.
In an exemplary embodiment, the overload node determining unit includes:
a third determination unit configured to perform determination of a data processing node determined as a node to be confirmed within a consecutive preset number of traversal cycles as the overloaded node.
In an exemplary embodiment, the processing method determination unit includes:
and the first processing method determining unit is configured to determine to process the data processing task based on the data processing node in the current data processing cluster when the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster is less than or equal to a preset value.
In an exemplary embodiment, the processing method determination unit includes:
and the second processing method determining unit is configured to determine to create a new node for the current data processing cluster based on preset hardware resources when the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster is greater than a preset value, and process the data processing task based on the data processing nodes in the data processing cluster after the new node is added.
In an exemplary embodiment, the first processing method determination unit includes:
a first traversal unit configured to perform traversing the overloaded nodes, performing the following operations based on each of the overloaded nodes:
the first splitting unit is configured to perform task splitting on a data processing task at a current overload node to obtain a plurality of subtasks;
a first processing unit configured to perform processing of the plurality of subtasks based on the currently overloaded node and at least one candidate node; the at least one candidate node is a data processing node in the current data processing cluster except the current overloaded node.
In an exemplary embodiment, the second processing method determination unit includes:
the node creating unit is configured to determine idle resources in the preset hardware resources and create the new node in the idle resources;
a fourth determination unit configured to perform determination of the data processing cluster containing the newly added node as the current data processing cluster;
a second traversal unit configured to perform traversal of the overloaded nodes, performing the following operations based on each of the overloaded nodes:
the second splitting unit is configured to perform task splitting on the data processing task at the current overload node to obtain a plurality of subtasks;
a second processing unit configured to perform processing of the plurality of subtasks based on the currently overloaded node and at least one candidate node; the at least one candidate node is a data processing node in the current data processing cluster except the current overloaded node.
In an exemplary embodiment, the first processing unit or the second processing unit includes:
a task allocation unit configured to perform allocation of the plurality of subtasks to the current overloaded node and at least one candidate node; wherein the subtasks allocated to the current overloaded node are different from those allocated to the at least one candidate node;
a result aggregation unit configured to perform sending, to an aggregation node, a first processing result obtained by the current overloaded node processing its allocated subtasks and a second processing result obtained by at least one candidate node processing its allocated subtasks, so that the aggregation node obtains a data processing result corresponding to the data processing task based on the first processing result and the second processing result;
and the aggregation node is a data processing node except the current overload node in the current data processing cluster.
In an exemplary embodiment, the task allocation unit includes:
an additional identifier generating unit configured to perform generating an additional identifier corresponding to each of the subtasks;
a fifth determining unit, configured to perform determining a subtask corresponding to the current overloaded node and a subtask corresponding to at least one of the candidate nodes based on the additional identifier corresponding to each of the subtasks;
a second allocating unit configured to perform allocation of a subtask corresponding to the current overloaded node, and allocate a subtask corresponding to at least one of the candidate nodes.
In an exemplary embodiment, the processing result of each sub-task carries the task identifier of the data processing task;
the device further comprises:
a first extraction unit, configured to perform extracting, at the current overloaded node, the processing result carrying the task identifier, and determining it as the first processing result;
a second extraction unit, configured to perform extracting, at the at least one candidate node, the processing result carrying the task identifier, and determining it as the second processing result.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the data processing method as described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium in which instructions, when executed by a processor of a server, enable the server to perform the data processing method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method comprises the steps of firstly determining a plurality of nodes to be confirmed based on operating parameters of data processing nodes, and then determining overload nodes from the plurality of nodes to be confirmed; determining a processing method for the data processing tasks on the overloaded nodes based on the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster; the data processing nodes in the data processing cluster are created based on preset hardware resources corresponding to the data processing cluster, namely the preset hardware resources corresponding to the data processing cluster are not changed, so that the data processing tasks on the overload nodes are processed based on the preset hardware resources, the data processing tasks on the overload nodes are not required to be processed in a mode of increasing the hardware resources, the consumption of excessive hardware resources can be avoided, and the resource use efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram of a data processing cluster, shown in accordance with an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of data processing according to an example embodiment.
Fig. 3 is a flowchart illustrating a method for determining a node to be confirmed according to an exemplary embodiment.
FIG. 4 is a flowchart illustrating a method of determining a processing method for a data processing task in accordance with an exemplary embodiment.
FIG. 5 is a flowchart illustrating a method of processing a data processing task in accordance with an exemplary embodiment.
FIG. 6 is a flow diagram illustrating another method of processing a data processing task in accordance with an illustrative embodiment.
FIG. 7 is a flowchart illustrating a subtask processing method according to an exemplary embodiment.
FIG. 8 is a flowchart illustrating a method of task assignment according to an example embodiment.
FIG. 9 is a flowchart illustrating a method of processing result generation in accordance with an exemplary embodiment.
FIG. 10 is a block diagram illustrating a data processing apparatus according to an example embodiment.
Fig. 11 is a schematic diagram illustrating an apparatus configuration according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Please refer to fig. 1, which illustrates a schematic diagram of a data processing cluster provided in an embodiment of the present disclosure. The cluster may include a cluster management device 110 and at least one data processing device 120. Each data processing device 120 includes a plurality of data processing nodes. The cluster management device 110 may distribute tasks to the data processing devices 120 so that the data processing nodes in each device process the received tasks; it may also apply for and allocate resources to add new data processing nodes to the current cluster for task processing. The cluster management device 110 and the data processing devices 120 may specifically be devices such as servers.
Each data processing node may correspond to a part of hardware resources on one data processing device, that is, each data processing device may include at least one data processing node, and the data processing device may allocate a corresponding computing resource to each data processing node.
Flink is a framework and distributed processing engine for stateful computation of unbounded and bounded data streams. Flink is designed to run in all common clustered environments, performing computations at memory execution speeds and arbitrary scales.
In order to solve the problem of excessive consumption of hardware resources caused by relieving operator computation pressure through hardware expansion in the related art, an embodiment of the present disclosure provides a data processing method, please refer to fig. 2, an execution subject of the method may be a cluster management device in the data processing cluster of fig. 1, and the method may include:
and S210, determining a plurality of nodes to be confirmed based on the operation state parameters of the data processing nodes.
The method can be applied to various types of data processing clusters; a cluster running the Flink framework is taken as an example. Flink provides rich interfaces: once deployed on a machine, it runs a JobManager and a TaskManager, and these two components can monitor the physical machine (for example, the CPU and memory of a server) as well as the throughput of the operators used by tasks. The JobManager, also called the Master, coordinates distributed execution, including resource application and task distribution. At least one Master processor exists at Flink runtime, and multiple Master processors exist if a high-availability mode is configured. The TaskManager, also called the Worker, executes data stream tasks and handles data buffering and data stream exchange; at least one Worker processor exists at runtime. Master and Worker processors can be started directly on a physical machine. A Worker processor connects to the Master processor, announces its availability, and obtains task assignments.
In the embodiment of the present disclosure, the cluster management device has a JobManager, the data processing device has a TaskManager, and each data processing device may specifically be a physical machine, which includes at least one data processing node, and the at least one data processing node may perform unified management and scheduling through the TaskManager of the physical machine.
After the Flink task is started, the data processing cluster generates a graph structure (JobGraph), nodes in the graph structure are operators of the task, and all the operators can be extracted through the JobGraph provided by the Flink. Specifically, the operators in the current data processing cluster can be stored in the queue, traversed, and the computing power of each operator is calculated, so as to evaluate the pressure condition of the operators.
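The queue-and-traverse pass over the operators can be sketched as follows. The data shapes here are hypothetical stand-ins: in a real deployment the operator list would be extracted from Flink's JobGraph and the metrics would come from the TaskManager, and the pressure score shown is merely an illustrative proxy for "computing power under load".

```python
from collections import deque

def traverse_operators(job_graph_nodes):
    """Queue all operators extracted from the job graph and evaluate
    each one's pressure in turn.

    `job_graph_nodes` is a hypothetical list of (operator_name, metrics)
    pairs, where metrics carries input-end and output-end cache usage.
    """
    queue = deque(job_graph_nodes)
    pressures = {}
    while queue:
        name, metrics = queue.popleft()
        # Illustrative pressure score: input backlog minus output drain.
        pressures[name] = metrics["in_usage"] - metrics["out_usage"]
    return pressures
```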
The operation parameters of the data processing node in the embodiment of the present disclosure may include an input end network cache utilization rate and an output end network cache utilization rate; accordingly, referring to fig. 3, a method for determining a node to be confirmed is shown, which may include:
and S310, determining a data processing cluster corresponding to the current traversal cycle.
S320, determining the node to be confirmed in the current traversal period from the data processing cluster corresponding to the current traversal period; wherein the input end network cache utilization rate of the node to be confirmed is greater than the output end network cache utilization rate.
Because data processing nodes may be deleted or added, a data processing cluster is generally in dynamic change, and the clusters corresponding to different time points may differ; the data processing cluster corresponding to the current traversal cycle may therefore be the cluster as it stands at the starting time of the current traversal cycle. By first determining the corresponding cluster in each traversal cycle and then determining the nodes to be confirmed from that cluster, the accuracy of the determination is improved. In different traversal cycles the data processing nodes in the corresponding cluster may change, and in each cycle the nodes to be confirmed must be determined against the cluster corresponding to that cycle; hence the cluster corresponding to the current traversal cycle is determined first, and then each data processing node in it is judged to be a node to be confirmed or not according to its input-end and output-end network cache utilization rates. Specifically, when a data processing node A is determined to be a node to be confirmed, a to-be-confirmed count for node A may be incremented to record the number of times it has been so regarded. The to-be-confirmed count is a consecutive count: if the run is interrupted in any traversal cycle, counting restarts from zero.
It should be noted that, in each traversal cycle, each data processing node in the data processing cluster corresponding to the traversal cycle is judged once whether the node is a node to be confirmed; when the judgment process is finished, checking the count of the nodes to be confirmed of each data processing node to confirm the number of times that each data processing node is determined as a node to be confirmed.
In a specific embodiment, traversal cycles 1 through 5 are consecutive. Data processing node A is determined to be a node to be confirmed in both cycle 1 and cycle 2, so its to-be-confirmed count is 2 after cycle 2; it is not determined to be a node to be confirmed in cycle 3, so its count is reset to 0 after cycle 3; and so on.
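The consecutive-count bookkeeping described above can be sketched as a per-cycle update (function name and data shapes are illustrative assumptions):

```python
def update_candidate_counts(counts, flagged_this_cycle):
    """Per traversal cycle: increment a node's to-be-confirmed count if
    it was flagged this cycle, otherwise reset it to zero, because the
    count must be consecutive."""
    for node in counts:
        counts[node] = counts[node] + 1 if node in flagged_this_cycle else 0
    return counts


# Replay the five-cycle example: A flagged in cycles 1, 2, 4, 5 but not 3.
counts = {"A": 0}
for cycle_flags in [{"A"}, {"A"}, set(), {"A"}, {"A"}]:
    update_candidate_counts(counts, cycle_flags)
print(counts["A"])  # 2  (reset to 0 after cycle 3, then two consecutive hits)
```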
S220, determining an overloaded node from the plurality of nodes to be confirmed.
The data processing nodes determined as nodes to be confirmed in a consecutive preset number of traversal cycles are determined as the overloaded nodes; in one embodiment, a data processing node may be determined to be an overloaded node when its to-be-confirmed count reaches a preset number. The overloaded nodes may be determined after the to-be-confirmed determination has been made for every data processing node in the cluster in each traversal cycle, because the determination of overloaded nodes relies on the to-be-confirmed results of the current cycle.
The TaskManager in each data processing device can monitor the state of the data processing nodes in that device and actively report their state data to the JobManager of the cluster management device, so that the JobManager obtains the state information of every data processing node in the cluster; the reported data can follow a predefined format. Because the node state information is actively reported by the TaskManager of the data processing device where the node resides, rather than passively detected through an independent monitoring system, time delays are avoided and the relevant data is obtained in time, which improves data processing efficiency.
In addition, because a big data platform often experiences resource jitter, some data processing nodes may occasionally undergo transient disturbances from which the data traffic recovers by itself. A data processing node therefore cannot be judged to be an overloaded node merely because its running state data meets the preset condition in a single traversal cycle; the state data of several consecutive traversal cycles must be considered together when judging the state of a data processing node. For example, if the traversal of the data processing nodes and the reporting of nodes to be confirmed are performed every 5 minutes, a node can be determined as an overloaded node when it is determined as a node to be confirmed 3 times in a row. A node to be confirmed here is a node whose running state data in one traversal cycle is judged to correspond to overload-state data. Determining as overloaded only those data processing nodes judged to be in the overload state over several consecutive traversal cycles avoids accidental one-off determinations and improves the accuracy of overloaded-node determination.
Specifically, the running state data includes the input-end network cache usage rate inputBufferUsage and the output-end network cache usage rate outputBufferUsage, so the preset condition can be: when the input-end network cache usage rate of the current data processing node is greater than its output-end network cache usage rate, the current data processing node is determined as the node to be confirmed. Because inputBufferUsage and outputBufferUsage can be actively reported by the TaskManager in each data processing device, the relevant state data is easy to obtain, which further improves the convenience of judging the state of the data processing nodes.
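The per-cycle judgment and consecutive counting described above can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation; all identifiers (the function name, the metric tuple, the constant) are assumptions for the example.

```python
# Minimal sketch of the to-be-confirmed counting logic, assuming each node
# reports (inputBufferUsage, outputBufferUsage) once per traversal cycle.
# All names here are illustrative, not part of any Flink API.

REQUIRED_CONSECUTIVE_CYCLES = 3  # e.g. three 5-minute cycles in a row

def update_counts(cluster_metrics, counts):
    """cluster_metrics: {node_id: (input_usage, output_usage)} for one cycle.
    counts: mutable {node_id: consecutive to-be-confirmed count}.
    Returns the node ids determined as overloaded in this cycle."""
    overloaded = []
    for node_id, (in_usage, out_usage) in cluster_metrics.items():
        if in_usage > out_usage:      # node receives faster than it drains
            counts[node_id] = counts.get(node_id, 0) + 1
        else:
            counts[node_id] = 0       # any interruption restarts the count
        if counts[node_id] >= REQUIRED_CONSECUTIVE_CYCLES:
            overloaded.append(node_id)
    return overloaded
```

A node is reported as overloaded only after the preset number of consecutive cycles, so a single jitter cycle resets its count, matching the behavior described for traversal cycles 1 through 3.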
S230, determining a processing method for the data processing tasks on the overload nodes according to the ratio of the number of the overload nodes to the number of the data processing nodes in the current data processing cluster; wherein the data processing nodes are created based on preset hardware resources of the current data processing cluster.
The current data processing cluster may refer to a data processing cluster when an overloaded node is determined in each traversal cycle; in the embodiment of the disclosure, according to the difference between the ratio of the number of overloaded nodes to the number of data processing nodes in the current data processing cluster, the processing methods for the data processing tasks correspondingly determined are also different; referring specifically to fig. 4, a method of determining a processing method for a data processing task is shown, which may include:
S410, when the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster is less than or equal to a preset value, determining to process the data processing task based on the data processing nodes in the current data processing cluster.
S420, when the ratio of the number of the overload nodes to the number of the data processing nodes in the current data processing cluster is larger than a preset value, determining to create a new node for the current data processing cluster based on the preset hardware resource, and processing the data processing task based on the data processing nodes in the data processing cluster after the new node is added.
Because the ratio of the number of overloaded nodes to the number of data processing nodes in the cluster can characterize the state of the current data processing cluster, the cluster state here may be either overloaded or not overloaded. When the ratio is less than or equal to the preset value, the current cluster is in an un-overloaded state, and the data processing task can be processed directly on the nodes already in the cluster; this avoids extra handling of the data processing nodes, so the data processing task is processed promptly. When the ratio is greater than the preset value, the current cluster is in an overloaded state; by creating new nodes from the cluster's preset hardware resources rather than adding extra hardware, the data processing tasks at the overloaded nodes are processed while hardware resources are conserved. Determining a processing method adapted to the state of the current data processing cluster improves the processing efficiency of the data processing tasks.
In a specific implementation, assume the current data processing cluster has N data processing nodes, of which M are overloaded. When M/N is greater than a preset ratio, for example M/N > 80%, the data processing nodes of the current cluster can be considered overloaded as a whole, that is, the cluster is in an overloaded state; task parallelism then needs to be increased and the tasks at the overloaded nodes redistributed. Conversely, when M/N is less than or equal to 80%, the cluster can be considered to be in an un-overloaded state, even though individual overloaded nodes may exist.
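The decision in S410/S420 can be sketched as a simple threshold check. The 80% value follows the example above; the function name and the returned strategy labels are assumptions for illustration.

```python
# Sketch of the cluster-state decision: compare the overloaded-node ratio M/N
# against a preset value (80% in the example above) to pick a strategy.

OVERLOAD_RATIO_THRESHOLD = 0.8

def choose_strategy(num_overloaded, num_nodes):
    """Return which processing method applies to the tasks at overloaded nodes."""
    if num_nodes == 0:
        raise ValueError("cluster has no data processing nodes")
    if num_overloaded / num_nodes > OVERLOAD_RATIO_THRESHOLD:
        return "create-new-nodes"       # S420: cluster in overloaded state
    return "redistribute-existing"      # S410: cluster in un-overloaded state
```

Note that a ratio exactly equal to the preset value falls in the un-overloaded branch, as in S410.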
When a data processing node is determined to be an overloaded node, the task-processing or computational pressure at that node is high, and the data to be processed needs to be shunted, i.e., part of the data is redirected to other data processing nodes in the cluster. Because a data processing task generally comprises a plurality of data entries and corresponding operations, the data processing task can be split into a plurality of subtasks.
Referring to FIG. 5, a method of processing a data processing task is shown, which may include:
S510, traversing the overloaded nodes, and executing the following operations based on each overloaded node:
S520, splitting the data processing task at the current overloaded node into a plurality of subtasks.
S530, processing the plurality of subtasks based on the current overload node and at least one candidate node; the at least one candidate node is a data processing node in the current data processing cluster except the current overloaded node.
Fig. 5 shows the processing method for a data processing task when the current data processing cluster is in an un-overloaded state. This state indicates that the data processing cluster contains nodes able to share the processing pressure at an overloaded node, so the data processing task at the overloaded node can be distributed among the data processing nodes already in the cluster; letting the un-overloaded nodes share the overloaded node's pressure improves data processing efficiency.
Referring to FIG. 6, another method of processing a data processing task is shown, which may include:
S610, determining idle resources in the preset hardware resources, and creating the new node in the idle resources.
S620, determining the data processing cluster containing the newly added node as the current data processing cluster.
S630, traversing the overload nodes, and executing the following operations based on each overload node:
S640, splitting the data processing task at the current overloaded node into a plurality of subtasks.
S650, processing the plurality of subtasks based on the current overload node and at least one candidate node; the at least one candidate node is a data processing node in the current data processing cluster except the current overloaded node.
When a data processing cluster is created, corresponding preset hardware resources can be configured for it. During actual data task processing, the preset hardware resources of the cluster can be kept unchanged, that is, the data processing task is processed based on the existing hardware resources.
When the current data processing cluster is in an overloaded state, the existing data processing nodes in the cluster are not enough to share the processing pressure at the current overloaded node, and new nodes can be created based on the preset hardware resources. Specifically, the new node may be created in an idle portion of the preset resources: in the initial stage of a cluster, the number of data processing nodes created on the preset hardware may be small, so the preset resources are not fully utilized, and when more data processing nodes are needed, new nodes can be created in the idle resources. The newly added node is then added to the data processing cluster, and the cluster containing it is determined as the current data processing cluster, which improves hardware resource utilization; the subsequent splitting and allocation of the data processing task based on the current data processing cluster proceeds as in fig. 5.
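As a hedged sketch of S610, new nodes are created only within the idle portion of the preset hardware resources; modelling the resources as a simple slot count is an assumption made for illustration.

```python
# Illustrative sketch: cap node creation by the idle capacity of the cluster's
# preset hardware resources, so no extra hardware is ever added.

def create_new_nodes(preset_slots, used_slots, wanted):
    """Return the ids of nodes actually created in idle resources."""
    idle = preset_slots - used_slots
    created = min(wanted, max(idle, 0))   # never exceed preset resources
    return ["new-node-%d" % i for i in range(created)]
```

When the preset resources are fully occupied, no node is created, consistent with keeping the preset hardware resources unchanged.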
Referring to fig. 7, a subtask processing method is shown, which may include:
S710, allocating the plurality of subtasks to the current overloaded node and at least one candidate node; wherein the subtasks allocated to the current overloaded node differ from those allocated to the at least one candidate node.
S720, sending a first processing result, obtained by the current overloaded node processing its allocated subtasks, and a second processing result, obtained by the at least one candidate node processing its allocated subtasks, to an aggregation node, so that the aggregation node obtains a data processing result corresponding to the data processing task based on the first processing result and the second processing result;
and the aggregation node is a data processing node except the current overload node in the current data processing cluster.
When the current data processing cluster is in an un-overloaded state, it can be considered that data processing nodes capable of sharing the computational pressure of the overloaded node exist in the cluster, and the multiple subtasks can be allocated to different data processing nodes so that several nodes cooperatively complete one data processing task; this relieves the data processing pressure of the overloaded node and completes the data processing task efficiently.
When the current data processing cluster is in an un-overloaded state and the proportion of overloaded nodes in it is small, the problem of data skew caused by aggregation can be considered. Data skew arises when, during computation, hot keys are concentrated on one data processing node (operator), increasing that node's computational load while other nodes receive little or no input and are under little pressure. When this condition is detected, the keys can be scattered with a built-in hash algorithm or a random algorithm to ensure the data is well hashed before being fed to the relevant data processing nodes for computation; after the preliminary results are computed, a reducing hash step computes over the preliminary results again, and the final result is output.
If the current data processing cluster is in an overloaded state, the data processing nodes already in the cluster cannot complete the processing work of the data processing task, so new data processing nodes need to be created. In the embodiment of the present disclosure, the JobManager of the cluster management device may send a node-adding instruction to the TaskManager of a target data processing device, so that the TaskManager allocates computing resources to a new data processing node from the available idle resources of that device. One or more such newly added nodes may be generated, and by allocating the plurality of subtasks to the overloaded node and the newly added nodes for processing, the parallelism of task execution is increased, sharing the computational pressure of the overloaded node and improving data processing efficiency. In addition, because the newly added nodes are realized from the available resources of the existing data processing devices in the current cluster, no additional data processing devices, i.e., hardware resources, need to be added, which reduces hardware consumption and improves resource utilization.
In a specific embodiment, the number of newly added nodes may be determined from the ratio of the number of overloaded nodes to the number of data processing nodes in the current cluster: if that ratio is greater than the preset ratio, the difference between the two is calculated, and the number of newly added nodes is determined from this difference. The difference is positively correlated with the number of newly added nodes: the larger the difference, the more nodes are added; the smaller the difference, the fewer. Adaptively determining the number of newly added nodes from the ratio difference avoids both excessive node-resource management pressure from adding too many nodes and insufficient sharing of the current overloaded node's processing pressure from adding too few.
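The positive correlation between the ratio difference and the number of newly added nodes could be realized, for example, by a linear rule. The linear scaling by cluster size below is an assumption; the embodiment only requires that a larger difference yield more nodes.

```python
# Hedged sketch: derive the number of newly added nodes from how far the
# overload ratio exceeds the preset ratio (positive correlation).
import math

def num_new_nodes(num_overloaded, num_nodes, preset_ratio=0.8):
    diff = num_overloaded / num_nodes - preset_ratio
    if diff <= 0:
        return 0                  # cluster not overloaded: nothing to add
    return max(1, math.ceil(diff * num_nodes))
```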
Accordingly, referring to FIG. 8, a method of task assignment is shown that may include:
S810, generating an additional identifier corresponding to each subtask.
For each split subtask, an additional identifier may be generated; the additional identifier is used to identify the corresponding subtask, and the allocation of subtasks can be implemented based on it.
S820, determining a subtask corresponding to the current overload node and a subtask corresponding to at least one candidate node based on the additional identifier corresponding to each subtask.
S830, distributing the subtask corresponding to the current overload node, and distributing the subtask corresponding to at least one candidate node.
For example, suppose the data processing node currently experiencing data skew is S, the data processing task is to compute the SUM of the data, and the data format is <key, value>. Assume node S has gathered 4 pieces of data, namely <123,1>, <123,4>, <123,2>, <123,5>, which may correspond to 4 subtasks. Because all of them share key 123 and are therefore all routed to node S, the key can be scattered through a hash algorithm with a random function, e.g. key + "-" + (int)(Math.random() * 2), turning the 4 pieces of data into <123-0,1>, <123-1,4>, <123-0,2>, <123-1,5>. Here "-0" and "-1" can be regarded as the additional identifiers of the subtasks: the data with key value 123-0 can be allocated to data processing node S and the data with key value 123-1 to data processing node S+1, i.e., <123-0,1> and <123-0,2> are assigned to S while <123-1,4> and <123-1,5> are assigned to S+1.
Scattering the key values of the data with a hash algorithm or a random algorithm guarantees that the key values are hashed evenly, so that the plurality of subtasks can be allocated to different data processing nodes and the data processing task completed through their cooperation, which improves data processing efficiency.
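The key-scattering step can be sketched as follows. The sketch uses the record index instead of Math.random() so the result is deterministic, which is an assumption made for illustration (the example above draws the suffix randomly); the function and variable names are likewise illustrative.

```python
# Salt the hot key with an additional identifier ("-0"/"-1") so its records
# can be spread over two data processing nodes, as in the key-123 example.

def salt_key(key, value, index, fanout=2):
    """Append a deterministic suffix in [0, fanout) to the key."""
    return ("%s-%d" % (key, index % fanout), value)

records = [(123, 1), (123, 4), (123, 2), (123, 5)]   # all stuck on node S
salted = [salt_key(k, v, i) for i, (k, v) in enumerate(records)]
```

Records carrying key 123-0 would then be allocated to node S and those carrying 123-1 to node S+1.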
It should be noted that, because the key values are scattered by a hash or related algorithm to pick the data processing nodes (other than the overloaded node) to which the subtasks are allocated, and the resulting key values are random, the states of those nodes may be unknown; that is, a data processing node A chosen for allocation in the current traversal cycle might itself be an overloaded node and yet still receive subtasks. However, because every data processing node in the current cluster is continuously traversed, even if node A is allocated a subtask, the tasks on node A will themselves be split and their subtasks re-allocated; in the general case, the subtasks of an overloaded node are therefore eventually allocated to un-overloaded data processing nodes for processing.
In addition, when a task is started, the data processing nodes in the cluster can register with the data cluster management device, so that it knows the identification ID or key value of each data processing node. When the node to be allocated is determined from the additional identifier of a subtask, the data processing device hosting that node can be determined; the subtask is then sent to the corresponding device, and the TaskManager of that device allocates the task to the corresponding data processing node.
A corresponding additional identifier is generated for each subtask. Because the subtasks are obtained by splitting an original task, each subtask can also carry the task identifier of the original task; that is, the identifier of a subtask comprises the original task's task identifier plus the additional identifier. Correspondingly, when a data processing node processes a subtask, the resulting processing result also carries the subtask's identifier; in particular, the processing result of each subtask carries the task identifier of the data processing task. Accordingly, referring to fig. 9, a processing result generating method is shown, which may include:
S910, extracting the processing result carrying the task identifier at the current overloaded node, and determining that processing result as the first processing result.
S920, extracting the processing result carrying the task identifier at the at least one candidate node, and determining that processing result as the second processing result.
Because the original task is split into multiple subtasks processed separately, their processing results need to be aggregated to obtain the processing result of the original task. A data processing cluster contains many data processing nodes, and one node may process several different subtasks belonging to different original tasks; to conveniently find the subtask results corresponding to each data processing task, the lookup can be performed by the task identifier carried in each subtask's processing result. This avoids confusing the computation results and improves both the efficiency of result aggregation and the accuracy of the data computation results.
Continuing the example above, <123-0,1> and <123-0,2> are assigned to S, and <123-1,4> and <123-1,5> to S+1. Node S sums <123-0,1> and <123-0,2> to obtain <123-0,3>, and node S+1 sums <123-1,4> and <123-1,5> to obtain <123-1,9>; the data processing result extracted from node S is thus <123-0,3> and that from node S+1 is <123-1,9>, and both results correspond to the original task whose key is 123.
The candidate nodes and the aggregation node are all data processing nodes in the current data processing cluster other than the current overloaded node; the aggregation node is used to aggregate the computation results of the subtasks. In data stream processing the data flow is generally unidirectional: when the task is split, the allocated subtask data flows out of the overloaded node, so when results are aggregated the data does not normally flow back into the overloaded node, and the node performing the aggregation is therefore generally not an overloaded node. The aggregation node may be chosen by the JobManager of the cluster management device according to the pressure states of the data processing nodes in the current cluster, and the subtask processing results of the same data processing task are allocated to it for aggregation; for example, the data processing results <123-0,3> and <123-1,9> are aggregated into <123,12>.
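The two-stage computation just described (partial sums per salted key on each node, then a final reduction at the aggregation node that strips the additional identifier) can be sketched as below; the function names and the suffix-stripping convention are assumptions for illustration.

```python
# Two-phase aggregation matching the worked example: each node sums its salted
# keys, then the aggregation node strips the "-0"/"-1" suffix and sums again.
from collections import defaultdict

def partial_sums(salted_records):
    """Phase 1 (on each data processing node): SUM per salted key."""
    sums = defaultdict(int)
    for key, value in salted_records:
        sums[key] += value
    return dict(sums)

def final_sum(partials):
    """Phase 2 (on the aggregation node): drop the additional identifier
    and reduce the partial results to the original task's result."""
    totals = defaultdict(int)
    for salted_key, value in partials.items():
        original_key = salted_key.rsplit("-", 1)[0]
        totals[original_key] += value
    return dict(totals)

phase1 = partial_sums([("123-0", 1), ("123-0", 2), ("123-1", 4), ("123-1", 5)])
result = final_sum(phase1)
```

Here phase1 reproduces the per-node results <123-0,3> and <123-1,9>, and the final reduction yields <123,12>.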
The method first determines a plurality of nodes to be confirmed based on the operating parameters of the data processing nodes, and then determines the overloaded nodes from those nodes to be confirmed; the processing method for the data processing tasks on the overloaded nodes is determined based on the ratio of the number of overloaded nodes to the number of data processing nodes in the current data processing cluster. Because the data processing nodes in the cluster are created from the preset hardware resources corresponding to the cluster, that is, those preset hardware resources are not changed, the data processing tasks on the overloaded nodes are processed within the preset hardware resources rather than by adding hardware resources, which avoids the consumption of excessive hardware resources and improves resource use efficiency.
When the method is applied to a Flink real-time big-data platform, the pressure state of each operator in the cluster can be actively reported; after a pressured operator is found, computing resources can be adjusted automatically to even out the operator's computational load and relieve the computation bottleneck. The situations can further be distinguished, namely whether the overall computing power of the current cluster is insufficient or the data is skewed, and different task-processing responses adopted accordingly, as described above in this embodiment.
Fig. 10 is a block diagram illustrating a data processing apparatus according to an exemplary embodiment, referring to fig. 10, the apparatus including:
a node to be confirmed determination unit 1010 configured to perform determination of a plurality of nodes to be confirmed based on the operation state parameters of the data processing nodes;
an overloaded node determining unit 1020 configured to perform determining an overloaded node from the plurality of nodes to be confirmed;
a processing method determining unit 1030 configured to perform a processing method for determining a data processing task on the overloaded node based on a ratio of the number of the overloaded nodes to the number of data processing nodes in the current data processing cluster; wherein the data processing nodes are created based on preset hardware resources of the current data processing cluster.
In an exemplary embodiment, the operational status data includes input end network cache usage, and output end network cache usage;
the to-be-confirmed node determination unit 1010 includes:
a first determination unit configured to perform determination of a data processing cluster corresponding to a current traversal cycle;
the second determining unit is configured to determine the node to be confirmed in the current traversal cycle from the data processing cluster corresponding to the current traversal cycle; wherein the input end network cache utilization rate of the node to be confirmed is greater than the output end network cache utilization rate.
In an exemplary embodiment, the overloaded node determination unit 1020 includes:
a third determination unit configured to perform determination of a data processing node determined as a node to be confirmed within a consecutive preset number of traversal cycles as the overloaded node.
In an exemplary embodiment, the processing method determination unit includes:
and the first processing method determining unit is configured to determine to process the data processing task based on the data processing node in the current data processing cluster when the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster is less than or equal to a preset value.
In an exemplary embodiment, the processing method determination unit includes:
and the second processing method determining unit is configured to determine, when the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster is greater than a preset value, to create a new node for the current data processing cluster based on the preset hardware resource, and to process the data processing task based on the data processing nodes in the data processing cluster after the new node is added.
In an exemplary embodiment, the first processing method determination unit includes:
a first traversal unit configured to perform traversing the overloaded nodes, performing the following operations based on each of the overloaded nodes:
the first splitting unit is configured to perform task splitting on a data processing task at a current overload node to obtain a plurality of subtasks;
a first processing unit configured to perform processing of the plurality of subtasks based on the currently overloaded node and at least one candidate node; the at least one candidate node is a data processing node in the current data processing cluster except the current overloaded node.
In an exemplary embodiment, the second processing method determination unit includes:
the node creating unit is configured to determine idle resources in the preset hardware resources and create the new node in the idle resources;
a fourth determination unit configured to perform determination of the data processing cluster containing the newly added node as the current data processing cluster;

a second traversal unit configured to perform traversal of the overloaded nodes, performing the following operations based on each of the overloaded nodes:
the second splitting unit is configured to perform task splitting on the data processing task at the current overload node to obtain a plurality of subtasks;
a second processing unit configured to perform processing of the plurality of subtasks based on the currently overloaded node and at least one candidate node; the at least one candidate node is a data processing node in the current data processing cluster except the current overloaded node.
In an exemplary embodiment, the first processing unit or the second processing unit includes:
a task allocation unit configured to perform allocation of the plurality of subtasks to the current overloaded node and at least one candidate node; wherein the subtasks allocated to the current overloaded node differ from those allocated to the at least one candidate node;
the result aggregation unit is configured to send a first processing result, obtained by the current overloaded node processing its allocated subtasks, and a second processing result, obtained by the at least one candidate node processing its allocated subtasks, to an aggregation node, so that the aggregation node obtains a data processing result corresponding to the data processing task based on the first processing result and the second processing result;
and the aggregation node is a data processing node except the current overload node in the current data processing cluster.
In an exemplary embodiment, the task allocation unit includes:
an additional identifier generating unit configured to perform generating an additional identifier corresponding to each of the subtasks;
a fifth determining unit, configured to perform determining a subtask corresponding to the current overloaded node and a subtask corresponding to at least one of the candidate nodes based on the additional identifier corresponding to each of the subtasks;
a second allocating unit configured to perform allocation of a subtask corresponding to the current overloaded node, and allocate a subtask corresponding to at least one of the candidate nodes.
In an exemplary embodiment, the processing result of each sub-task carries the task identifier of the data processing task;
the device further comprises:
a first extraction unit, configured to extract a processing result carrying the task identifier at the current overloaded node, and determine the processing result carrying the task identifier at the current overloaded node as the first processing result;
a second extracting unit, configured to perform extraction of a processing result that carries the task identifier at least one of the candidate nodes, and determine the processing result that carries the task identifier at the at least one of the candidate nodes as the second processing result.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In an exemplary embodiment, there is also provided a computer-readable storage medium comprising instructions; the storage medium may be, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like. The instructions in the computer-readable storage medium, when executed by a processor of a server, enable the server to perform any of the methods described above.
Referring to fig. 11, the apparatus 1100 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1122 (e.g., one or more processors), a memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) storing an application program 1142 or data 1144. The memory 1132 and the storage media 1130 may be, for example, transitory or persistent storage. The program stored on a storage medium 1130 may include one or more modules (not shown), each of which may include a sequence of instruction operations on the apparatus. Further, the central processing unit 1122 may be configured to communicate with the storage medium 1130 to execute, on the apparatus 1100, the series of instruction operations in the storage medium 1130. The apparatus 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on. Any of the methods described above in this embodiment can be implemented based on the apparatus shown in fig. 11.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A data processing method, comprising:
determining a plurality of nodes to be confirmed based on operating state parameters of data processing nodes;
determining an overloaded node from the plurality of nodes to be confirmed;
and determining a processing manner for the data processing task on the overloaded node according to a ratio of the number of the overloaded nodes to the number of data processing nodes in a current data processing cluster.
2. The data processing method of claim 1, wherein the operating state parameters include an input-end network cache utilization rate and an output-end network cache utilization rate;
the determining a plurality of nodes to be confirmed based on the operating state parameters of the data processing nodes comprises:
determining a data processing cluster corresponding to the current traversal cycle;
determining, from the data processing cluster corresponding to the current traversal cycle, the nodes to be confirmed in the current traversal cycle; wherein the input-end network cache utilization rate of each node to be confirmed is greater than its output-end network cache utilization rate.
3. The data processing method of claim 2, wherein the determining an overloaded node from the plurality of nodes to be confirmed comprises:
determining, as the overloaded node, a data processing node that has been determined as a node to be confirmed in a preset number of consecutive traversal cycles.
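Claims 2 and 3 together describe an overload detector: a node is "to be confirmed" in a traversal cycle when its input-end network cache utilization exceeds its output-end utilization, and it is overloaded once this holds for a preset number of consecutive cycles. A hedged sketch, in which the history layout and the default window of three cycles are assumptions:

```python
# Sketch of the overload detection in claims 2-3. `history` maps each
# node to per-cycle (input_usage, output_usage) samples, newest last.
# The data layout and the default window of 3 cycles are assumptions.

def overloaded_nodes(history, consecutive=3):
    overloaded = []
    for node, samples in history.items():
        recent = samples[-consecutive:]
        # Node to be confirmed in a cycle: input-end usage > output-end usage.
        # Overloaded: confirmed in `consecutive` most recent cycles in a row.
        if len(recent) == consecutive and all(i > o for i, o in recent):
            overloaded.append(node)
    return overloaded

history = {
    "node-a": [(0.9, 0.4), (0.8, 0.3), (0.7, 0.2)],  # backlogged every cycle
    "node-b": [(0.9, 0.4), (0.2, 0.5), (0.7, 0.2)],  # recovered mid-way
}
print(overloaded_nodes(history))  # only node-a
```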
4. The data processing method of claim 1, wherein the determining a processing manner for the data processing task on the overloaded node according to the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster comprises:
when the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster is less than or equal to a preset value, determining to process the data processing task based on the data processing nodes in the current data processing cluster.
5. The data processing method of claim 1, wherein the determining a processing manner for the data processing task on the overloaded node according to the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster comprises:
when the ratio of the number of the overloaded nodes to the number of the data processing nodes in the current data processing cluster is greater than a preset value, determining to create a new node for the current data processing cluster based on preset hardware resources, and to process the data processing task based on the data processing nodes in the data processing cluster after the new node is added.
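The ratio comparison in claims 4 and 5 can be sketched as a single decision function. The threshold value and the strategy labels are illustrative assumptions; the claims only fix the comparison against a preset value.

```python
# Sketch of the ratio-based decision in claims 4-5. The preset value of
# 0.3 and the returned strategy labels are illustrative assumptions.

def choose_processing_manner(num_overloaded, num_nodes, preset_value=0.3):
    ratio = num_overloaded / num_nodes
    if ratio <= preset_value:
        return "process-within-current-cluster"   # claim 4
    return "create-new-node-then-process"         # claim 5

print(choose_processing_manner(2, 10))  # few overloaded nodes
print(choose_processing_manner(6, 10))  # too many overloaded nodes
```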
6. The data processing method of claim 4, wherein the processing the data processing task based on the data processing node in the current data processing cluster comprises:
traversing the overloaded nodes, and performing the following operations for each of the overloaded nodes:
performing task splitting on the data processing task at the current overloaded node to obtain a plurality of subtasks;
processing the plurality of subtasks based on the current overloaded node and at least one candidate node; wherein the at least one candidate node is a data processing node in the current data processing cluster other than the current overloaded node.
7. The data processing method of claim 5, wherein the creating a new node for the current data processing cluster based on the preset hardware resources, and processing the data processing task based on the data processing nodes in the data processing cluster after the new node is added, comprises:
determining idle resources in the preset hardware resources, and creating the new node in the idle resources;
determining the data processing cluster containing the newly added node as the current data processing cluster;
traversing the overloaded nodes, and performing the following operations for each of the overloaded nodes:
performing task splitting on the data processing task at the current overloaded node to obtain a plurality of subtasks;
processing the plurality of subtasks based on the current overloaded node and at least one candidate node; wherein the at least one candidate node is a data processing node in the current data processing cluster other than the current overloaded node.
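The per-overloaded-node loop shared by claims 6 and 7 (split the task at the current overloaded node into subtasks, then process them on the overloaded node together with the candidate nodes) can be sketched as follows; the even round-robin split is an illustrative assumption, as the claims do not fix a splitting rule.

```python
# Sketch of the per-overloaded-node loop in claims 6-7: split the data
# processing task into subtasks, then spread them over the current
# overloaded node plus the candidate nodes. The even round-robin split
# and all names here are illustrative assumptions.

def split_task(task_items, num_parts):
    parts = [[] for _ in range(num_parts)]
    for index, item in enumerate(task_items):
        parts[index % num_parts].append(item)
    return parts

def distribute(task_items, overloaded_node, candidate_nodes):
    nodes = [overloaded_node] + list(candidate_nodes)
    return dict(zip(nodes, split_task(task_items, len(nodes))))

plan = distribute(list(range(6)), "overloaded", ["cand-a", "cand-b"])
print(plan)  # each node receives two of the six work items
```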
8. A data processing apparatus, comprising:
a node-to-be-confirmed determining unit configured to determine a plurality of nodes to be confirmed based on operating state parameters of data processing nodes;
an overloaded node determining unit configured to determine an overloaded node from the plurality of nodes to be confirmed;
and a processing manner determining unit configured to determine a processing manner for the data processing task on the overloaded node according to a ratio of the number of the overloaded nodes to the number of data processing nodes in a current data processing cluster.
9. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the data processing method of any one of claims 1 to 7.
10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method of any of claims 1 to 7.
CN202111013855.4A 2021-08-31 2021-08-31 Data processing method and device, electronic equipment and storage medium Pending CN113791902A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111013855.4A CN113791902A (en) 2021-08-31 2021-08-31 Data processing method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113791902A true CN113791902A (en) 2021-12-14

Family

ID=78876742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111013855.4A Pending CN113791902A (en) 2021-08-31 2021-08-31 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113791902A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104917825A (en) * 2015-05-20 2015-09-16 中国科学院信息工程研究所 Load balancing method for real time stream computing platform
CN106375419A (en) * 2016-08-31 2017-02-01 东软集团股份有限公司 Deployment method and device of distributed cluster
CN111447272A (en) * 2020-03-26 2020-07-24 支付宝(杭州)信息技术有限公司 Load balancing method and device
US20200285508A1 (en) * 2019-03-08 2020-09-10 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and Apparatus for Assigning Computing Task
CN112118291A (en) * 2020-08-13 2020-12-22 北京思特奇信息技术股份有限公司 Load balancing system and method for service flow
CN112506950A (en) * 2020-12-10 2021-03-16 深圳前海微众银行股份有限公司 Data aggregation processing method, computing node, computing cluster and storage medium


Similar Documents

Publication Publication Date Title
US9104492B2 (en) Cloud-based middlebox management system
WO2019001092A1 (en) Load balancing engine, client, distributed computing system, and load balancing method
US8386607B2 (en) Method and system for utilizing a resource conductor to optimize resource management in a distributed computing environment
US20160188376A1 (en) Push/Pull Parallelization for Elasticity and Load Balance in Distributed Stream Processing Engines
Shah et al. Static load balancing algorithms in cloud computing: challenges & solutions
US20060085554A1 (en) System and method for balancing TCP/IP/workload of multi-processor system based on hash buckets
JP6881575B2 (en) Resource allocation systems, management equipment, methods and programs
WO2023050901A1 (en) Load balancing method and apparatus, device, computer storage medium and program
CN109960575B (en) Computing capacity sharing method, system and related equipment
CN108270805B (en) Resource allocation method and device for data processing
CN111459641B (en) Method and device for task scheduling and task processing across machine room
CN107977271B (en) Load balancing method for data center integrated management system
US11438271B2 (en) Method, electronic device and computer program product of load balancing
CN109039933B (en) Cluster network optimization method, device, equipment and medium
Li et al. Enabling elastic stream processing in shared clusters
CN112231098A (en) Task processing method, device, equipment and storage medium
CN111418187A (en) Scalable statistics and analysis mechanism in cloud networks
CN114564313A (en) Load adjustment method and device, electronic equipment and storage medium
CN114116173A (en) Method, device and system for dynamically adjusting task allocation
CN111158904A (en) Task scheduling method, device, server and medium
Datta A new task scheduling method for 2 level load balancing in homogeneous distributed system
CN112685167A (en) Resource using method, electronic device and computer program product
CN113791902A (en) Data processing method and device, electronic equipment and storage medium
Kaur et al. Challenges to task and workflow scheduling in cloud environment
Jayapandian et al. The online control framework on computational optimization of resource provisioning in cloud environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination