CN114490027A - Distributed job adjustment method, master node, system, physical machine, and storage medium

Info

Publication number: CN114490027A
Application number: CN202111583453.8A
Authority: CN (China)
Prior art keywords: execution, data, stage, node, partition
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 韩颖, 闵雪宾, 张炜, 汤志鹏, 郑君正, 陆一峰, 陈颖达
Current Assignee: Alibaba Cloud Computing Ltd
Original Assignee: Alibaba Cloud Computing Ltd
Application filed by Alibaba Cloud Computing Ltd
Priority to CN202111583453.8A
Publication of CN114490027A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a distributed job adjustment method, a master node, a system, a physical machine, and a storage medium. The method includes: acquiring a job submitted by a user; generating an execution plan of the job, where the multiple execution stages of the execution plan include an upstream execution stage and a directly downstream execution stage; during execution of the job, acquiring statistical information of the output data of the upstream execution stage; and configuring the directly downstream execution stage according to the statistical information. Embodiments of the present application can dynamically adjust the configuration of the downstream execution stage based on the output of the upstream execution stage, for example by configuring the concurrency of the downstream execution stage, allocating data partitions, or selecting a subsequent execution path. Furthermore, embodiments of the present application can optimize the description of execution plans in the deep learning field and make resource allocation more reasonable. The method and device can improve the performance of a distributed system.

Description

Distributed job adjustment method, master node, system, physical machine, and storage medium
The present application is a divisional application of application No. 202110950182.9, filed on August 18, 2021, and entitled "Distributed job adjustment method, master node, system, physical machine, and storage medium".
Technical Field
Embodiments of the present application relate to the field of distributed technologies, and in particular to a distributed job adjustment method, a master node, a system, a physical machine, and a storage medium.
Background
A distributed system is formed by interconnecting multiple physical machines through communication links and is characterized by distribution, autonomy, parallelism, and global coordination. A distributed system executes jobs submitted by users, and its distributed computing capability can improve job execution efficiency.
A distributed system mainly comprises a master node and worker nodes. A job submitted by a user is used by the master node to generate an execution plan, which the master node may then configure. Based on the configuration of the execution plan, the master node can schedule worker nodes and resources during job execution to carry out the job. Given the wide range of applications of distributed systems, those skilled in the art are constantly striving to improve their performance.
Disclosure of Invention
In view of this, embodiments of the present application provide a distributed job adjustment method, a master node, a system, a physical machine, and a storage medium, so that an execution plan can be configured dynamically, accurately, and reasonably during job execution, realizing dynamic adjustment of the job configuration and thereby improving the performance of a distributed system.
In order to achieve the above object, the embodiments of the present application provide the following technical solutions.
In a first aspect, an embodiment of the present application provides a distributed job adjustment method, including:
acquiring a job submitted by a user;
generating an execution plan for the job, the execution plan including a plurality of execution phases including an upstream execution phase and an immediately downstream execution phase of the upstream execution phase;
in the execution process of the operation, acquiring statistical information of output data of the upstream execution stage;
and configuring the direct downstream execution stage according to the statistical information, so that the direct downstream execution stage executes the operation based on the configuration result.
In a second aspect, an embodiment of the present application provides a distributed job adjustment method, including:
acquiring a deep learning job;
generating an execution plan of the deep learning job, the execution plan including a plurality of execution stages including a worker execution stage and a resource optimization execution stage, where the worker execution stage is used to compute gradients of the deep learning parameters;
during execution of the deep learning job, scheduling a resource optimization node corresponding to the resource optimization execution stage, and determining, through the resource optimization node, historically used resource information that matches the current execution state of the deep learning job;
and configuring the resource information for the worker execution stage through the resource optimization node.
In a third aspect, an embodiment of the present application provides a master node configured to execute the distributed job adjustment method according to the first aspect or the second aspect.
In a fourth aspect, an embodiment of the present application provides a distributed system, where the distributed system includes a master node and multiple working nodes, and the master node is the master node according to the third aspect.
In a fifth aspect, embodiments of the present application provide a physical machine comprising at least one memory and at least one processor, the memory storing one or more computer-executable instructions, the processor invoking the one or more computer-executable instructions to perform the distributed job adjustment method according to the first or second aspect.
In a sixth aspect, embodiments of the present application provide a storage medium storing one or more computer-executable instructions that, when executed, implement the distributed job adjustment method according to the first or second aspect.
In the distributed job adjustment method provided by embodiments of the present application, after acquiring a job submitted by a user, the master node may generate an execution plan of the job, where the execution plan includes multiple execution stages, including an upstream execution stage and a directly downstream execution stage of the upstream execution stage. During execution of the job, the master node may obtain statistical information of the output data of the upstream execution stage and configure the directly downstream execution stage according to that statistical information, so that the directly downstream execution stage executes the job based on the configuration result. Because the configuration of the downstream execution stage is dynamically adjusted during job execution based on the data actually output by the upstream execution stage, the configuration of the downstream execution stage adapts to the actual execution result of the upstream stage, and its concurrency, resources, and other settings match the specific execution situation of the job, improving the reasonableness and accuracy of the execution plan configuration. The distributed job adjustment method provided by embodiments of the present application can therefore dynamically adjust the configuration of the execution plan during job execution, so that the configuration of the execution plan conforms to how the job actually executes; the job is then executed based on the dynamically configured execution plan, allowing the distributed system to complete job execution more reasonably and efficiently and significantly improving the job execution performance of the distributed system.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1A is a schematic structural diagram of a distributed system.
FIG. 1B is a schematic diagram of an execution plan.
FIG. 1C is a schematic illustration of a DAG.
FIG. 1D is a mapping diagram of a logic diagram and a physical diagram.
Fig. 2 is a flowchart of a distributed job adjustment method according to an embodiment of the present application.
FIG. 3A is a schematic diagram of the processing data allocation for the immediate downstream stage.
Fig. 3B is another flowchart of a distributed job adjustment method according to an embodiment of the present application.
FIG. 3C is an exemplary diagram of Partition assignment to a directly downstream stage.
Fig. 4A is an exemplary diagram of a data shuffle.
Fig. 4B is a further flowchart of a distributed job adjustment method according to an embodiment of the present application.
Fig. 4C is an exemplary diagram of Partition resolution performed in an embodiment of the present application.
FIG. 5A is an exemplary diagram of the Join process.
FIG. 5B is another exemplary diagram of the Join process.
Fig. 5C is a flowchart of a distributed job adjustment method according to an embodiment of the present application.
FIG. 5D is a diagram illustrating yet another example of the Join process.
FIG. 5E is a diagram showing still another example of the Join process.
FIG. 5F is an exemplary diagram of a union operation, further illustrated on the basis of FIG. 5D.
FIG. 5G is an exemplary diagram of a union operation, further illustrated on the basis of FIG. 5E.
FIG. 6A is an exemplary graph of Sort Merge Join.
FIG. 6B is an exemplary diagram of Broadcast Join.
Fig. 6C is yet another flowchart of a distributed job adjustment method according to an embodiment of the present application.
FIG. 6D is a diagram of an execution plan with multiple execution paths.
FIG. 7A is a flow diagram of generating an execution plan that carries multiple execution paths.
FIGS. 7B, 7C, 7D, and 7E illustrate exemplary processes for converting a physical plan into an execution plan.
FIG. 7F is an exemplary diagram of a complete execution plan after an execution path is selected.
FIG. 7G is another exemplary diagram of a complete execution plan after selection of an execution path.
FIG. 8A is an exemplary diagram of a PS stage and a Worker stage connected in parallel.
Fig. 8B is a further flowchart of the distributed job adjustment method according to the embodiment of the present application.
FIG. 8C is an exemplary diagram of the Resource Optimization node adjusting the resources of the Worker node.
Fig. 9 is a block diagram of a physical machine.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
A distributed job may be understood as a job that is submitted to a distributed system for execution. Fig. 1A schematically illustrates a structure of a distributed system. As shown in fig. 1A, a distributed system may include: a master node 110 and a plurality of worker nodes 120. The master node 110 and the worker nodes 120 may be considered to be computing nodes in a distributed system, the computing nodes may be carried by physical machines with data computing capabilities, and one physical machine may carry one or more computing nodes.
In a distributed system, the master node 110 is a computing node for management and control. For example, the master node 110 may manage the worker nodes 120, coordinate the concurrency and resources associated with jobs during various execution phases of the execution plan, and the like. In some aspects, the master node 110 acts as a central governing node in a distributed system, also referred to as an execution engine of the distributed system. Worker node 120 is a computing node in a distributed system that specifically executes jobs, which may be managed and coordinated by master node 110 to execute jobs.
When the distributed system executes a job, the job may be submitted by a user to the cluster resource manager through a terminal, and the cluster resource manager pulls up the master node 110. The master node 110 may then parse the job and generate an execution plan. An execution plan describes the process by which the data of a job, starting from the source tables, undergoes a series of data flows, executions, and transformations until the final output is produced. FIG. 1B is a schematic diagram illustrating an execution plan. As shown in FIG. 1B, the execution plan may include multiple stages (execution stages) with a hierarchical relationship; in some embodiments, the stages may form a tree-like hierarchy. A stage may include one or more tasks. For each stage, by configuring the number of worker nodes (the concurrency), the resources used, and so on, the master node 110 can schedule multiple worker nodes to execute the tasks of the stage in parallel, thereby executing the job in the distributed system.
In some embodiments, jobs are typically submitted to the distributed system by the terminal in a request. In one example, the job submitted by the terminal includes a Query statement, such as an SQL (Structured Query Language) statement, that queries the database.
In further embodiments, the execution plan may be described by a DAG (Directed Acyclic Graph). The DAG includes a plurality of vertices (vertex) and connecting edges (edge) between the vertices. FIG. 1C illustrates a schematic diagram of a DAG. It is noted that the actual number of vertices, levels, connecting edges of the DAG may be more complex than fig. 1C, which is merely a simple DAG example shown for ease of understanding. As shown in fig. 1C, the DAG may include 4 vertices V1-V4, and connecting edges 11, 12, 13, and 14. Wherein, the connecting side 11 connects the vertexes V1 and V2, the connecting side 12 connects the vertexes V1 and V3, the connecting side 13 connects the vertexes V2 and V4, and the connecting side 14 connects the vertexes V3 and V4.
A vertex in the DAG may represent an independent stage in the execution plan. The connecting edges between vertices are directed and represent relationships between the vertices. Based on the direction of a connecting edge, an edge attached to a vertex may be an input connecting edge of that vertex (the edge points to the vertex) or an output connecting edge of that vertex (the edge points from the vertex to another vertex). For example, in FIG. 1C, connecting edge 12 points to V3 and is therefore an input connecting edge of V3; connecting edge 14 is output by V3 and is an output connecting edge of V3; connecting edge 12 is output by V1, so connecting edge 12 also serves as an output connecting edge of V1; connecting edge 14 is input to V4, so connecting edge 14 also serves as an input connecting edge of V4.
Of two vertices connected by a connecting edge, the vertex that outputs the edge is called the direct upstream vertex of the other, and the vertex that receives the edge as input is called the direct downstream vertex of the other. For example, in FIG. 1C, connecting edge 12 connects V1 and V3: V1 outputs connecting edge 12 and connecting edge 12 is input to V3, so V1 may be called the direct upstream vertex of V3 and V3 the direct downstream vertex of V1. A vertex may have one or more direct upstream vertices and one or more direct downstream vertices. It should be noted that, in addition to direct upstream vertices, a vertex may also have indirect upstream vertices: an indirect upstream vertex is not directly connected to the vertex, but is located at an upper layer and reaches the vertex through one or more intermediate vertices. For example, in FIG. 1C, V1 is above V4 and is connected to V4 via V2 or V3, so V1 may be called an indirect upstream vertex of V4. Similarly, a vertex may have indirect downstream vertices in addition to direct downstream vertices: an indirect downstream vertex is not directly connected to the vertex, but is located at a lower layer and is reached through one or more intermediate vertices. For example, in FIG. 1C, V4 is below V1 and is connected to V1 via V2 or V3, so V4 may be called an indirect downstream vertex of V1. The upstream vertices of a vertex include its direct and indirect upstream vertices, and the downstream vertices of a vertex include its direct and indirect downstream vertices.
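As an illustration only (the patent does not prescribe any code), the following minimal Python sketch builds the DAG of FIG. 1C and derives direct and indirect upstream/downstream vertices; the class and method names are assumptions made for this example.

```python
from collections import defaultdict

class DAG:
    """Minimal DAG of execution stages: vertices plus directed connecting edges."""

    def __init__(self):
        self.downstream = defaultdict(set)  # vertex -> direct downstream vertices
        self.upstream = defaultdict(set)    # vertex -> direct upstream vertices

    def add_edge(self, src, dst):
        # The edge is an output connecting edge of src and an input connecting edge of dst.
        self.downstream[src].add(dst)
        self.upstream[dst].add(src)

    def all_downstream(self, vertex):
        # Direct plus indirect downstream vertices, found by simple traversal.
        seen, stack = set(), list(self.downstream[vertex])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(self.downstream[v])
        return seen

# The DAG of FIG. 1C: connecting edges 11-14 between vertices V1-V4.
dag = DAG()
dag.add_edge("V1", "V2")  # edge 11
dag.add_edge("V1", "V3")  # edge 12
dag.add_edge("V2", "V4")  # edge 13
dag.add_edge("V3", "V4")  # edge 14

print(sorted(dag.downstream["V1"]))      # ['V2', 'V3']: direct downstream vertices of V1
print(sorted(dag.all_downstream("V1")))  # ['V2', 'V3', 'V4']: V4 is an indirect downstream vertex
print(sorted(dag.upstream["V4"]))        # ['V2', 'V3']: direct upstream vertices of V4
```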
The execution of the vertex may depend on the direct upstream vertex, that is, the vertex and the direct upstream vertex have an execution dependency relationship, and the vertex needs to be executed after the execution of the direct upstream vertex; the execution of the vertices may also not depend on the immediately upstream vertex, but may be performed in parallel with the immediately upstream vertex.
In further embodiments, the DAG may have two levels of representation: a logical graph and a physical graph. The logical graph may be considered a natural extension of the execution plan, describing the data execution flow the user wants the job to implement. The physical graph shows how each stage of the execution plan maps onto the distributed system, describing physical attributes of each stage such as the concurrency of the execution stage, the resources used by the worker nodes, and the data transmission mode.
FIG. 1D is a diagram illustrating an exemplary mapping of a logical graph to a physical graph. For ease of illustration, FIG. 1D is only illustrated with an execution plan having 4 stages. As shown in FIG. 1D, the logic diagram depicts the 4 vertices of the execution plan (vertices V0, V1, V2, and V3) and the relationship of the vertices (e.g., vertex V0 points to vertex V2, vertex V1 and vertex V2 points to vertex V3), one vertex corresponding to one stage of the execution plan. The logic diagram may embody the data execution flow of the execution plan. After mapping the logic diagram into the physical diagram, the physical diagram may describe physical attributes such as the concurrency degree that each stage needs to be configured, the resources (e.g., CPU resources, memory resources, etc.) used by the working nodes of each stage, and the data transmission manner. For example, in connection with the example of fig. 1D, the physical diagram illustrates that vertex V0 needs to be configured with 3 working nodes (with a concurrency of 3), and vertices V1, V2, and V3 need to be configured with 2 working nodes (with a concurrency of 2), respectively. That is, the physical graph is capable of expressing the physical properties of vertices and connecting edges in the DAG. Through the physical attributes of the vertex and the connecting edge described by the physical diagram, the main node can schedule the working nodes and resources for each stage, so that tasks in the stages can be executed by a plurality of working nodes in parallel, and the execution of the jobs in the distributed system is realized.
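For illustration, the physical attributes described above might be represented as follows; the field names and resource values are assumptions for the FIG. 1D example and are not prescribed by the patent.

```python
from dataclasses import dataclass

@dataclass
class PhysicalAttrs:
    concurrency: int   # number of worker nodes scheduled for the stage
    cpu_cores: int     # CPU resource per worker node (assumed unit)
    memory_mb: int     # memory resource per worker node (assumed unit)
    transfer: str      # data transmission mode on the input edge, e.g. "shuffle"

# Physical graph of FIG. 1D: V0 runs with concurrency 3, the other vertices with concurrency 2.
physical_graph = {
    "V0": PhysicalAttrs(concurrency=3, cpu_cores=2, memory_mb=4096, transfer="shuffle"),
    "V1": PhysicalAttrs(concurrency=2, cpu_cores=2, memory_mb=4096, transfer="shuffle"),
    "V2": PhysicalAttrs(concurrency=2, cpu_cores=2, memory_mb=4096, transfer="shuffle"),
    "V3": PhysicalAttrs(concurrency=2, cpu_cores=2, memory_mb=4096, transfer="shuffle"),
}
```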
The configuration execution plan referred to in the embodiments of the present application may include logic for configuring the execution plan, and physical attributes for configuring the execution plan. The logic for configuring the execution plan may be considered to configure the execution plan in the logic layer, for example, configure an execution flow of the execution plan. The physical attributes for configuring the execution plan may be considered to be that the execution plan is configured on the physical layer, for example, physical attributes such as concurrency, resources, and data transmission methods of the stages of the execution plan are configured.
If the execution plan is configured before the job actually executes and cannot be adjusted afterwards, it is called a static execution plan. That is, a static execution plan is fully configured before job execution and cannot be adjusted while the job runs. However, before the job executes, neither the actual resources needed by each stage of the execution plan nor a reasonable execution path can be estimated accurately, which makes it difficult for the configuration of a static execution plan to be reasonable and accurate, and reduces the performance of the distributed system when executing the job.
Based on this, the embodiment of the present application provides a novel execution plan configuration scheme, which can dynamically adjust the configuration of a downstream stage based on the data output result of an upstream stage in the job execution process, so that the configuration of the downstream stage can adapt to the actual execution result of the upstream stage, and thus the configuration of the concurrency, resources and the like of the downstream stage can be matched with the specific execution condition of the job, and the reasonability and accuracy of the execution plan configuration are improved.
As an alternative implementation, fig. 2 is a flowchart illustrating a distributed job adjustment method provided by an embodiment of the present application. The method flow may be implemented by the master node, and referring to fig. 2, the method flow may include the following steps.
In step S210, a job submitted by a user is acquired.
In step S211, an execution plan of the job is generated, the execution plan including a plurality of stages including an upstream stage and a stage immediately downstream of the upstream stage.
After the terminal submits the job to the distributed system, the master node in the distributed system may parse the job, generating an execution plan for the job, which may be described by a DAG. The execution plan may include a plurality of stages, which may include an upstream stage and a stage immediately downstream of the upstream stage. In some embodiments, any stage of the plurality of stages having a downstream stage may be considered as the upstream stage, and output data of the upstream stage may be input to the directly downstream stage for processing during specific execution of the job. The output data of an upstream stage can be input into one or more direct downstream stages, and one direct downstream stage can also input the output data of one or more upstream stages; that is, an upstream stage may correspond to one or more directly downstream stages, and a directly downstream stage may also correspond to one or more upstream stages. In some embodiments, an upstream stage may be referred to as a stage immediately preceding a downstream stage, and an immediately downstream stage may be referred to as a stage immediately following the upstream stage.
In one example, as shown in FIG. 1C, V1 is the upstream stage of V2 and V3, and V2 and V3 are the directly downstream stages of V1; during job execution, the output data of V1 can be input into V2 and V3 for processing. Likewise, V2 and V3 serve as upstream stages of V4, and V4 serves as the directly downstream stage of V2 and V3, so the output data of V2 and V3 can be input into V4 for processing. One concern addressed by the present disclosure is how much concurrency, how many resources, and so on a downstream stage needs to configure in order to process the output data of its upstream stage; for example, how much concurrency and which resources V2 and V3 need to process the output data of V1, and how much concurrency and which resources V4 needs to process the output data of V2 and V3.
In step S212, Statistics (statistical information) of the output data of the upstream stage are acquired during execution of the job.
During job execution, after any stage finishes executing, the master node can collect Statistics (statistical information) of the output data of that stage's worker nodes. Accordingly, after an upstream stage of the execution plan completes its execution, the master node may collect statistics of the output data of the upstream stage. In some embodiments, the statistics of the output data of an upstream stage may include any of: the data amount of the output data (for example, the data amounts before and after data compression), data amount distribution information of the Partitions (data partitions) of the output data, the number of Records (serialized data records) in each Partition of the output data, and so on.
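A sketch of what such per-stage statistics might look like is shown below; the field names are illustrative assumptions and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StageOutputStatistics:
    """Statistics collected by the master node after an upstream stage finishes executing."""
    bytes_before_compression: int                               # total output data amount before compression
    bytes_after_compression: int                                # total output data amount after compression
    partition_sizes: List[int] = field(default_factory=list)    # data amount of each Partition
    partition_records: List[int] = field(default_factory=list)  # Record count of each Partition
```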
In step S213, the directly downstream stage is configured according to the statistics so that the directly downstream stage executes the job based on the configuration result.
Based on the statistical information of the output data of the upstream stage, the master node can configure the directly downstream stage so that its configuration adapts to the actual execution result of the upstream stage, and the concurrency, resources, and other settings of the downstream stage match the specific execution situation of the job; the directly downstream stage can then execute the job based on the configuration result, so that its tasks can be executed and completed reasonably and efficiently.
In the job execution process, in the manner provided by the embodiment of the present application, the direct downstream stage of the upstream stage is dynamically configured, so that the execution plan can dynamically adjust the configuration in the job execution process, and the configuration of the execution plan can be adapted to the specific execution situation of the job. That is to say, in the process of executing the job, the embodiment of the present application may dynamically adjust the distributed job based on the output data and the statistical information of the upstream stage, so as to achieve the effect of dynamically adjusting the job. And then the specific execution of the operation is realized based on the dynamically configured execution plan, so that the operation execution can be completed more reasonably and efficiently by the distributed system, and the operation execution performance of the distributed system is obviously improved.
In some embodiments, configuring a directly downstream stage based on statistics of the output data of the upstream stage may include any of:
according to the statistical information, configuring concurrency for a direct downstream stage;
according to the statistical information, allocating Partition for the direct downstream stage; for example, the Partition output by the upstream stage is allocated to the work node of the direct downstream stage, or, in a Join (connection) scenario, the Partition output by the upstream stage is split and then allocated to the direct downstream stage, so as to perform Join operation on the work node of the direct downstream stage;
selecting the directly downstream stage to be executed subsequently according to the statistical information; in the execution plan, multiple candidate execution paths can be configured downstream of the upstream stage, so that during the specific execution of the job the actually executed path is selected from the multiple execution paths based on the execution result of the upstream stage, making the execution logic of the execution plan more accurate and reasonable. In this case, the stages on one execution path include at least a stage directly downstream of the upstream stage; by selecting among the multiple candidate execution paths downstream of the upstream stage, the execution logic can be dynamically adjusted and the directly downstream stage to be executed subsequently can be selected.
The above description of the case where the directly downstream stage is configured according to the statistical information will be separately described, and will not be expanded here.
In the distributed job adjustment method provided by the embodiment of the present application, after acquiring a job submitted by a user, the master node can generate an execution plan of the job, where the execution plan includes multiple stages, including an upstream stage and a stage directly downstream of the upstream stage. During execution of the job, the master node can obtain statistical information of the output data of the upstream stage and configure the directly downstream stage according to that information. Because the configuration of the downstream stage is dynamically adjusted during job execution based on the data actually output by the upstream stage, the configuration of the downstream stage adapts to the actual execution result of the upstream stage, its concurrency, resources, and other settings match the specific execution situation of the job, and the reasonableness and accuracy of the execution plan configuration improve. The distributed job adjustment method provided by the embodiment of the present application can therefore dynamically adjust the configuration of the execution plan during job execution, achieving dynamic adjustment of the job configuration so that the execution plan conforms to how the job actually executes; the job is then executed based on the dynamically configured execution plan, so that the distributed system completes job execution more reasonably and efficiently, and the job execution performance of the distributed system can be significantly improved.
The following describes an implementation scheme for configuring the concurrency of the direct downstream stage based on the statistical information of the output data of the upstream stage.
For a static execution plan, the concurrency of each stage is determined by the master node through estimation rules. For example, after the job is submitted, the master node may use an estimation rule to configure the concurrency of each stage of the execution plan (which may be regarded as configuring the concurrency of each vertex in the DAG) according to the total amount of source data of the job, or configure the concurrency of each stage based on user-specified concurrency for different kinds of stages. However, the source data processed by distributed jobs is complex and varied, and it is often difficult for the master node to rely on estimation rules to configure a concurrency suited to different jobs, which leads to inaccurate concurrency configuration for the stages of the execution plan. For example, for a stage that processes a small amount of data, statically configuring a large concurrency wastes the computing resources of the distributed system; for a stage that processes a large amount of data, statically configuring a small concurrency lengthens the stage's execution time and may even cause errors such as memory overruns, so that job execution fails. Therefore, with statically configured concurrency, to avoid the risk that a stage's concurrency is too low to process massive data, stages often need to be configured with high concurrency, which wastes considerable computing resources during actual job execution. For example, for a Map-Reduce job, even if the upstream Map stage produces only 1 KB of output data at run time, the downstream Reduce stage still schedules a large number of worker nodes to process that 1 KB of data because of its statically configured high concurrency, which clearly causes unnecessary waste of computing resources.
Because of this, it is particularly necessary to configure the concurrency of a stage immediately downstream of an upstream stage based on the output data of the upstream stage during job execution. In some embodiments, for a stage immediately downstream of an upstream stage, the master node may allocate the Partitions (data partitions) in the output data of the upstream stage to the worker nodes of the immediately downstream stage on the principle of a uniform Partition count. For example, each worker node of the immediately downstream stage may process a consecutive run of Partitions output by the upstream stage, with every worker node processing the same number of Partitions. This approach may be called an Even-Reduction strategy based on Partition count; it works well when the data amount of every Partition is even. In practice, however, data distributions are varied and often uneven. For uneven data (Partitions with uneven data amounts), this approach can cause data skew on individual worker nodes of the immediately downstream stage, i.e., the amount of data processed by a single worker node is much larger than that of the other worker nodes, which in turn creates a long-tail worker node and unnecessarily lengthens job execution time. For example, FIG. 3A illustrates how processing data may be allocated to an immediately downstream stage. As shown in FIG. 3A, the data amounts of the Partitions output by the upstream stage are uneven (the values in the Partitions represent the data amount of each Partition). If a fixed number of Partitions is simply merged onto each worker node of the immediately downstream stage, then even though each worker node processes the same number of Partitions (for example, each worker node in FIG. 3A processes 2 Partitions), the data processing amounts of the worker nodes of the immediately downstream stage become unevenly distributed: on the one hand, some worker nodes process unreasonably little data (some may even have almost nothing to process); on the other hand, merging Partitions that already carry large data amounts further aggravates the data skew and lengthens the running time of those worker nodes, forming a long tail.
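For contrast with the adaptive scheme described later, a count-based Even-Reduction split might look like the sketch below (the partition sizes are made-up numbers, and the function is an illustrative assumption rather than the patent's implementation); note how one worker can end up with far more data than the others even though every worker gets the same number of Partitions.

```python
def even_reduction(partition_sizes, partitions_per_worker=2):
    """Assign a fixed number of consecutive Partitions to each worker, ignoring data size."""
    workers = []
    for i in range(0, len(partition_sizes), partitions_per_worker):
        chunk = partition_sizes[i:i + partitions_per_worker]
        workers.append({"partitions": list(range(i, i + len(chunk))), "bytes": sum(chunk)})
    return workers

# Hypothetical skewed output of an upstream stage (data amount per Partition).
sizes = [1, 2, 30, 1, 2, 1, 25, 3]
for w in even_reduction(sizes):
    print(w)  # two workers carry roughly 10x the data of the others -> data skew / long tail
```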
In addition, although the number of Partitions is, on the whole, positively correlated with the time required for actual computation, some data types (e.g., integer types) compress very well, and the output data files may be heavily compressed. In that case, a small number of Partitions may correspond to a very large number of data records (i.e., many data records are packed into few Partitions), so configuring the concurrency of the immediately downstream stage simply from the Partition count of the upstream stage introduces large uncertainty into actual operation.
Based on the above situation, the embodiment of the present application provides a scheme for performing adaptive dynamic adjustment on the concurrency of the direct downstream stage based on the statistical information of the output data of the upstream stage in the job execution process, and ensures that the data processing amount of each working node in the direct downstream stage approaches to equilibrium. As an alternative implementation, fig. 3B illustrates another flowchart of a distributed job adjustment method provided in an embodiment of the present application. The method flow may be implemented by the master node, and referring to fig. 3B, the method flow may include the following steps.
In step S310, a job submitted by a user is acquired.
In step S311, an execution plan of the job is generated, the execution plan including a plurality of stages including an upstream stage and a stage immediately downstream of the upstream stage.
In step S312, at least the stage initially executed in the plurality of stages is configured with a concurrency degree.
The master node may initially configure the concurrency of the execution plan. In some embodiments, the master node may perform initialization configuration on the concurrency of the execution plan according to the total amount of source data of the job by using an estimation rule; alternatively, the master node may initially configure the concurrency of the execution plan based on the user-specified concurrency of different kinds of stages.
In the embodiment of the present application, the master node can configure the concurrency of a directly downstream stage based on the data volume output by its upstream stage. Accordingly, since the stage executed first in the execution plan has no upstream stage, when the master node initially configures the concurrency of the execution plan it should at least initialize the concurrency of the first executed stage, so that the first executed stage can execute the job with the configured concurrency. A stage that is not executed first generally has an upstream stage; such a stage can be treated as the directly downstream stage of one or more upstream stages and have its concurrency configured based on the statistical information of the output data of those upstream stages.
Because a stage that is not executed first can have its concurrency dynamically adjusted during job execution, the embodiment of the present application does not necessarily need to initialize the concurrency of such stages. Of course, initializing their concurrency is also supported, and the embodiment of the present application places no limitation on this. In one example, as shown in FIG. 1C, V1, as the first executed stage, needs an initial concurrency configuration, while V2, V3, and V4 all have upstream stages, so the embodiment of the present application does not require their concurrency to be initialized.
In step S313, during execution of the job, data amount distribution information of output data of an upstream stage is determined, the data amount distribution information including data amounts of a plurality of data partitions corresponding to the output data.
During the process of executing the job, the master node may schedule the working node to process the input data of the upstream stage according to the concurrency of the upstream stage, so that the working node scheduled by the upstream stage may generate the output data of the upstream stage after processing the input data. The output data of an upstream stage may be partitioned into multiple data partitions (partitions). The data amount of the data partition of the output data may be used as data amount distribution information of the output data, that is, the data amount distribution information of the output data may indicate the data amount distributed by each data partition of the output data, and the data amount distribution information may be carried in the statistical information of the output data.
In this embodiment, the data volume distribution information of the output data of the upstream stage may be collected by the master node.
In some embodiments, the upstream stage referred to in step S313 may be a stage executed first, or may be a non-stage executed first. It should be noted that, if the upstream stage is the stage executed first, the upstream stage may schedule the work node to process data based on the concurrency of the initialization configuration; if the upstream stage is a stage that is not executed first, the upstream stage may schedule the working node to process the data after dynamically adjusting the concurrency based on the scheme provided by the embodiment of the present application.
In step S314, the processing Data amount allocated to the work node of the immediate downstream stage is determined based on the Ideal Data amount (Ideal Data Size) corresponding to the work node of the immediate downstream stage and the Data amount distribution information.
After the master node collects data volume distribution information of the output data of the upstream stage, the master node may allocate a processing data volume to a work node of the immediate downstream stage based on the data volume distribution information. According to the embodiment of the application, after the work node of the direct downstream stage distributes the processing data volume, the concurrency of the direct downstream stage can be automatically configured.
In some embodiments, an ideal data amount corresponding to a worker node of the immediately downstream stage may be preset. If the amount of data a worker node must process exceeds the ideal data amount, the worker node's processing load is exceeded and the possibility of execution failure increases; therefore, when allocating processing data to a worker node of the immediately downstream stage, the allocated data amount should approach but not exceed the ideal data amount.
Based on the data volume of the data partition of the upstream stage expressed in the data volume distribution information, when the master node allocates the processing data volume to the working node of the direct downstream stage, the master node may determine the processing data volume allocated to each working node of the direct downstream stage based on the ideal data volume and the data volume distribution information, and make the processing data volume allocated to the working node not exceed the ideal data volume.
In some embodiments, the master node may allocate a data partition whose continuous and total amount of data output by an upstream stage does not exceed the ideal amount of data to one worker node of an immediately downstream stage, thereby enabling the worker node of the immediately downstream stage to allocate to the data partition whose continuous and total amount of data does not exceed the ideal amount of data. In some further embodiments, the master node may ensure that the processing data amount allocated by each direct downstream stage is as balanced as possible on the basis of allocating a plurality of data partitions, which are continuous and whose total data amount does not exceed the ideal data amount, to one working node of the direct downstream stage.
In one example, FIG. 3C illustrates an example of allocating data Partitions to an immediately downstream stage. As shown in FIG. 3C, the upstream stage outputs a number of Partitions, whose data amounts are indicated in FIG. 3C. When allocating processing data to the immediately downstream stage, the master node assigns to each worker node of the immediately downstream stage a run of Partitions whose total data amount does not exceed the ideal data amount, while keeping the processing data amounts of the worker nodes close to evenly distributed. For example, the 4 worker nodes of the immediately downstream stage are allocated processing data amounts of 15, 16, 19, and 10, respectively. Once the processing data amounts have been allocated to the worker nodes of the immediately downstream stage, the master node has effectively configured the concurrency of that stage; for example, after the allocation in FIG. 3C, the master node can determine that 4 worker nodes (a concurrency of 4) are required to process the output data of the upstream stage. Configuring the concurrency in the manner provided by the embodiment of the present application avoids situations in which a worker node's processing data amount, though below the ideal data amount, is far higher or far lower than that of the other worker nodes, so that the processing data amounts of the worker nodes remain balanced and reasonable.
In step S315, the concurrency of the direct downstream stage is configured according to the number of work nodes that allocate the processing data amount in the direct downstream stage.
After determining the processing data amount allocated by the working node of the direct downstream stage, the host node may complete the concurrency configuration of the direct downstream stage based on the number of the working nodes allocated with the processing data amount.
According to the distributed job adjustment method provided by the embodiment of the present application, after the execution plan of a job is generated, at least the first executed stage of the execution plan is given an initial concurrency configuration so that the job can actually start executing. During job execution, the master node can determine the data amount distribution information of the output data of an upstream stage, which expresses the data amounts of the multiple data Partitions of that output data; the master node then determines the processing data amount allocated to each worker node of the immediately downstream stage based on the ideal data amount for those worker nodes and the data amount distribution information, and configures the concurrency of the immediately downstream stage according to the number of worker nodes that were allocated processing data. Each worker node of the immediately downstream stage therefore executes the job with a data amount that does not exceed the ideal data amount, reducing the probability of execution failure caused by an excessive processing data amount on a single worker node. In short, during job execution the embodiment of the present application allocates processing data to the worker nodes of the immediately downstream stage based on the data amount distribution of the upstream stage's output and the ideal data amount per worker node, and automatically configures the concurrency of the immediately downstream stage. This achieves dynamic, adaptive adjustment of the downstream concurrency based on the actual output of the upstream stage, improves the accuracy and reasonableness of the concurrency configuration, greatly reduces unreasonable concurrency settings, avoids wasting computing resources, and can significantly improve the performance of the distributed system.
The scheme provided by the embodiment of the present application may be called a Fair-Parallelism strategy based on Partition data amounts. As an optional implementation, take an upstream stage that performs a data shuffle (the process in distributed job execution of computing a hash value for each data record with a specific hash function and routing the record to the corresponding worker node). The master node may obtain the data amount distribution information (the data amount of each Partition) of the Partitions output by the worker nodes of the upstream stage after the shuffle. The master node then allocates processing data to the worker nodes of the immediately downstream stage by aggregating Partitions, for example by assigning to one worker node a run of consecutive Partitions whose total data amount does not exceed the ideal data amount. During allocation, once the total data amount of the Partitions assigned to one worker node reaches the ideal data amount, the subsequent consecutive Partitions (again up to the ideal data amount) are automatically assigned to the next worker node, and so on until all Partitions output by the upstream stage have been allocated. The actual concurrency of the immediately downstream stage is determined automatically once all Partitions of the upstream stage have been allocated, i.e., it equals the number of worker nodes of the immediately downstream stage that were allocated processing data. In this way, the Partitions output by the upstream stage are distributed to the worker nodes of the immediately downstream stage as evenly as possible, the concurrency of the immediately downstream stage is configured more accurately and reasonably, and the long-tail problem of individual worker nodes is avoided.
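The packing loop described above can be sketched roughly as follows; the ideal data amount, the per-worker Partition cap, and the partition sizes are illustrative assumptions, and the code is a minimal sketch of the strategy rather than the patent's implementation.

```python
def fair_parallelism(partition_sizes, ideal_bytes, max_partitions_per_worker=None):
    """Assign consecutive Partitions to workers so that each worker's total data amount
    approaches but does not exceed ideal_bytes; the number of workers produced is the
    dynamically determined concurrency of the immediately downstream stage."""
    workers, current, current_bytes = [], [], 0
    for pid, size in enumerate(partition_sizes):
        over_size = current and current_bytes + size > ideal_bytes
        over_count = (max_partitions_per_worker is not None
                      and len(current) >= max_partitions_per_worker)
        if over_size or over_count:
            workers.append({"partitions": current, "bytes": current_bytes})
            current, current_bytes = [], 0
        current.append(pid)
        current_bytes += size
    if current:
        workers.append({"partitions": current, "bytes": current_bytes})
    # Note: a single Partition larger than ideal_bytes still gets its own worker here;
    # splitting such oversized Partitions is handled by a separate scheme described later.
    return workers

# Hypothetical upstream output; with ideal_bytes=20 the downstream concurrency comes out automatically.
sizes = [1, 2, 30, 1, 2, 1, 25, 3]
assignment = fair_parallelism(sizes, ideal_bytes=20, max_partitions_per_worker=4)
print(len(assignment))   # concurrency of the immediately downstream stage
for w in assignment:
    print(w)             # each worker's Partition run and total data amount
```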
It should be noted that, after the step S315 configures the concurrency degree for the direct downstream stage, if the direct downstream stage further inputs other downstream stages through the connection edge (that is, the direct downstream stage also has a downstream stage), the direct downstream stage currently configured with the concurrency degree in the step S315 may become an upstream stage, and based on the scheme provided in the embodiment of the present application, the concurrency degree continues to be adjusted for the subsequent downstream stages until each stage of the execution plan completes the adjustment of the concurrency degree in the job execution process.
In further embodiments, the embodiments of the present application may be further improved through a technical improvement to reduce the occurrence of data skew in a working node of a stage directly downstream, so as to meet various data characteristic requirements of job execution, specifically:
when the data volumes of a plurality of consecutive partitions are all very small, the sum of the data volumes of the plurality of partitions is less than the ideal data volume, in an extreme case, a large number of partitions may be allocated to a single working node for processing, and if the working nodes of the immediate downstream stage sequentially operate the large number of partitions, the read-write performance may be backed. Based on this, the embodiment of the application can set the upper limit of the Partition quantity allowed to be allocated by a single working node, so that the upper limit of the Partition quantity added into the working node is controlled on the premise that the data volume processed by the working node is avoided to be zero.
When the correlation between the operation time of a working node and the size of the processed data is small, and the correlation is related to other characteristics (such as the number of data records) of the data, if the processing data is distributed to the working node according to the size of the data volume of the Partition, the Partition with a small data volume but a large number of data records can be distributed to a single working node, so that the operation time of the working node is prolonged, and the bottleneck of the operation execution is caused; in addition, the complexity of calculation needs to be considered by considering the unit data records involved by the working nodes, which may be related to the number of operators of the working nodes, the characteristics of the operators and the like, and the information also needs to be considered as the calculation concurrency. Based on this, besides using the data size of the Partition as a reference to allocate the processing data amount of the working node, the embodiment of the present application may further combine the Record quantity of the Partition, the operator number of the working node, the operator complexity, and other features, perform secondary adjustment on the processing data amount allocated to the working node, and select the result with a large concurrency obtained in these several dimensions, thereby completing the configuration of the final concurrency of the direct downstream stage. Based on this, in an optional implementation of step S315, in this embodiment of the application, the final concurrency configuration of the direct downstream stage may be further completed according to the number of the work nodes that allocate the processing data volume in the direct downstream stage, the Record number of the Partition allocated to the work nodes, the operator number of the work nodes, and the operator complexity.
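A rough sketch of this secondary adjustment is shown below. The record-based and operator-based heuristics (and all parameter names) are assumptions made for illustration; the patent specifies only that several dimensions are combined and the largest concurrency is kept.

```python
import math

def final_concurrency(bytes_based_workers, total_records, ideal_records_per_worker,
                      operator_count, operator_complexity_factor):
    """Combine several dimensions and keep the largest concurrency estimate."""
    by_bytes = bytes_based_workers                                   # from the Fair-Parallelism packing
    by_records = math.ceil(total_records / ideal_records_per_worker) # assumed record-based heuristic
    by_operators = math.ceil(by_bytes * operator_count * operator_complexity_factor)  # assumed heuristic
    return max(by_bytes, by_records, by_operators)

# Example: packing by data amount suggested 4 workers, but the Record count suggests 6.
print(final_concurrency(4, total_records=6_000_000, ideal_records_per_worker=1_000_000,
                        operator_count=1, operator_complexity_factor=1.0))  # -> 6
```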
In this way, when the concurrency of job execution is dynamically adjusted, the execution results can be counted and evaluated against data characteristics of multiple dimensions, offering more optimization directions for the dynamic execution mode of a distributed system engine. Applied in a real production environment, the embodiments of the present application can reduce the overall execution concurrency of production jobs by orders of magnitude, markedly improve the efficiency of working nodes in the distributed system, and to a large extent avoid wasted computing resources and the cost of frequently scheduling and launching working nodes. Compared with a simple and direct Even-Reduction strategy, the Fair-Parallelism strategy provided by the embodiments of the present application effectively avoids the severe data skew that Even-Reduction may introduce, so that the data volumes processed by all working nodes of a distributed job are distributed as evenly as possible, without prominent long tails or stragglers; the long tail is thus prevented from becoming the bottleneck of running time, and resource waste caused by repeatedly launching working nodes that each process only a small amount of data is also avoided. Further, by limiting the number of consecutive Partitions that may be merged when combining Partitions with small data volumes, excessive merging is prevented from degrading the data-reading performance of the downstream stage. Furthermore, based on the detailed statistics (including data volume and Record count) produced by each working node at run time, the Record count, operator count, operator complexity and other information can be combined to make a more balanced and complete concurrency adjustment. The performance of the distributed system can therefore be improved significantly.
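As a rough illustration of the packing idea behind the Fair-Parallelism strategy described above, the following sketch greedily packs consecutive partitions into workers subject to three assumed limits, ideal_bytes, ideal_records and max_partitions_per_worker; enforcing all three at once is what yields the largest concurrency among the individual dimensions. The limits and names are assumptions for illustration, not values fixed by this description.

def pack_partitions(partition_bytes, partition_records,
                    ideal_bytes, ideal_records, max_partitions_per_worker):
    # Greedily pack consecutive partitions into one worker until any of the
    # three limits (bytes, records, partition count) would be exceeded.
    workers, current = [], []
    cur_bytes = cur_records = 0
    for size, records in zip(partition_bytes, partition_records):
        over = (current and
                (cur_bytes + size > ideal_bytes or
                 cur_records + records > ideal_records or
                 len(current) >= max_partitions_per_worker))
        if over:
            workers.append(current)
            current, cur_bytes, cur_records = [], 0, 0
        current.append((size, records))
        cur_bytes += size
        cur_records += records
    if current:
        workers.append(current)
    return workers    # concurrency of the downstream stage == len(workers)

# 100 near-empty partitions are not all funneled into only two workers:
plan = pack_partitions([1] * 100, [1000] * 100,
                       ideal_bytes=64, ideal_records=20000,
                       max_partitions_per_worker=16)
print(len(plan))   # 7 workers, instead of the 2 a bytes-only rule would give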
Based on the statistical information of the output data of the upstream stage, the embodiment of the application provides a scheme for dynamically merging a plurality of partitions. Meanwhile, for the Partition with larger data size, the embodiment of the application also provides a Partition splitting scheme. The specific implementation is described in detail below.
Regarding data arrangement and shuffle: effective data arrangement and shuffle are important preconditions for the horizontal scalability of a distributed system. For Map-Reduce execution stages, efficient data shuffle has always been one of the key performance indicators of distributed job execution. However, a fully connected data shuffle executes efficiently only in the ideal scenario where data is uniformly distributed. In actual production jobs, data distribution is often uneven, and the skew characteristics of the data may be further amplified in full-shuffle mode, significantly stretching the running time of individual working nodes and causing a severe long tail.
It should be noted that the concept of shuffle comes from the Map-Reduce computation model; although modern job execution frameworks have evolved to use more general DAG-based descriptions, the Map-Reduce mode of operation is still an important way to describe each sub-graph of a DAG. In many more complex DAG topologies, the data flow on each connection edge can likewise be described by various shuffle models. For example, in a distributed execution framework, one important physical property of a DAG connection edge is how data is transmitted on that edge: the transmission is not limited to full shuffle, and a more dynamic and intelligent data arrangement mode may be introduced instead, which makes it possible to resolve the problems full shuffle faces in many practical scenarios.
Take dynamic partition writing as an example: partition tables are used extremely widely in a typical distributed data warehouse system. When writing data into a partition table, there are generally two modes, static partition writing and dynamic partition writing. When the value of the partition to be written can be specified in advance, statically writing into the specified partition is the simpler approach; when the Partition value cannot be determined in advance, and in particular when the data produced by one query is distributed over several partitions, Dynamic Partition writing is used, that is, the partition value of each record is obtained from the data actually generated at run time. For example, in the following SQL (Structured Query Language) statement, the partition column, country, is specified when writing out, but the specific partitions to which the data is written, and the corresponding partition values, are only known at run time:
INSERT OVERWRITE TABLE Partitioned_sales PARTITION(country)
SELECT country, item, price FROM sales;
In a distributed system, because the number of partitions is not known in advance and data distribution characteristics are diverse and uneven, implementing general and efficient dynamic partition writing has always been a challenging problem. A distributed system needs to avoid severe long tails in such jobs while also not burdening the storage system with a large number of small files.
In some embodiments, the master node may implement dynamic partitioning through an execution plan consisting of a single Map stage. In this implementation, the execution plan generated by the master node contains only one Map stage, whose task can be executed by one or more working nodes. After reading the data, each working node of the Map stage writes out files according to the partition value (country in the above example); that is, if a working node processes data corresponding to different country values, it produces one file under each corresponding path. An execution plan with a single Map stage is simple and intuitive, but it brings various hidden dangers to an actual large-scale distributed system, the most prominent of which is the proliferation of small files. Suppose the concurrency of the Map stage is M and the number of distinct country values in the user data is N. With randomly distributed data, since every working node of the Map stage writes its output independently, each partition may end up with M files, for M * N data files in total, many of which may be very small; for example, with M = 1,000 Map workers and N = 500 country values, up to 500,000 files may be produced (a brief illustrative sketch of such a writer is given after the list below). These fragmented small files have a strongly negative impact on a distributed system:
For the storage layer of a distributed system, managing a huge number of small files consumes enormous amounts of Meta information; in extreme cases it can even overwhelm the master node of the whole distributed system and render the system unavailable. In terms of storage efficiency, fragmented small files also yield a poorer compression ratio and occupy more storage space;
Data always needs to be processed after it is generated; once an upstream stage outputs small files, the computation cost of the downstream stage increases while the effective data-reading efficiency drops;
For the Map stage itself, writing out M * N small files means creating M * N writers, each of which keeps a certain memory Buffer for encoding, compression and so on; if the Buffer is large, the memory used by the Map stage becomes excessive, and if the Buffer is small, the encoding and compression effects deteriorate.
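To make the M * N blow-up concrete, the following minimal sketch shows a naive dynamic-partition writer of the single-Map-stage kind discussed above; the directory layout and file naming are purely illustrative assumptions.

import csv, os

def naive_map_worker(worker_id, rows, out_dir):
    # Each Map worker writes its own file per partition value it happens to see,
    # so M workers and N distinct partition values can leave up to M * N files.
    open_files = {}
    for country, item, price in rows:
        if country not in open_files:
            part_dir = os.path.join(out_dir, f"country={country}")
            os.makedirs(part_dir, exist_ok=True)
            f = open(os.path.join(part_dir, f"part-{worker_id}.csv"), "w", newline="")
            open_files[country] = (f, csv.writer(f))
        open_files[country][1].writerow([item, price])
    for f, _ in open_files.values():
        f.close()

naive_map_worker(0, [("CN", "pen", 3.5), ("US", "cup", 2.0)], "/tmp/partitioned_sales")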
In other embodiments, the master node may implement dynamic partitioning with a fully connected Reshuffle. To avoid the small-file problem described above, after the input data of an execution stage is generated, the master node may first perform a fully connected Reshuffle by the Partition Key, that is, aggregate the data of the same partition onto one working node before writing it out. This guarantees that only one file is produced per partition value. However, while this mandatory constraint reduces the number of files, it brings another negative effect: data skew.
It should be noted that the distribution of the data produced in an execution stage varies widely, and the data distribution of a dynamic partition cannot be obtained before the Query is executed. For example, in the SQL example above, if the user data of some platform is partitioned by country, the data in the partitions of a few countries is bound to be far larger than that of the other partitions, so the working nodes handling those partitions suffer a severe long tail and the running time of the whole job is greatly prolonged. For heavily skewed data, such a long tail may slow the job down by a factor of hundreds or even thousands, which also has a very bad impact on the resource utilization of the entire distributed system.
To alleviate this problem, the Reshuffle implementation can be optimized by additionally introducing a Random Shuffle Key. For example, by appending a Reshuffle key drawn from [0, 9], the data is randomly scattered over 10 partitions to reduce its skew. However, such a solution still has problems: under severe skew, even data cut into 10 parts remains skewed, merely turning one long tail into 10 slightly lighter ones; for data that is not skewed, or for partitions that are already small, cutting the data into 10 parts multiplies the number of final files (10 * N); and forcibly adding a Random-Key shuffle may destroy the idempotence of the data, so that if a working node is rerun in the distributed system there is a risk of producing incorrect data.
Based on the above situation, the embodiment of the present application performs intelligent dynamic data arrangement according to the data distribution situation generated by an upstream stage (e.g., a Map stage) in real time, and solves the problem of small files while ensuring the written data balance, thereby overcoming the disadvantages of the two schemes.
First, the shuffle manner is briefly described with reference to fig. 4A (for ease of understanding, details of shuffling by modulo of the Reduce concurrency are omitted here); fig. 4A shows an exemplary diagram of a data shuffle. As shown in fig. 4A, boxes drawn with the same line style represent data with the same Partition Key in the shuffle data produced by the Map stage; for example, the thin dashed, thin solid, thick dashed and thick solid boxes each represent data of one Partition Key, and the value in each box represents the corresponding data volume. After the shuffle, data with the same Partition Key is handed to the same working node of the Reduce stage for processing. As can be seen from fig. 4A, the working node R#0 of the Reduce stage processes a data volume of 8, R#1 processes 2, R#2 processes 43, and R#3 processes 16; the data processed by R#2 is severely skewed, so R#2 suffers a very severe long tail when executing its Reduce task.
For scenarios such as dynamic partition insertion, data of the same partition only needs to be written into the same directory; it does not need to be written into the same data file. Therefore, during the shuffle, it is not necessary to ensure that all data with the same Partition Key is handed to the same Reduce working node. Based on this characteristic, a Partition with a large data volume produced by the Map stage can be split automatically and handed to multiple working nodes of the Reduce stage for processing.
As an alternative implementation, fig. 4B illustrates a further flowchart of the distributed job adjustment method provided in this embodiment of the present application. The method flow may be implemented by the master node, and referring to fig. 4B, the method flow may include the following steps.
In step S410, a job submitted by the user is acquired.
In step S411, an execution plan of the job is generated, the execution plan including a plurality of stages including an upstream stage and a stage immediately downstream of the upstream stage.
In some embodiments, the upstream stage is, for example, a Map stage. The immediate downstream stage is for example a Reduce stage.
In step S412, during execution of the job, Statistics of the output data of the upstream stage are acquired, the Statistics including data-volume distribution information of the Partitions of the output data.
In some embodiments, the working node(s) of the upstream stage may report Statistics of their output data to the master node during execution of their tasks and after the tasks finish. The Statistics may include data-volume distribution information of the Partitions of the output data, which indicates the data volume of each Partition of the output data. In some further embodiments, the Statistics may additionally include the total output data volume (for example, before and after compression) and the Record count of each Partition of the output data.
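As a purely illustrative data model (the field names are assumptions, not part of this description), the per-worker report and its aggregation at the master node might look as follows.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ShuffleStatistics:
    # Reported by each worker of the upstream stage during and after execution.
    bytes_before_compression: int = 0
    bytes_after_compression: int = 0
    partition_bytes: Dict[int, int] = field(default_factory=dict)    # partition id -> bytes
    partition_records: Dict[int, int] = field(default_factory=dict)  # partition id -> record count

def merge_reports(reports):
    # The master node aggregates the reports of all upstream workers to obtain
    # the data-volume distribution over Partitions of the whole upstream stage.
    total = ShuffleStatistics()
    for r in reports:
        total.bytes_before_compression += r.bytes_before_compression
        total.bytes_after_compression += r.bytes_after_compression
        for p, b in r.partition_bytes.items():
            total.partition_bytes[p] = total.partition_bytes.get(p, 0) + b
        for p, n in r.partition_records.items():
            total.partition_records[p] = total.partition_records.get(p, 0) + n
    return total

r1 = ShuffleStatistics(100, 40, {0: 60, 1: 40}, {0: 6, 1: 4})
r2 = ShuffleStatistics(80, 30, {0: 20, 2: 60}, {0: 2, 2: 9})
print(merge_reports([r1, r2]).partition_bytes)   # {0: 80, 1: 40, 2: 60}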
In step S413, according to the ideal data amount, Partition whose data amount is larger than the ideal data amount in the output data of the upstream stage is split.
In the embodiments of the present application, the ideal data volume of a Partition handled by a working node of the direct downstream stage can be preset; note that when different execution stages act as the direct downstream stage, the ideal data volume that is set may differ. After acquiring the Statistics of the upstream stage's output data, the master node can determine the data volume of each Partition of the output data, and thus judge, against the ideal data volume, whether any Partition of the output data exceeds it; each Partition whose data volume is larger than the ideal data volume is split, such that the data volume of every split Partition is not larger than the ideal data volume.
In some embodiments, according to the ideal data amount, the Partition whose data amount is greater than the ideal data amount in the output data may be split, so that the split Partition has a data amount that is not greater than the ideal data amount and is approximately uniformly distributed.
In step S414, the split Partition is assigned to the immediate downstream stage, and one split Partition is configured to be executed by one worker node of the immediate downstream stage.
After the Partition of the output data is split, the split Partition can be allocated to a direct downstream stage, and one split Partition is configured to be executed by one working node of the direct downstream stage, so that the data to be processed is configured for the direct downstream stage.
In some further embodiments, if the output data contains Partitions whose data volume is smaller than the ideal data volume, the embodiments of the present application may merge at least two such Partitions whose combined data volume does not exceed the ideal data volume (each of the merged Partitions being individually smaller than the ideal data volume), allocate the merged Partition to the direct downstream stage, and configure one merged Partition to be executed by one working node of the direct downstream stage, thereby making efficient use of the computing resources of the direct downstream stage.
In the embodiments of the present application, the Statistics of the upstream stage's output data are acquired during job execution, so that, based on the per-Partition data volumes indicated by the Statistics, Partitions whose data volume exceeds the ideal data volume are split such that no split Partition exceeds the ideal data volume. The split Partitions are then allocated to the direct downstream stage, with one split Partition executed by one working node, thereby configuring the data to be processed for the working nodes of the direct downstream stage. Because the Partitions configured for the direct downstream stage are dynamically adjusted according to the Statistics of the upstream output, and the data volume handled by each working node of the direct downstream stage does not exceed the ideal data volume, the data processed by the working nodes of the direct downstream stage approaches a uniform distribution; data skew in the direct downstream stage is reduced, and individual working nodes are no longer forced to process oversized Partitions that would greatly prolong their running time. The data-skew and long-tail problems of the direct downstream stage are therefore alleviated, and the performance of the distributed system is significantly improved.
In some embodiments, for the Map-Reduce case, Map may be regarded as the upstream stage and Reduce as the direct downstream stage of Map. Each working node of the Map stage can report Statistics of its output shuffle data to the execution engine (which may be the master node) both while running and after it finishes; for example, the Map stage may have one or more working nodes executing its task, and each reports the Statistics of its output shuffle data during and after execution. In some embodiments, the Statistics of the shuffle data are, for example, the data volume of the shuffle data before and after compression, the data volume of each Partition of the shuffle data, the Record count contained in each Partition, and so on.
In some embodiments, an ideal data volume corresponding to one Partition of the Reduce stage may be defined. For any Partition output by the Map stage, if its data volume is larger than the ideal data volume, the Partition can be split according to the ideal data volume so that the data volume processed by each working node of the Reduce stage is as uniform as possible; if its data volume is smaller than the ideal data volume, it can be merged with other Partitions that are also smaller than the ideal data volume, provided the merged data volume does not exceed the ideal data volume.
In one implementation example, fig. 4C shows an exemplary diagram of Partition splitting performed by an embodiment of the present application. As shown in fig. 4C, the 4 pieces of shuffle data output by the Map stage are divided into 4 Partitions (P#0, P#1, P#2 and P#3) by aggregating data drawn with the same line style into one Partition: the data corresponding to the thin dashed line is aggregated into P#0, the thin solid line into P#1, the thick dashed line into P#2, and the thick solid line into P#3. The data volumes of the 4 Partitions are 8, 2, 43 and 16 in that order. Assuming the ideal data volume for one Partition of the Reduce stage is defined as 10, the data volumes of P#0 and P#1 are both less than 10 and their combined volume does not exceed 10, so P#0 and P#1 can be merged into one Partition and handed to the working node R#0 of the Reduce stage for processing. Since the data volume 43 of P#2 exceeds the ideal data volume 10, P#2 is split as evenly as possible into multiple Partitions, none exceeding the ideal data volume: as shown in fig. 4C, P#2 is split into 5 Partitions and allocated to the 5 working nodes R#1 to R#5 of the Reduce stage, with R#1 to R#4 each receiving a Partition of data volume 9 and R#5 receiving a Partition of data volume 7. Similarly, the data volume of P#3 exceeds the ideal data volume, so it is split evenly into 2 Partitions and allocated to the 2 working nodes R#6 and R#7 of the Reduce stage for processing.
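The split-and-merge rule of fig. 4C can be sketched in a few lines; the helper below is illustrative only, and in a real system the piece boundaries would additionally respect record boundaries, so the exact sizes are an assumption.

import math

def adaptive_shuffle(partition_sizes, ideal):
    # One (partition labels, total size) entry per Reduce worker: partitions
    # above the ideal size are split near-evenly into pieces no larger than the
    # ideal size; consecutive partitions below it are merged while the merged
    # size stays within the ideal size.
    workers, pending, pending_size = [], [], 0
    for pid, size in enumerate(partition_sizes):
        if size > ideal:
            if pending:                       # close the current merge group
                workers.append((pending, pending_size))
                pending, pending_size = [], 0
            n = math.ceil(size / ideal)       # number of sub-partitions
            chunk = math.ceil(size / n)       # near-even piece size
            remaining = size
            for _ in range(n):
                piece = min(chunk, remaining)
                workers.append(([f"P#{pid}"], piece))
                remaining -= piece
        else:
            if pending and pending_size + size > ideal:
                workers.append((pending, pending_size))
                pending, pending_size = [], 0
            pending, pending_size = pending + [f"P#{pid}"], pending_size + size
    if pending:
        workers.append((pending, pending_size))
    return workers

for i, (parts, size) in enumerate(adaptive_shuffle([8, 2, 43, 16], ideal=10)):
    print(f"R#{i}: {'+'.join(parts)} -> {size}")
# Reproduces the assignment of fig. 4C: R#0 gets P#0+P#1 (10), R#1-R#4 get
# pieces of P#2 (9 each), R#5 gets the remaining piece of P#2 (7), and R#6/R#7
# each get a piece of P#3 (8).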
The mechanism of splitting the Map-stage Partitions that are larger than the ideal data volume and merging those that are smaller can be called the Adaptive Shuffle mechanism. In some further embodiments, under the Adaptive Shuffle mechanism, the number of files finally written out by the Reduce stage depends mainly on the sizes of the input Partitions and the configured ideal data volume. Define the Ideal Parallelism of the Reduce stage as the total shuffle data volume divided by the ideal data volume; after Adaptive Shuffle, the number of files finally produced by the Reduce stage is at most Ideal Parallelism plus N, which guarantees at least one data file for each of the N Partitions that have output data, and only the splitting of Partitions larger than the ideal data volume increases the file count. The files produced by splitting are on the order of a Partition in size rather than fragmented small files. Provided that the ideal data volume is configured reasonably, however many files are produced is acceptable to a distributed system: as long as each file is of a suitable size, a large amount of data may reasonably be stored across multiple files.
That is to say, under the Adaptive Shuffle mechanism, the data volume processed by each working node of the Reduce stage never exceeds the given ideal data volume, and each Partition smaller than the ideal data volume is written out as one file by one working node of the Reduce stage. This simultaneously solves the problems of massive small-file generation and potential data skew, and thus nicely resolves the dilemma of the dynamic-partition scenario. It should also be noted that with the Adaptive Shuffle mechanism there is no need to add an extra Random Shuffle Key to mitigate data skew: because the Adaptive Shuffle mechanism of the embodiments of the present application is deterministic and re-entrant throughout, the correctness of the output data is fundamentally guaranteed even when various retries occur in an unstable distributed environment.
The embodiments of the present application realize adaptive data arrangement by combining the runtime statistics-collection capability of the execution engine (master node) with its ability to dynamically adjust the job execution graph and the shuffle mode; based on the characteristics of the upstream stage's output data, the output is distributed and arranged intelligently, including the automatic splitting of skewed data Partitions and the merging of many small Partitions. This fundamentally resolves the data skew and worker long tails that a data shuffle may introduce, avoids the data fragmentation of shuffle-free alternatives, and can significantly improve the performance of the distributed system.
In addition to general optimization for Partition merging and splitting, the embodiments of the present application are specifically optimized for a specific scenario. Taking a Join scene as an example, an implementation scheme is described below in which, based on statistical information of output data of an upstream stage, Partition output by the upstream stage is split and then allocated to a direct downstream stage to perform Join operation.
During job execution in a distributed system, the Join operation is one of the most common and most complex operations. Beyond the challenges a distributed system must solve in general, the interaction of data arriving on different paths at the Join operator creates further data-processing scenarios. Uneven data distribution causes data skew and long tails in Join operations; these are common problems in distributed systems and have never been solved systematically.
FIG. 5A illustrates an example of the Join process in distributed SQL. As shown in FIG. 5A, two inputs provided by upstream stages (M1 and M2 in FIG. 5A) are joined at the downstream stage: the outputs of M1 and M2 are written out by Partition and shuffled to different working nodes of J3 according to Partition number to perform the data Join, with the intermediate data stored on the physical medium arranged by Partition.
Fig. 5A shows a case where the Partition data of the upstream stages is distributed fairly evenly, but in real queries and data processing the Partition data of the upstream stages is very likely to be skewed. FIG. 5B shows another example of the Join process in distributed SQL. As shown in fig. 5B, Partition1 (the Partition numbered 1) of the input that M1 provides to J3 is severely skewed, and Partition1 of the input that M2 provides to J3 is slightly skewed; in this case, when J3 joins Partition1, the corresponding working node of J3 suffers a severe long tail and may even fail because of memory overrun.
To solve the above problem, fig. 5C shows another flowchart of the distributed job adjustment method provided in the embodiment of the present application. The method flow may be implemented by the master node, and referring to fig. 5C, the method flow may include the following steps.
In step S510, a job submitted by a user is acquired.
In step S511, an execution plan of the job is generated, the execution plan including a plurality of stages including a Join stage and a direct upstream stage of the Join stage, the direct upstream stage providing multiple paths of input data to the Join stage; wherein, one path of input data comprises a plurality of partitions.
In the Join scenario, the Join stage, as the direct downstream stage, may receive multiple paths of input data provided by its direct upstream stage(s), and a single path of input data may include a plurality of Partitions.
In step S512, during execution of the job, if a target Partition with data skew exists in any path of input data, the target Partition is split into a plurality of sub-Partitions, and the sub-Partitions are allocated to a plurality of working nodes of the Join stage.
During job execution, the master node may obtain the Statistics of the output data of any stage. Based on these Statistics, if a Partition with data skew exists in any path of input data provided by the direct upstream stage to the Join stage (for ease of description, such a Partition is called a target Partition), the embodiments of the present application split the target Partition into a plurality of sub-Partitions and allocate them to a plurality of working nodes of the Join stage.
In some embodiments, for any Partition in any path of input data, it is determined whether the data volume of the Partition is larger than a first data-volume threshold; if so, the Partition is determined to be a target Partition with data skew.
In some embodiments, when splitting a target Partition, the embodiments of the present application may split it based on a second data-volume threshold, so that the sub-Partitions obtained from the split are uniform in data volume and none exceeds the second data-volume threshold. In some further embodiments, the second data-volume threshold is smaller than the first; the specific values of both thresholds may be set according to the actual situation and are not limited by the embodiments of the present application. In one implementation example, the second data-volume threshold may be, for example, the ideal data volume described above.
In some embodiments, when the sub-Partitions obtained by splitting a target Partition are assigned to multiple working nodes of the Join stage, one sub-Partition may be assigned to one working node of the Join stage.
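A minimal sketch of the detection-and-split step, assuming per-Partition byte counts are available and using hypothetical first_threshold and second_threshold parameters in place of the first and second data-volume thresholds:

import math

def split_skewed_partitions(partition_bytes, first_threshold, second_threshold):
    # partition_bytes: partition number -> data volume in one input path of the Join.
    # A partition exceeding first_threshold is treated as a skewed target
    # Partition and cut into near-even sub-partitions no larger than
    # second_threshold; each resulting entry is handled by one Join worker.
    plan = {}
    for pid, size in partition_bytes.items():
        if size > first_threshold:
            n = math.ceil(size / second_threshold)
            chunk = math.ceil(size / n)
            pieces, remaining = [], size
            for _ in range(n):
                piece = min(chunk, remaining)
                pieces.append(piece)
                remaining -= piece
            plan[pid] = pieces            # one sub-partition per worker
        else:
            plan[pid] = [size]            # kept whole, one worker
    return plan

print(split_skewed_partitions({1: 120, 2: 15, 3: 55},
                              first_threshold=50, second_threshold=20))
# {1: [20, 20, 20, 20, 20, 20], 2: [15], 3: [19, 19, 17]}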
In step S513, the Partitions in the other paths of input data that have the same Partition number as the sub-Partitions are broadcast to the working nodes to which the sub-Partitions are assigned.
To perform the Join correctly at the Join stage, after the target Partition of one path of input data has been split into sub-Partitions and the sub-Partitions have been distributed to working nodes of the Join stage, the embodiments of the present application determine, for each sub-Partition, the Partition in the other paths of input data (the paths other than the one containing the target Partition) that has the same Partition number, and broadcast that Partition to the working node to which the sub-Partition is assigned, so that the sub-Partition can be correctly joined at the Join stage with the data of the same Partition number from the other paths.
In some embodiments, the multiple paths of input data provided by the direct upstream stage may include at least a first path of input data and a second path of input data, which may be any two of the paths; the number of paths may be greater than two and is not limited to exactly two. In this setting, a target Partition may occur in the multiple paths of input data in the following cases.
In the first case, one path of the multiple paths of input data contains one target Partition. Taking the case where the first path contains one target Partition as an example, the master node splits the target Partition of the first path into a plurality of sub-Partitions and assigns each sub-Partition to a working node of the Join stage; the Partition of the other paths (the paths other than the first) with the same Partition number as the target Partition is broadcast to the working nodes to which the sub-Partitions are assigned.
As an implementation example, fig. 5D shows yet another example of the Join process for the first case. FIG. 5D shows the two inputs of the Join (M1 and M2), with data skew present in one of them. Assuming the working nodes executing the Join do not need to maintain additional properties such as Partitioning or Sorting, then after the master node has collected the data volume of every Partition output by M1 and M2, it can determine that Partition1 output by M1 is a skewed target Partition, and therefore split it into a plurality of sub-Partitions. In fig. 5D, the small boxes inside the box of M1's Partition1 represent the split sub-Partitions. The sub-Partitions of M1's Partition1 are assigned to multiple working nodes of the Join stage for processing. Meanwhile, so that the split sub-Partitions can be joined correctly at the Join stage, the master node broadcasts the Partition of M2 with the same Partition number as the target Partition to the working nodes assigned to the sub-Partitions; for example, the master node may broadcast Partition1 produced by M2 to every working node assigned a sub-Partition of M1. By broadcasting the Partitions of M1 and M2 with the same Partition number to multiple working nodes of the Join stage, data of the same Partition number is guaranteed to be joined correctly and produce results even though the target Partition has been split into sub-Partitions.
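The pairing described for fig. 5D can be written down as a small assignment table; the identifiers below (sub-partition and worker names) are illustrative assumptions.

def plan_single_side_skew_join(m1_sub_partitions, m2_partition_id):
    # m1_sub_partitions: sub-partition ids obtained by splitting the skewed
    # target Partition of the first input path (e.g. M1's Partition1).
    # Each sub-partition goes to its own Join worker, and the Partition of the
    # other path with the same partition number (e.g. M2's Partition1) is
    # broadcast to every one of those workers.
    assignments = []
    for worker_id, sub in enumerate(m1_sub_partitions):
        assignments.append({
            "worker": f"J#{worker_id}",
            "m1_input": sub,                 # one split piece
            "m2_input": m2_partition_id,     # full partition, broadcast
        })
    return assignments

for a in plan_single_side_skew_join(["M1.P1.s0", "M1.P1.s1", "M1.P1.s2"], "M2.P1"):
    print(a)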
In the second case, one path of the multiple paths of input data contains several target Partitions. Taking the case where the first path contains several target Partitions as an example, each target Partition of the first path is processed in the same way as in the first case: the master node splits the target Partition into sub-Partitions, allocates them to working nodes of the Join stage, and broadcasts the Partitions of the other paths with the same Partition number to the working nodes assigned to those sub-Partitions. When the first path contains multiple target Partitions, each of them is processed in this way.
That is to say, whether the first path of input data contains one target Partition or several, for any target Partition the embodiments of the present application split it into sub-Partitions, assign each sub-Partition to a working node of the Join stage, and broadcast the Partitions of the other paths that share its Partition number to the working nodes assigned to its sub-Partitions.
Referring further to fig. 5D, if several Partitions of a single Join input are skewed, each skewed Partition can be split, and the Partition of the other path with the same Partition number is broadcast to the corresponding working nodes of the Join stage, so that the split sub-Partitions are correctly joined with the data of the same Partition number from the other path. For example, if Partition3 output by M1 in fig. 5D is also skewed, it can likewise be split into multiple sub-Partitions assigned to multiple working nodes of the Join stage, while Partition3 of M2 is broadcast to the working nodes assigned to the sub-Partitions of M1's Partition3.
In the third case, at least two paths of the multiple paths of input data each contain a target Partition. Taking the case where the first and second paths each contain one target Partition as an example, the master node splits the target Partition of the first path into sub-Partitions and assigns them to working nodes of the Join stage, and likewise splits the target Partition of the second path into sub-Partitions and assigns them to working nodes of the Join stage. For sub-Partitions of the first and second paths that share the same Partition number, each sub-Partition must be broadcast to the working nodes assigned to the sub-Partitions of the other path with that Partition number: the sub-Partitions of the first path are broadcast to the working nodes assigned to the same-numbered sub-Partitions of the second path, and the sub-Partitions of the second path are broadcast to the working nodes assigned to the same-numbered sub-Partitions of the first path.
FIG. 5E shows yet another example of the Join process. As shown in fig. 5E, Partition1 of both M1 and M2 is skewed: Partition1 output by M1 on the left is split into multiple sub-Partitions, and Partition1 output by M2 on the right is also split into multiple sub-Partitions; the sub-Partitions of M1's Partition1 are assigned to multiple working nodes of the Join stage, the sub-Partitions of M2's Partition1 are assigned to multiple working nodes of the Join stage, and each sub-Partition goes to one working node. In fig. 5E, the small boxes inside the boxes of M1's and M2's Partition1 represent the split sub-Partitions. To ensure that sub-Partitions of M1 and M2 with the same Partition number can be joined correctly, the master node broadcasts the sub-Partitions of M1's Partition1 to the working nodes assigned to the same-numbered sub-Partitions of M2, and likewise broadcasts the sub-Partitions of M2's Partition1 to the working nodes assigned to the same-numbered sub-Partitions of M1.
In the fourth case, at least two paths of the multiple paths of input data each contain several target Partitions. Taking the case where the first and second paths both contain several target Partitions as an example, each pair of identically numbered target Partitions of the first and second paths is processed in the same way as in the third case; when both paths contain multiple target Partitions, every such pair is processed in this way.
The examples above assume that the working nodes of the Join stage implement a pure Join operator whose data carries no Sorted/Partitioned properties. In real jobs, the Join is the most complex and flexible operation in SQL syntax, and a working node may contain a variety of operators. In that case, when a Partition input to the Join is skewed, besides splitting the Partition as in the examples above, a union operation must be added after the Join so that the split sub-Partitions are folded back together after joining. Whenever a special property of the data must be preserved (for example, the data is not written out directly but has further downstream operations or stages), the correctness of the subsequent execution needs to be guaranteed by this union operation.
FIG. 5F is an exemplary diagram of the union operation, building on FIG. 5D. As shown in fig. 5D and 5F, when Partition1 of the single skewed input M1 of the Join is split, in addition to joining the sub-Partitions of M1's Partition1 with M2's Partition1 as in fig. 5D, the Join results of the sub-Partitions must also be union-ed to produce a new Partition1.
FIG. 5G is an exemplary diagram of the union operation, building on FIG. 5E. When Partition1 of both Join inputs M1 and M2 is skewed, in addition to joining the same-numbered sub-Partitions of M1's Partition1 and M2's Partition1 as in fig. 5E, the Join results of those sub-Partitions must be union-ed to produce a new Partition1.
That is, the master node needs to configure a union of the Join results belonging to the same data partition, where the Join results of the same data partition include the results obtained by joining a data partition (or its sub-Partitions) of one path of input data with the identically numbered data partition (or its sub-Partitions) of the other paths of input data.
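A minimal sketch of folding the per-worker Join results of one partition number back together; the data layout is an assumption for illustration.

def union_join_results(results_by_partition):
    # results_by_partition: partition number -> list of Join result row lists,
    # one list per Join worker that handled a sub-partition of that number.
    # The per-worker results of the same partition number are folded back
    # (union-ed) into one logical partition for the downstream stage.
    return {pid: [row for worker_rows in rows_per_worker for row in worker_rows]
            for pid, rows_per_worker in results_by_partition.items()}

print(union_join_results({1: [[("a", 1)], [("b", 2)], [("c", 3)]]}))
# {1: [('a', 1), ('b', 2), ('c', 3)]}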
For skewed Join inputs, the embodiments of the present application adaptively split and distribute the skewed Partitions, whether the skew appears in one or several of the Join's input paths, so that the Join input data is distributed evenly while the correctness of the Join under the new shuffle mode is guaranteed. Compared with approaches that require heavy end-user intervention, such as manually rewriting the job's data-processing logic, the handling of skewed Join inputs in the embodiments of the present application is adaptive and general. In particular, since the master node collects statistics on the output data of every stage, the adjustment for data skew is carried out automatically, without manual tuning by the user. This adaptivity allows decisions to be made dynamically at run time according to the actual characteristics of the Join input, without the end user's awareness or participation, and distributes the data evenly over many distributed working nodes, so that jobs suffering from data skew are accelerated markedly and the performance of the distributed system is improved significantly.
In addition to the dynamic adjustment for Partition, according to the statistical information of the upstream stage, the embodiment of the present application may adaptively select the execution path of the subsequent execution. Specific implementations are described below.
The embodiments of the present application can dynamically adjust the logic diagram of the execution plan during job execution (that is, dynamically adjust the logic of the execution plan while the job runs). It should be noted that the choice of logic diagram may depend on data distributions and characteristics that can only be obtained accurately during job execution, and different data characteristics may require different logical execution plans to run efficiently and correctly. With a static execution plan, determined once and never adjusted during execution, a reasonable and accurate configuration of the execution-plan logic is therefore impossible to achieve.
Taking the Join operation in distributed SQL as an example, there are many Join algorithms that are logically equivalent but physically different, such as Sort Merge Join and Broadcast Join. Taking the Join of two source tables, Table1 and Table2, as an example, the implementations of Sort Merge Join and Broadcast Join are described below.
FIG. 6A shows an example of Sort Merge Join. As shown in fig. 6A, the two input source tables Table1 and Table2 are read and pre-processed by M1 and M2; for example, after M1 filters Table1, its output data is partitioned by the shuffle/Join key, and the outputs of M1 and M2 are merge-joined on the same key at the downstream working nodes. Implementing the Merge Join requires a full shuffle and sort of the outputs of M1 and M2 so that data of the same partition is guaranteed to reach the same downstream working node. In a distributed system, a Sort Merge Join backed by external sorting and similar techniques can handle any data volume, but the large amount of shuffling and sorting involved consumes considerable computation and network resources; moreover, when the data distribution is uneven, the shuffled data may produce a severe long tail that hurts execution efficiency.
FIG. 6B shows an example of Broadcast Join. While a distributed system must be able to join a large table (e.g. a fact table) with another large table, for a Join between a large table and a small table (e.g. a dimension table), if the data of the small table fits into the memory of a single working node, the computation and network cost of the distributed shuffle and sort can be avoided. As shown in fig. 6B, Table1 is the large table and Table2 the small table: the data of the small table is broadcast to all working nodes that read the large table, a full hash table is built from the small-table data, and the Join is then performed by hash lookup as the large-table data is read. The data of the large table (Table1) is read only once, and no shuffle or sort of it is needed. Besides saving the computation and network cost of shuffle and sort, the possible skew and long tail of the shuffle are avoided, because in the Broadcast Join pattern the Join logic actually runs inside the read logic of the large table (e.g. the Map stage); for this reason Broadcast Join is also called Broadcast Map Join. Of course, the resource and performance benefits of Broadcast Join come with a limited applicable range: the small-table data used to build the hash table must be loaded in full by a single working node, and if the optimizer chooses Broadcast Join but the small-table data exceeds the memory limit during execution, the whole job fails.
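For orientation only, the following generic sketch shows the broadcast hash-join idea itself, to contrast it with the shuffle-and-sort approach; it is not the implementation of this description, and the row format is assumed.

def broadcast_hash_join(big_rows, small_rows, big_key, small_key):
    # Broadcast (Map) Join: the small table is loaded in full into the memory
    # of every worker that scans the big table; no shuffle or sort of the big
    # table is needed, but the whole hash table must fit in a single worker.
    hash_table = {}
    for row in small_rows:                  # build side: the small table
        hash_table.setdefault(row[small_key], []).append(row)
    out = []
    for row in big_rows:                    # probe side: one pass over the big table
        for match in hash_table.get(row[big_key], []):
            out.append({**row, **match})
    return out

orders = [{"item": "pen", "country": "CN"}, {"item": "cup", "country": "US"}]
countries = [{"country": "CN", "region": "APAC"}, {"country": "US", "region": "AMER"}]
print(broadcast_hash_join(orders, countries, "country", "country"))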
As described above, Broadcast Join has a clear performance advantage, while Sort Merge Join is the more general algorithm, so the optimizer of the distributed system must judge accurately and reasonably when choosing between them: configure execution-plan logic that is as efficient as possible while ensuring the job can complete successfully. In a real online scenario, however, it is very difficult for the optimizer to make this judgement before the job executes, mainly for the following reasons:
Lack of accurate statistics: the data stored in a distributed system comes from many sources, and factors such as the import channel and import time of the source-table data may leave its statistics missing or inaccurate (for example, a freshly imported table for which statistics have not yet been generated, or a table whose content has just changed past the threshold at which stale statistics are discarded). In short, missing or inaccurate statistics prevent the optimizer from accurately predicting the size of the Join's upstream input, which may be the source table itself or the output of the source table after some logical transformation;
Complex and varied data-processing logic and data characteristics: even with complete and accurate statistics on the source-table data, the Join may sit at any working node inside the data-processing flow (inside the DAG), and upstream of it the source-table data may pass through complex transformations such as selection, filtering and aggregation interspersed with user-defined code logic (UDFs), which makes it hard for the optimizer to estimate the Join's input data volume in advance.
Because of these limitations, an optimizer that statically selects the Join algorithm before the job executes faces a dilemma. On the one hand, when the size of the input data cannot be estimated accurately, it can only choose the Map Join execution plan as conservatively as possible, for example by setting the small-table threshold very low, and thereby loses a great many optimization opportunities; even with a very low threshold, estimation errors and data expansion can still cause a mis-chosen Broadcast Join to fail the job, and since such failures feed back into the optimizer's strategy, it tends to become still more conservative, creating a negative loop. On the other hand, the triggering of Broadcast Join relies largely on manually added Map Join hints, i.e. the user, rather than the optimizer, decides when a Broadcast Join plan is generated; this adds maintenance burden for the user, and in practice the user can only judge the size of source-table data accurately and cannot know the output size of non-source-table data after transformations, so even a user-specified Map Join hint cannot prevent job failures caused by changes in the data or the upstream processing logic.
In summary, asking the optimizer of the master node to choose the Join algorithm accurately before the job executes is undermined by various objective factors. In general, the characteristics of the data (including its size and distribution) can only be obtained after the upstream working nodes have finished running, so an accurate choice of Join algorithm should be made during the execution of the distributed job rather than before it. Making that decision during job execution, however, challenges the dynamic DAG capabilities of the execution engine: different Join algorithms such as Sort Merge Join and Broadcast Join produce execution plans that differ not only in physical properties (concurrency, shuffle mode, and so on) but also, substantially, in the topological structure of the DAG. Dynamic adjustment during job execution therefore requires the ability to provide a dynamic logic diagram (that is, the logic of the execution plan can be adjusted dynamically); and since adjusting the logic diagram usually entails adjusting physical properties of the execution plan as well, both dynamic-logic-diagram and dynamic-physical-diagram capabilities of the DAG are actually required.
Based on the above description, fig. 6C shows yet another flowchart of the distributed job adjustment method provided in the embodiment of the present application. The method flow can be implemented by the main node. Referring to fig. 6C, the method flow may include the following steps.
In step S610, a job submitted by a user is acquired.
In step S611, generating an execution plan of the job, where the execution plan includes a plurality of stages including an upstream stage, and a control node, where the upstream stage has a plurality of candidate execution paths downstream, and one execution path includes one or more downstream stages of the upstream stage; the control node is used for selecting a target execution path actually executed by the upstream stage at the downstream from the plurality of execution paths.
In the embodiment of the present application, when generating the execution plan of the job, multiple candidate execution paths may be carried downstream of an upstream stage, and one execution path may include one or more downstream of the upstream stage. For example, an execution path may include a stage immediately downstream of the upstream stage, or a stage immediately downstream and a stage indirectly downstream. In some embodiments, one or more upstream stages in an execution plan may carry multiple execution paths of candidates downstream thereof.
In the embodiments of the present application, besides the plurality of stages, the execution plan may also include a control node arranged downstream of the upstream stage. The control node communicates with the multiple execution paths of the upstream stage and can select one of them as the target execution path actually executed downstream of the upstream stage. It should be noted that the control node is not an execution stage of the execution plan and does not actually schedule computing resources such as working nodes; it marks the place where the execution plan needs control logic downstream of the upstream stage so that the target execution path can be selected from the multiple candidates.
In some embodiments, the execution plan indicated by step S611 may be described by a DAG; for example, one or more upstream stages of the DAG may carry multiple candidate execution paths downstream. As an example, FIG. 6D illustrates an execution plan carrying multiple execution paths. As shown in fig. 6D, there are two candidate execution paths downstream of M1: Path0 and Path1. A control node C8_1 is arranged downstream of M1; it communicates with Path0 and Path1 and can select from them the target execution path actually executed downstream of M1.
In step S612, during the execution of the job, statistical information of the output data of the upstream stage is acquired.
In step S613, a target execution path is selected from the plurality of execution paths based on the statistical information by the control node.
During job execution, the master node may collect statistical information on the output data of every stage of the execution plan, including the output of an upstream stage that carries multiple execution paths downstream. For example, as shown in fig. 6D, after M1 finishes executing, the master node may collect the statistical information of M1's output data. Based on that information, the master node, through the control node, selects the target execution path to be actually executed from the multiple execution paths downstream of the upstream stage; for example, in fig. 6D, the control node C8_1 selects the target execution path actually executed downstream of M1 from Path0 and Path1 according to the statistical information of M1's output data.
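A minimal sketch of the decision the control node might make, assuming the relevant statistic is the upstream output size and that a hypothetical broadcast_limit_bytes stands in for the single-worker memory limit:

def choose_join_path(upstream_output_bytes, broadcast_limit_bytes):
    # Once the real output size of the upstream stage is known, the control
    # node can pick the Broadcast Join path when the data is small enough to be
    # loaded in full by a single worker, and otherwise fall back to the more
    # general Sort Merge Join path.
    if upstream_output_bytes <= broadcast_limit_bytes:
        return "Broadcast Join path"
    return "Sort Merge Join path"

print(choose_join_path(6 << 20, broadcast_limit_bytes=64 << 20))    # small output
print(choose_join_path(3 << 30, broadcast_limit_bytes=64 << 20))    # large output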
In the embodiment of the application, when the execution plan is generated, an upstream stage of the execution plan may have a plurality of candidate execution paths downstream, and during job execution a target execution path for final execution is selected from the candidate execution paths based on the statistical information of the output data of the upstream stage. In this way, the final execution path can be selected according to the actual output data of the upstream stage during job execution, so that the selection of the execution path is more reasonable and accurate. Executing the job along a reasonable and accurate execution path can significantly improve the performance of the distributed system.
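As an illustration of this mechanism, the sketch below shows how a control node might pick a downstream path from several candidates using statistics collected for the upstream stage's output; all class, field, and path names are illustrative assumptions rather than the engine's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ExecutionPath:
    name: str                              # e.g. "Path0" / "Path1"
    stages: List[str]                      # downstream stages on this path
    is_applicable: Callable[[Dict], bool]  # predicate over upstream statistics


class ControlNode:
    """Logical-only node: it schedules no worker, it only selects a path."""

    def __init__(self, candidate_paths: List[ExecutionPath]):
        self.candidate_paths = candidate_paths

    def select_target_path(self, upstream_stats: Dict) -> ExecutionPath:
        for path in self.candidate_paths:
            if path.is_applicable(upstream_stats):
                return path
        return self.candidate_paths[-1]    # fall back to the last (most general) candidate


# Example: pick Path0 when the upstream output is small, otherwise Path1.
c8_1 = ControlNode([
    ExecutionPath("Path0", ["J6"], lambda s: s["output_bytes"] < (512 << 20)),
    ExecutionPath("Path1", ["R3", "J6"], lambda s: True),
])
print(c8_1.select_target_path({"output_bytes": 100 << 20}).name)  # Path0
```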
As an alternative implementation, the candidate execution paths carried downstream by the upstream stage of the execution plan may include two execution paths, which are the first execution path and the second execution path respectively. FIG. 7A illustrates a flow diagram for generating an execution plan that carries multiple execution paths. The method flow can be implemented by the main node. Referring to fig. 7A, the method flow may include the following steps.
In step S710, a physical plan having a plurality of candidate information is generated according to the predicted data size of the source data of the job, wherein one candidate information represents a task that may be used by the source data during the execution of the job, and the plurality of candidate information includes a first candidate information and a second candidate information.
At the time of job submission, the optimizer may estimate the data size of the source data (e.g., a source table) of the job. If the estimated data size of the source data is smaller than a preset first threshold, the optimizer may trigger a conditional execution plan: based on the tasks that the source data may use during job execution, the optimizer generates a plurality of candidate information items for the source data and a physical plan having the plurality of candidate information items. In some embodiments, a task may include one or more operators, for example, a specific task is executed through the processing of its operators.
In one example, assuming that the source data is a source table, the source table may use Broadcast Join or Sort Merge Join when performing a Join operation during job execution. The optimizer may therefore generate two candidate information items for the source table based on the Broadcast Join and Sort Merge Join that the source table may use, where one candidate represents Broadcast Join and the other represents Sort Merge Join. In some embodiments, the physical plan may be regarded as a data structure in the form of a node tree.
In step S711, the physical plan is converted into an operator tree, the operator tree including one or more original tasks of the physical plan, one original task using one or more operators.
In step S712, when the preset operator is traversed for the first time, a task of the first execution path is newly added to the operator tree according to the first candidate information corresponding to the first execution path, and data pipeline connections are established between the operators in the newly added task and the related operators in the original tasks, so as to convert out the first execution path in the operator tree.
To generate an execution plan carrying multiple execution paths, the optimizer may convert the physical plan into an operator tree and then convert the multiple execution paths in the operator tree according to the multiple candidate information items. In some embodiments, the optimizer may perform multiple traversals of the operator tree, with an operator preset for path conversion; when the optimizer traverses the preset operator for the first time, it may convert the first execution path corresponding to the first candidate information among the plurality of candidate information items. In some embodiments, the plurality of candidate information items include at least first candidate information and second candidate information, where the first candidate information corresponds to the first execution path and records the tasks of the first execution path, and the second candidate information corresponds to the second execution path and records the tasks of the second execution path.
In some embodiments, when the optimizer traverses the preset operator for the first time, a task of the first execution path may be newly added to the operator tree according to the first candidate information, where the newly added task may include one or more operators. To enable the task of the first execution path to communicate with the original tasks of the operator tree, the optimizer may connect, via data pipelines, the operators of the newly added task with the related operators in the original tasks; a related operator in an original task is an operator associated with an operator of the newly added task (for example, an operator in the original task located upstream or downstream of the operator of the newly added task). Through the above process, the optimizer converts out the first execution path in the operator tree.
In step S713, when the preset operator is traversed for the second time, a task of the second execution path is added to the operator tree according to the second candidate information corresponding to the second execution path, and data pipeline connections are established between the operators in the added task and the related operators in the existing tasks, so as to convert out the second execution path in the operator tree.
In some embodiments, when the optimizer traverses the preset operator for the second time, the second execution path may be converted in the operator tree according to the second candidate information. The optimizer may add a task of the second execution path to the operator tree according to the second candidate information, where the added task may include one or more operators. To enable the task of the second execution path to communicate with the existing tasks in the operator tree, the optimizer may connect, via data pipelines, the operators of the added task with the related operators in the existing tasks; a related operator in an existing task is an operator associated with an operator of the added task (for example, an operator located upstream or downstream of the operator of the added task). Through the above process, the optimizer converts out the second execution path in the operator tree.
As an alternative implementation, step S712 may be regarded as an optional implementation of converting the first execution path in the operator tree based on the first candidate information, and step S713 may be regarded as an optional implementation of converting the second execution path in the operator tree based on the second candidate information.
In step S714, a control node is disposed upstream of the first execution path and the second execution path of the operator tree to obtain the execution plan.
After the first execution path and the second execution path have been converted in the operator tree, in order to allow the target execution path to be selected from the first execution path and the second execution path once the upstream stage of the two paths has finished executing during job execution, the optimizer may further set a control node upstream of the first execution path and the second execution path of the operator tree, so that the two paths can be chosen between dynamically during job execution through the control node. Through the above process, the embodiment of the application can generate an execution plan carrying multiple execution paths.
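The following sketch condenses steps S710-S714 into toy data structures, assuming hypothetical Task/OperatorTree types and candidate records; it only illustrates the two passes over the preset operator and the final placement of the control node, not the real optimizer.

```python
class Task:
    def __init__(self, name, operators):
        self.name, self.operators = name, list(operators)


class OperatorTree:
    def __init__(self, original_tasks):
        self.tasks = list(original_tasks)
        self.edges = []                      # (task_name, peer_task_name) data pipelines

    def add_task(self, task, connect_to):
        self.tasks.append(task)
        for peer in connect_to:              # data pipeline connections to related tasks
            self.edges.append((task.name, peer))


def build_conditional_plan(tree, preset_op, candidates, control_node_name="C8"):
    # One traversal per candidate: the first time the preset operator is met,
    # the first candidate's tasks (Path0) are added; the second time, Path1's.
    for cand in candidates:
        for task in list(tree.tasks):
            if preset_op in task.operators:
                for new_task, peers in cand["new_tasks"]:
                    tree.add_task(new_task, peers)
                break
    # Finally place a control node upstream of both paths to pick one at run time.
    tree.add_task(Task(control_node_name, ["Conditional"]),
                  connect_to=[cand["entry_task"] for cand in candidates])
    return tree


tree = OperatorTree([Task("M1", ["TableScan1", "Filter1"]),
                     Task("M2", ["TableScan2", "Filter2", "StreamlineWrite1"])])
candidates = [
    {"entry_task": "R3",
     "new_tasks": [(Task("R3", ["StreamlineRead1", "StreamlineWrite2"]), ["M1", "M2"])]},
    {"entry_task": "M5",
     "new_tasks": [(Task("M5", ["TableScan1", "Filter1", "StreamlineWrite3"]), []),
                   (Task("J6", ["StreamlineRead3", "StreamlineRead4", "ConditionalMapJoin1"]),
                    ["M2", "M5"])]},
]
build_conditional_plan(tree, "Filter1", candidates)
print([t.name for t in tree.tasks])   # ['M1', 'M2', 'R3', 'M5', 'J6', 'C8']
```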
For convenience of describing the above process of generating an execution plan carrying multiple execution paths, the embodiment of the present application introduces the concept of a conditional execution plan: the optimizer is allowed to generate, at job submission time, an execution plan having a plurality of execution paths based on a plurality of candidates; which candidate execution path is finally used is dynamically selected by the master node during job execution based on the statistical information of the output data generated by the upstream stage.
As an optional implementation, the optimizer may generate an execution plan carrying multiple execution paths through two steps, namely Cost-based Optimization and Execution plan generation.
In the Cost-based Optimization step, taking a Join scenario as an example, the optimizer may estimate the in-memory data size of the small table of the Join from information such as source table statistics in the build rule of the Join; for example, the in-memory size of the small table is estimated as its RowCount (number of rows) multiplied by its AverageRowSize. The optimizer may then determine whether the in-memory size of the small table is smaller than a preset first threshold (the first threshold may be preset and is denoted threshold1). If the in-memory size of the small table is smaller than the first threshold, the optimizer triggers a conditional Map Join of the database, for example generating multiple candidates for performing the Map Join on the small table data, where a candidate may represent a Join operator (e.g., Broadcast Join or Sort Merge Join) used by the small table. Since the conditional Map Join provides multiple candidates for a later dynamic decision, the optimizer may configure the first threshold threshold1 of the Cost-based Optimization step to a relatively large value (512M by default); that is, as long as it is determined, under a relatively loose criterion, that a job may benefit from Broadcast Join, a conditional Map Join plan is generated, and the final selection of the execution path is handed to the execution engine to be made dynamically during job execution.
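A hedged sketch of this check follows: the small table's in-memory size is estimated as RowCount multiplied by AverageRowSize and compared against threshold1; the constant and function names are assumptions, not the engine's real API.

```python
# Default threshold1: deliberately loose, 512M.
THRESHOLD1_BYTES = 512 * 1024 * 1024


def estimate_small_table_bytes(row_count: int, average_row_size: int) -> int:
    # In-memory estimate = RowCount * AverageRowSize.
    return row_count * average_row_size


def should_generate_conditional_map_join(row_count: int, average_row_size: int) -> bool:
    return estimate_small_table_bytes(row_count, average_row_size) < THRESHOLD1_BYTES


# e.g. 2,000,000 rows of ~100 bytes -> ~200MB < 512MB, so a conditional
# Map Join plan (Broadcast Join and Sort Merge Join candidates) is generated.
print(should_generate_conditional_map_join(2_000_000, 100))  # True
```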
The aforementioned conditional Map Join plan may be regarded as a physical plan rather than the final execution plan; the physical plan may be structured as a RelNode tree (a tree of relational nodes, each representing a relational expression). In the Cost-based Optimization step, the physical plan includes a control node such as a conditional MapJoin node, and the control node may include multiple Paths (execution paths) based on the multiple candidates to express the paths that the Join algorithm may select, such as the execution paths corresponding to Broadcast Join and Sort Merge Join. Regardless of which Path is finally selected, the computation and data of the small table are shared among the multiple Paths. The cost of the conditional Map Join defined in the optimizer's cost model may lie between the costs of Broadcast Join and Sort Merge Join, with the weighting between the two costs determined by the probability of each Path being finally selected.
In the Execution plan generation step, the optimizer converts the physical plan generated in the previous step into the final execution plan (Execution plan), for example converting it into a physical operator tree that can be understood by the runtime component and constructing the DAG topology. The final execution plan thus includes a DAG consisting of tasks and edges, and an operator tree (i.e., a physical operator tree) inside each work node. In general, the above process is a post-order traversal of the physical plan, during which tasks are split out and edges are added dynamically. Unlike an ordinary query, the conditional Map Join needs to support dynamic decision and Path selection at run time, so a control task and dependency edges need to be added during execution plan generation; through the dependency edges, the scheduling dependency (with no data flow) between upstream and downstream work nodes and the simple control node can be described accurately. For example, an ordinary work node containing normal execution operators is connected to the control node through a dependency edge. With this setting, the control node can select the subsequent execution path during DAG execution by judging the actual size of the small table after it has been computed.
FIGS. 7B, 7C, 7D, and 7E exemplarily show the process of converting a physical plan into an execution plan. The master node may execute FIG. 7C after completing FIG. 7B, FIG. 7D after completing FIG. 7C, and FIG. 7E after completing FIG. 7D. In the figures, thick arrows indicate data pipelines, thin dashed arrows indicate control pipelines, and thin solid arrows indicate runtime operator data flow. In one example, if the master node traverses to the conditional MapJoin while traversing the physical plan and prepares to convert the multiple Paths contained in the conditional MapJoin, the master node may convert the input RelNode tree of the conditional MapJoin into an operator tree, and the operators in the operator tree may form the two tasks M1 and M2 shown in FIG. 7B; for example, M1 uses two operators, TableScan1 and Filter1, and M2 uses three operators, TableScan2, Filter2, and StreamlineWrite1.
Assume that the conditional MapJoin includes 2 candidates, one corresponding to Path0 and the other to Path1, where Path0 may correspond to the Join implementation of Broadcast Hash Join and Path1 to the Join implementation of Sort Merge Join. While traversing the operator tree, as shown in FIG. 7C, when the master node traverses Filter1 of M1 for the first time, Path0 may be converted. Specifically, based on the tasks recorded by the candidate of Path0, the master node may add a new task of Path0 to the operator tree and establish data pipeline connections between the added task and M1 and M2. As shown in FIG. 7C, R3 is a new task with two new operators, StreamlineRead1 and StreamlineWrite2, and M1 gains two new operators, StreamlineRead2 and ConditionalMapJoin1. The newly added ConditionalMapJoin1 in M1 is the operator of Broadcast Hash Join, and its operator type is HashJoin.
The above process is executed when the master node traverses to Filter1 in M1 for the first time, so as to convert out Path0 on the basis of the operator tree and thereby express the execution path of Path0 in the execution plan. During this process, the master node may record the operators originally belonging to M1 in the operator tree for later copying. From the four newly added operators it can be seen that R3 and M1 are currently being processed, and they are recorded as belonging to candidate Path0.
The conversion of Path1 then begins. Path1 and Path0 share the StreamlineWrite and Filter operators, e.g., StreamlineWrite1 and Filter1. Because Path1 and Path0 share the operator Filter1, Path1 can be converted when the master node traverses Filter1 a second time. Specifically, based on the tasks recorded by the candidate of Path1, the master node may add new tasks of Path1 to the operator tree in which Path0 has been converted, and establish data pipeline connections between the operators of the added tasks and the related operators in the current tasks. As shown in FIG. 7D, M5 and J6 are the newly added tasks. When the task M5 is added, the previously recorded operators of M1 (such as TableScan1 and Filter1 in FIG. 7D) can be copied into M5, and a StreamlineWrite3 operator is added to M5. When the task J6 is added, three operators, StreamlineRead3, StreamlineRead4 and ConditionalMapJoin1, can be added to J6, and operator-level data pipelines between J6 and M2 and M5 are established.
As shown in FIG. 7E, after the conversion of Path0 and Path1 is completed, a control node C8 needs to be created in the operator tree. C8 contains only one operator, a Conditional operator, which is used to select the execution path of the Join from Path0 and Path1 based on the actual size of the small table at run time.
In some further embodiments, after the execution plan with multiple execution paths is generated, the master node may obtain the data amount of the output data of the upstream stage during job execution and determine whether the data amount is smaller than a preset second threshold; if so, the first execution path is selected as the target execution path, and otherwise the second execution path is selected as the target execution path. If the data processed by the upstream stage is a small table, then during job execution, once the task corresponding to the small table has finished, the control node may determine whether the actual output size of the small table is smaller than the preset second threshold (denoted threshold2): if it is smaller than the second threshold, the execution path of Broadcast Join is selected (i.e., the first execution path is the path that executes Broadcast Join); if it is larger than the second threshold, the execution path of Sort Merge Join is selected (i.e., the second execution path is the path that executes Sort Merge Join). It should be noted that threshold2 reflects the memory actually occupied when the small table data is loaded into the hash table at run time, and its default value may be the same as that of threshold1 (for example, 512M), although the embodiment of the present application also allows threshold2 to be set to a different value from threshold1. It should also be noted that the conditional execution plan is triggered only when the estimated size of the small table satisfies threshold1, and when the master node collects the real size of the small table during job execution, it decides the final execution path through threshold2. Therefore, threshold1 is used for the decision at the optimizer stage, threshold2 is used for the decision at the DAG execution stage, and the generation of the whole conditional plan and the selection of the final execution plan are accomplished by tuning these two values with the cooperation of the optimizer and the DAG.
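The run-time decision itself reduces to a comparison against threshold2, as in the sketch below; the path names mirror FIG. 6D, and the constant is the 512M default mentioned above rather than a prescribed value.

```python
# threshold2 defaults to the same 512M as threshold1, but may be set differently.
THRESHOLD2_BYTES = 512 * 1024 * 1024


def select_join_path(actual_small_table_bytes: int) -> str:
    # Compare the small table's actual output size against threshold2.
    if actual_small_table_bytes < THRESHOLD2_BYTES:
        return "Path0"   # Broadcast (Hash) Join
    return "Path1"       # Sort Merge Join


print(select_join_path(100 * 1024 * 1024))   # Path0
print(select_join_path(600 * 1024 * 1024))   # Path1
```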
In terms of dynamic selection of execution paths, as shown in FIG. 6D, when a job is initially submitted, the submitted DAG contains two possible execution paths (e.g., the two execution paths Path0 and Path1 shown by the dashed lines in FIG. 6D), because the final execution path has not yet been determined. Meanwhile, a control node (C8_1) is added to the DAG; the control node exists only in the logical control sense and does not pull up any work node, and it selects whether to execute Path0 or Path1 according to the output of M1. For example, during job execution, after M1 reads and processes the small table data, the data volume it actually outputs is collected by the master node, and the decision on the execution path is made at the control node based on that data volume. If Path0 is selected, the complete execution plan may be as shown in FIG. 7F; if Path1 is selected, the complete execution plan may be as shown in FIG. 7G.
The execution plan can be adjusted before the job is executed, so that a plurality of candidate execution paths exist downstream of an upstream stage of the execution plan, and during job execution a target execution path for final execution is selected from the candidate execution paths based on the statistical information of the output data of the upstream stage. In this way, the final execution path can be selected according to the actual output data of the upstream stage during job execution, so that the selection of the execution path is more reasonable and accurate; executing the job along a reasonable and accurate execution path can significantly improve the performance of the distributed system.
With the widespread use of Deep Learning, deep learning jobs place increasing processing demands on distributed systems, and various execution engines suitable for deep learning jobs have appeared. However, distributed systems still have various drawbacks in the scheduling and execution of deep learning jobs. For example, the native logic of a deep learning system (such as TensorFlow) relies entirely on an external system for scheduling and execution and is not configured within the execution engine of the distributed system. Taking the Parameter Server (PS) architecture as an example, the work nodes of the distributed system can be divided into two types: PS nodes and Worker nodes, where a PS node stores deep learning parameters (such as the parameters of a deep learning model) and a Worker node is used to compute the gradients of the deep learning parameters. In each iteration of deep learning, a Worker node obtains the deep learning parameters from the PS node and returns the computed gradients to the PS node; the PS node aggregates the gradients returned by the Worker nodes, updates the deep learning parameters, and broadcasts the updated parameters to the Worker nodes, so that the deep learning parameters are continuously adjusted. In the PS architecture of the distributed system, the PS nodes and Worker nodes have the following characteristics during operation:
the PS node plays a role that is significantly different from that of the Worker node and corresponds to a different stage of the execution plan. The stage corresponding to the PS node may be referred to as the PS stage (parameter server execution stage), and the stage corresponding to the Worker node may be referred to as the Worker stage;
the PS node is used as a serving entity of the Parameter and can independently operate;
because the Worker node uses and updates the parameters, it can run effectively only after the PS node is running, and it needs to continuously exchange data with the PS node during operation.
The above characteristics are difficult to describe in many distributed execution frameworks: although there is a scheduling dependency between the PS node and the Worker node, the two can run simultaneously, so the dependency cannot be mapped onto the logic of scheduling the downstream node only after the upstream node has finished running. As a result, in many external systems, the PS node and the Worker node can only be scheduled and run separately, corresponding to two isolated and unrelated stages in the execution plan, which may leave the Worker node completely idle before the PS node is scheduled and thus waste the Worker node's resources. In addition, because the description of the relationship between the PS node and the Worker node across stages is lost, many basic and dynamic capabilities cannot be realized.
In the deep learning field, the resources (particularly GPU resources) used by deep learning jobs are typically exposed directly to end users for specification. For example, the end user specifies the work node concurrency of the deep learning job and the resource size used by each work node (e.g., the number of GPUs used by one work node). However, it is difficult for users to select appropriate resources, and to ensure that the deep learning job has enough resources they often over-apply, which results in resource waste. For example, to ensure sufficient GPU resources for the deep learning job, a user may apply for multiple GPU cards for each work node when the actually needed GPU resources cannot be accurately predicted, yet only 25% of the GPU cards may be utilized during the actual execution of the job, leaving the remaining GPU resources idle and wasted. A cumulative effect of this situation is that a large amount of GPU resources on the distributed system is reserved by user applications, the GPU resources applied for may even exceed the total GPU resources of the distributed system, the actual GPU utilization of the distributed system stays low, and other jobs have to queue for GPU resources. On the other hand, many deep learning jobs are particularly sensitive to the resources they use (particularly GPU resources), and blindly lowering the resource limit allowed for users may degrade the performance of the deep learning jobs.
Given the above situation, how to guarantee the accuracy of resource applications for deep learning jobs and improve the resource utilization of the distributed system, so as to save the computing resources of the distributed system while ensuring that the execution performance of deep learning jobs is not affected, has become an urgent problem to be solved.
To solve the above problem, and based on the fact that vertices and connection edges in a DAG may each carry different logical and physical attributes, the embodiment of the present application introduces sequential edges and parallel (concurrent) edges as physical attributes of connection edges, decoupled from data transmission. A parallel edge indicates that the work nodes of the upstream and downstream stages it connects may be in a running state at the same time, while their scheduling still has an order and the scheduling timing can be customized. For example, the work nodes of the upstream and downstream stages connected by a parallel edge may be scheduled synchronously, or the work nodes of the downstream stage may be scheduled, triggered by an event, after the work nodes of the upstream stage have run to a certain point. A sequential edge indicates that the work nodes of the downstream stage it connects can be scheduled only after all or part of the work nodes of the upstream stage have finished executing.
Based on parallel edges, the relationship between the PS nodes and the Worker nodes of a deep learning job can be described more accurately. FIG. 8A illustrates an example of a PS stage (parameter server execution stage) and a Worker stage (worker execution stage) connected by a parallel edge. As shown in FIG. 8A, the PS stage is the stage corresponding to the PS node in the execution plan, the Worker stage is the stage corresponding to the Worker node, the PS stage is connected to the Worker stage through a parallel edge, and the parallel edge is input from the PS stage into the Worker stage. Therefore, the PS node and the Worker node can be in the running state at the same time, and the scheduling timing can be customized. In some embodiments, Table 1 below exemplarily shows the scheduling timings (referred to as scheduling types) available for the downstream node of a parallel edge connection, where the upstream node is the work node corresponding to the upstream stage of the parallel edge (e.g., the PS node corresponding to the PS stage shown in FIG. 8A) and the downstream node is the work node corresponding to the downstream stage of the parallel edge (e.g., the Worker node corresponding to the Worker stage shown in FIG. 8A).
Table 1 (presented as an image in the original publication) lists the candidate scheduling types for the downstream node of a parallel edge, including SOURCE_TASK_STARTED (the downstream node is scheduled once an instance of the upstream node has started) and SOURCE_TASK_PROGRESS (the downstream node is scheduled once the execution progress of the upstream node reaches a threshold).
Based on the physical attributes of sequential edges and parallel edges of connection edges, the scheduling types of downstream nodes, and the like, the embodiments of the present application can completely describe a variety of complex DAG logics and thus support various workloads, for example: batch jobs, streaming jobs, quasi-real-time/near-real-time jobs, deep learning jobs, and the like.
For deep learning jobs, taking a PS job as an example, because a Worker node uses and updates the parameters and can run only after the PS node is running, in the execution plan of the deep learning job the connection edge between the PS stage and the Worker stage is a parallel edge that is input from the PS stage into the Worker stage; that is, in the execution plan, the Worker stage acts as a stage immediately downstream of the PS stage and is connected to it by a parallel edge. Since the Worker is only meaningful once the PS has started running, the scheduling type of the downstream Worker node is SOURCE_TASK_STARTED, i.e., the downstream Worker node is scheduled after an instance of the upstream PS node has started. In terms of data generation, since the data on a PS node is valid only while it is in the running state and the upstream data source does not perceive the downstream state, the data source type is EPHEMERAL_STATELESS (transient stateless); in terms of data transmission type, since the data transmission between the PS node and the Worker node may not be perceived by the execution framework, the data transmission type is NONE (null). Therefore, the connection edge connecting the PS stage and the Worker stage can be described as {CONCURRENT, EPHEMERAL_STATELESS, NONE, SOURCE_TASK_STARTED}.
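For illustration, the following sketch encodes such an edge description as a small data structure; the enum and class names are assumptions made for this example and are not the engine's actual API.

```python
from dataclasses import dataclass
from enum import Enum


class EdgeKind(Enum):
    SEQUENTIAL = "SEQUENTIAL"   # downstream scheduled only after upstream finishes
    CONCURRENT = "CONCURRENT"   # both ends may be running at the same time


class DataSourceType(Enum):
    EPHEMERAL_STATELESS = "EPHEMERAL_STATELESS"   # data valid only while upstream runs


class TransportType(Enum):
    NONE = "NONE"               # transfer not perceived by the execution framework


class SchedulingTrigger(Enum):
    SOURCE_TASK_STARTED = "SOURCE_TASK_STARTED"     # schedule once an upstream instance starts
    SOURCE_TASK_PROGRESS = "SOURCE_TASK_PROGRESS"   # schedule once upstream progress hits a threshold


@dataclass
class ConnectionEdge:
    kind: EdgeKind
    source: DataSourceType
    transport: TransportType
    trigger: SchedulingTrigger


# The PS stage -> Worker stage edge described above.
ps_to_worker = ConnectionEdge(EdgeKind.CONCURRENT,
                              DataSourceType.EPHEMERAL_STATELESS,
                              TransportType.NONE,
                              SchedulingTrigger.SOURCE_TASK_STARTED)
```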
Based on the above description, during the execution of a deep learning job, the PS node and the Worker node may run simultaneously, while the Worker node can be scheduled for data processing only after the PS node is running. This way of describing the execution plan of the deep learning job makes it possible to dynamically adjust the job at run time. In particular, a deep learning execution engine requires the end user to provide a large number of configuration parameters for the execution plan, such as stage concurrency, required resource size and type, and distribution strategy, which are difficult for the end user to provide. Based on this, the embodiment of the present application introduces a Resource Optimization node into the execution engine of the distributed system; as a control node for job execution, it can coordinate and dynamically adjust resource-related requests.
In one example, taking a PS job as an example, the embodiment of the present application adds a new resource optimization node besides the PS node and the Worker node. The resource optimization node is responsible for deciding, according to certain rules, how to dynamically adjust the resources of the Worker node. On this basis, besides setting the parallel edge input from the PS stage into the Worker stage in the execution plan, it is also necessary to additionally set a Resource Optimization stage corresponding to the resource optimization node in the execution plan, together with a parallel edge input from the Resource Optimization stage into the Worker stage. The connection edge connecting the Resource Optimization stage and the Worker stage can be described as {CONCURRENT, EPHEMERAL_STATELESS, NONE, SOURCE_TASK_PROGRESS}. Unlike the connection edge between the PS stage and the Worker stage, the connection edge between the Resource Optimization stage and the Worker stage adopts SOURCE_TASK_PROGRESS as the scheduling type of the downstream node; that is, the Worker node is scheduled only after the execution progress of the upstream Resource Optimization node reaches a certain threshold.
During job execution, the PS node and the Resource Optimization node are started first. When an instance of the PS node has started, the downstream Worker node is notified to schedule; when the instance progress of the Resource Optimization node reaches the threshold, the downstream Worker node is also notified to schedule. After receiving both notifications, the Worker node can update the resources to be scheduled based on the notification from the Resource Optimization node and then start the corresponding instances (the resources change dynamically under the control of the Resource Optimization node). In some embodiments, the Resource Optimization node may dynamically adjust the following resources of the Worker node: node concurrency, and the usage requirements of resources such as GPU, CPU, and MEMORY.
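The gating of the Worker stage by these two notifications might be sketched as follows; the class and method names are assumptions for illustration only.

```python
class WorkerStageScheduler:
    """Schedules Worker instances only after both upstream notifications arrive."""

    def __init__(self):
        self.ps_started = False
        self.resource_info = None

    def on_ps_instance_started(self):
        # Notification 1: an instance of the PS node has started.
        self.ps_started = True
        self._maybe_schedule()

    def on_resource_optimization_progress(self, resource_info):
        # Notification 2: the Resource Optimization node reached its progress
        # threshold and supplied the (possibly adjusted) resource information.
        self.resource_info = resource_info
        self._maybe_schedule()

    def _maybe_schedule(self):
        if self.ps_started and self.resource_info is not None:
            print(f"scheduling Worker instances with {self.resource_info}")


sched = WorkerStageScheduler()
sched.on_ps_instance_started()
sched.on_resource_optimization_progress({"cpu_cores": 100, "gpu_cards": 50})
```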
As an alternative implementation, fig. 8B illustrates another flowchart of the distributed job adjustment method provided in this embodiment of the present application. The method may be implemented by a master node. Referring to fig. 8B, the method flow may include the following steps.
In step S810, a deep learning job submitted by a user is acquired.
In step S811, an execution plan of the deep learning job is generated, the execution plan including a plurality of stages including: the Worker stage and Resource Optimization stage.
In some embodiments, the Resource Optimization stage inputs the Worker stage through a parallel edge.
In some further embodiments, the plurality of stages may further include a PS stage, where the PS stage and the Resource Optimization stage are each input into the Worker stage through a parallel edge.
In an embodiment of the present application, a deep learning job (e.g., a PS job) can be described as a structure of PS, Worker, and Resource Optimization. That is, when an execution plan of the deep learning job is generated (the execution plan may be described by a DAG), a PS stage, a Worker stage, and a Resource Optimization stage may be included in the execution plan. And in the DAG graph, the PS stage is input into the Worker stage by a parallel edge, and the Resource Optimization stage is also input into the Worker stage by the parallel edge. In further embodiments, the descriptions of the parallel edges of the PS stage and the Worker stage, and the descriptions of the parallel edges of the Resource Optimization stage and the Worker stage can refer to the corresponding descriptions above, and are not expanded here.
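As a minimal illustration of this structure, the following sketch writes the plan down as plain data: three stages and two parallel edges feeding the Worker stage, each with its own scheduling trigger. The field names are assumptions, not the engine's real DAG interface.

```python
# Deep learning (PS job) execution plan described above, as plain data.
dl_execution_plan = {
    "stages": ["PS", "Worker", "ResourceOptimization"],
    "edges": [
        {"from": "PS", "to": "Worker",
         "kind": "CONCURRENT", "trigger": "SOURCE_TASK_STARTED"},
        {"from": "ResourceOptimization", "to": "Worker",
         "kind": "CONCURRENT", "trigger": "SOURCE_TASK_PROGRESS"},
    ],
}
print(dl_execution_plan["edges"][1]["trigger"])   # SOURCE_TASK_PROGRESS
```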
In step S812, during the execution of the deep learning job, a Resource Optimization node corresponding to the Resource Optimization stage is scheduled, and Resource information currently adapted to the Worker stage is determined by the Resource Optimization node.
In the embodiment of the application, the Resource Optimization node can schedule before the Worker node, for example, the Worker node is notified to schedule when the instance progress of the Resource Optimization node reaches a threshold value. In some embodiments, after the Resource Optimization node schedules, the Resource Optimization node may determine the Resource information currently adapted to the Worker stage. In some embodiments, the Resource Optimization node may determine the Resource information of the historical use matching the current execution state of the deep learning job, and use the Resource information of the historical use matching the current execution state as the Resource information currently adapted to the Worker stage.
As an alternative implementation, the resource information of historical use that matches the current execution state of the deep-learning job may include: resource information used by the history execution state history which is the same as or similar to the current execution state.
In some embodiments, the Resource Optimization node may determine Resource information adapted to a Worker stage from a history database based on the current execution state of the deep learning job, and the history database may record Resource information actually used by the deep learning job in each historical execution state, that is, the history database may record Resource information actually used by the deep learning job whose execution is finished in each execution state. Therefore, the Resource Optimization node can determine the Resource information suitable for the current execution state of the deep learning operation based on the record in the historical database so as to carry out Resource configuration on the Worker node.
In some embodiments, the Resource Optimization node may search a historical execution state similar to or the same as the current execution state from a historical database based on the current execution state of the deep learning job, and determine Resource information actually used by the searched historical execution state as Resource information currently adapted to the Worker stage. As an optional implementation, the Resource Optimization node may first search a historical execution state that is the same as the current execution state from the historical database, and if the historical execution state is found, the Resource information corresponding to the same historical execution state recorded in the historical database is used as the Resource information currently adapted to the Worker stage; if the resource information is not found, searching a historical execution state similar to the current execution state (for example, searching a historical execution state with the minimum difference from the current execution state), and taking the resource information corresponding to the similar historical execution state recorded in the historical database as the resource information currently adapted to the Worker stage. In some embodiments, the current execution state of the deep learning job is, for example, a current learning mode of the deep learning job, characteristics of current input data (e.g., a data amount of current training data), a current number of parameter iterations remaining, and the like.
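A minimal sketch of this lookup follows, assuming a toy record layout and a simple distance over two state features (both assumptions for illustration, not the patent's prescribed matching rule).

```python
from typing import List, Optional


def find_adapted_resources(current_state: dict,
                           history: List[dict]) -> Optional[dict]:
    # 1) Prefer a historical execution state identical to the current one.
    for record in history:
        if record["state"] == current_state:
            return record["resources"]
    # 2) Otherwise take the most similar state (smallest difference in a few features).
    def distance(record):
        s = record["state"]
        return (abs(s["train_rows"] - current_state["train_rows"]) +
                abs(s["remaining_iterations"] - current_state["remaining_iterations"]))
    return min(history, key=distance)["resources"] if history else None


history_db = [
    {"state": {"train_rows": 1_000_000, "remaining_iterations": 50},
     "resources": {"cpu_cores": 100, "gpu_cards": 50}},
]
print(find_adapted_resources({"train_rows": 900_000, "remaining_iterations": 40},
                             history_db))   # {'cpu_cores': 100, 'gpu_cards': 50}
```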
In step S813, the Resource information is configured for the Worker stage through the Resource Optimization node.
After determining, through the Resource Optimization node, the resource information currently adapted to the Worker stage (for example, the historically used resource information matching the current execution state), the master node may configure the resource information for the Worker stage in the execution plan through the Resource Optimization node, so that the resource information configured for the Worker stage is adapted to the current execution state of the deep learning job. Furthermore, when the Worker node is scheduled, it can schedule the resources it uses based on this resource information, so that the Worker node executes tasks with reasonable resources and the resource utilization of the Worker node is improved.
In some embodiments, the Worker node needs to schedule after the instance of the PS node is started and when the progress of the instance of the Resource Optimization node reaches a threshold.
In some embodiments, if the user has specified the resource information of the Worker stage in advance, the Resource Optimization node may adjust the user-specified resource information based on the determined resource information currently adapted to the Worker stage, so as to avoid resources sitting idle because the user over-specified the resources of the Worker stage. For example, suppose the user specifies that a Worker node uses 2 CPU cores and 1 GPU core; if the Resource Optimization node determines, based on the current execution state of the deep learning job, that the Worker node does not actually need that many resources, it may adjust the previously specified numbers of CPU and GPU cores, for example to 1 CPU core and half a GPU core per Worker node, thereby avoiding idle waste of CPU and GPU resources caused by over-specification.
As an implementation example, FIG. 8C illustrates the Resource Optimization node adjusting the resources of a Worker node. As shown in FIG. 8C, the resources originally planned for the Worker node are 200 CPU cores and 100 GPU cores (which may be specified by the user). During the execution of the deep learning job, the Resource Optimization node may search the history database for the resource information actually used by a similar or identical historical execution state; for example, a historical execution state similar to or the same as the current execution state actually used 100 CPU cores and 50 GPU cores, so the Resource Optimization node may adjust the number of CPU cores configured for the Worker node to 100 and the number of GPU cores to 50, allowing the Worker node to guarantee the execution of the deep learning job with a lower resource configuration.
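A minimal sketch of the adjustment illustrated in FIG. 8C follows; the element-wise minimum over resource types is an assumption about one possible adjustment rule, not the patent's prescribed formula.

```python
def adjust_worker_resources(user_specified: dict, history_matched: dict) -> dict:
    # Lower each user-specified amount to the historically used amount, never raise it.
    return {name: min(user_specified.get(name, amount), amount)
            for name, amount in history_matched.items()}


print(adjust_worker_resources({"cpu_cores": 200, "gpu_cards": 100},
                              {"cpu_cores": 100, "gpu_cards": 50}))
# {'cpu_cores': 100, 'gpu_cards': 50}
```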
In some further embodiments, the Resource Optimization node may preset a plurality of resource requirement schemes for the Worker node; according to the current execution state of the deep learning job, it may select the resource requirement scheme matching the current execution state from the plurality of schemes and configure the resource information for the Worker stage according to the selected scheme. The resource requirement scheme matching the current execution state may be considered as the scheme that can satisfy the execution of the job in the current execution state with the least resource demand.
According to the embodiment of the application, resources can be configured for the Worker node with higher precision based on the current execution state of the deep learning job, which saves the computing resources of the distributed system without affecting the execution performance of the Worker node executing the deep learning job, and improves the resource utilization of the distributed system. After the solution is deployed online in large-scale production clusters, the utilization of resources (particularly GPU resources) by deep learning jobs can be greatly improved without affecting the time users actually spend on deep learning training; meanwhile, job throughput is greatly improved and job queuing can be greatly alleviated.
According to the embodiment of the application, the description of execution plans for Map-Reduce and similar offline job modes can be extended so that the execution plan can accurately describe deep learning jobs including PS jobs. This way of describing the execution plan avoids idle waiting of work node resources and provides a more accurate job execution control flow. In addition, the embodiment of the application can dynamically select and adjust, at run time, the resources such as GPUs that a deep learning job actually needs, ensuring that resource usage is adapted to the actual running requirements and algorithmic characteristics of the job and that resources are used efficiently, thereby avoiding the contradiction between excessive resource applications and low actual resource utilization in a large-scale multi-tenant distributed system.
The embodiment of the present application further provides a host node, where the host node may be configured to execute the distributed job adjustment method provided in the embodiment of the present application.
The embodiments of the present application further provide a distributed system, where the structure of the distributed system may be combined with the description in the corresponding section, and the distributed system may include the above-mentioned master node.
The embodiment of the application further provides a physical machine, and the physical machine can be provided with the master node provided by the embodiment of the application. As an alternative implementation, FIG. 9 shows a block diagram of a physical machine. As shown in FIG. 9, the physical machine may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4. In the embodiment of the present application, the number of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4. Alternatively, the communication interface 2 may be an interface of a communication module for performing network communication. Alternatively, the processor 1 may be a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an NPU (embedded neural network processor), an FPGA (Field Programmable Gate Array), a TPU (Tensor Processing Unit), an AI chip, an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present application. The memory 3 may comprise a high-speed RAM memory and may also comprise a non-volatile memory, such as at least one disk memory. The memory 3 stores one or more computer-executable instructions, and the processor 1 calls the one or more computer-executable instructions to execute the distributed job adjustment method provided by the embodiment of the present application.
Embodiments of the present application further provide a storage medium, where the storage medium may store one or more computer executable instructions, and when the one or more computer executable instructions are executed, the distributed job adjustment method provided in the embodiments of the present application is implemented.
The embodiment of the present application further provides a computer program, where the computer program is used to execute the distributed job adjustment method provided in the embodiment of the present application.
While various embodiments provided by the embodiments of the present application have been described above, the alternatives described in the various embodiments can be combined and cross-referenced without conflict to extend the variety of possible embodiments that can be considered as disclosed in the embodiments of the present application.
Although the embodiments of the present application are disclosed above, the present application is not limited thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present disclosure, and it is intended that the scope of the present disclosure be defined by the appended claims.

Claims (10)

1. A distributed job adjustment method, comprising:
obtaining a deep learning job;
generating an execution plan for the deep learning job, the execution plan including a plurality of execution stages including: a worker execution stage and a resource optimization execution stage; the worker execution stage is used for calculating gradients of deep learning parameters;
in the execution process of the deep learning job, scheduling a resource optimization node corresponding to the resource optimization execution stage, and determining, through the resource optimization node, historically used resource information that matches the current execution state of the deep learning job;
and configuring the resource information for the worker execution stage through the resource optimization node.
2. The method of claim 1, wherein the execution plan is described by a DAG; the plurality of execution stages further comprises: a parameter server execution stage; the parameter server execution stage inputs the worker execution stage through a parallel edge, the resource optimization execution stage inputs the worker execution stage through a parallel edge, and the working nodes of an upstream execution stage and a downstream execution stage connected by a parallel edge can be in a running state at the same time.
3. The method of claim 2, wherein the downstream worker node is scheduled when an instance of the upstream parameter server node has started and the execution progress of the upstream resource optimization node reaches a certain threshold; the worker node corresponds to the worker execution stage, the parameter server node corresponds to the parameter server execution stage, and the resource optimization node corresponds to the resource optimization execution stage.
4. The method of claim 1, wherein the determining, by the resource optimization node, historically used resource information that matches a current execution state of a deep learning job comprises:
based on the current execution state of the deep learning job, searching a history database for a historical execution state similar to or the same as the current execution state, and determining the resource information actually used by the found historical execution state as the resource information currently adapted to the worker execution stage; wherein the history database records the resource information actually used by historical deep learning jobs in each historical execution state.
5. The method of claim 3, further comprising:
and when the worker node is scheduled, the worker node schedules the resources it uses based on the resource information.
6. The method of claim 1, further comprising:
and if the user has specified the resource information of the worker execution stage in advance, adjusting, by the resource optimization node, the user-specified resource information based on the determined resource information.
7. A master node, wherein the master node is configured to perform the distributed job adjustment method of any one of claims 1-6.
8. A distributed system, wherein the distributed system comprises a master node and a plurality of worker nodes, the master node being the master node of claim 7.
9. A physical machine, wherein the physical machine comprises at least one memory and at least one processor, the memory storing one or more computer-executable instructions that the processor invokes to perform the distributed job adjustment method of any of claims 1-6.
10. A storage medium, wherein the storage medium stores one or more computer-executable instructions that, when executed, implement a distributed job adjustment method as recited in any of claims 1-6.
CN202111583453.8A 2021-08-18 2021-08-18 Distributed job adjustment method, master node, system, physical machine, and storage medium Pending CN114490027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111583453.8A CN114490027A (en) 2021-08-18 2021-08-18 Distributed job adjustment method, master node, system, physical machine, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111583453.8A CN114490027A (en) 2021-08-18 2021-08-18 Distributed job adjustment method, master node, system, physical machine, and storage medium
CN202110950182.9A CN113407354B (en) 2021-08-18 2021-08-18 Distributed job adjustment method, master node, system, physical machine, and storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202110950182.9A Division CN113407354B (en) 2021-08-18 2021-08-18 Distributed job adjustment method, master node, system, physical machine, and storage medium

Publications (1)

Publication Number Publication Date
CN114490027A true CN114490027A (en) 2022-05-13

Family

ID=77688654

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202111583453.8A Pending CN114490027A (en) 2021-08-18 2021-08-18 Distributed job adjustment method, master node, system, physical machine, and storage medium
CN202110950182.9A Active CN113407354B (en) 2021-08-18 2021-08-18 Distributed job adjustment method, master node, system, physical machine, and storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110950182.9A Active CN113407354B (en) 2021-08-18 2021-08-18 Distributed job adjustment method, master node, system, physical machine, and storage medium

Country Status (1)

Country Link
CN (2) CN114490027A (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10108683B2 (en) * 2015-04-24 2018-10-23 International Business Machines Corporation Distributed balanced optimization for an extract, transform, and load (ETL) job
CN106815071A (en) * 2017-01-12 2017-06-09 上海轻维软件有限公司 Big data job scheduling system based on directed acyclic graph
CN108762902A (en) * 2018-05-22 2018-11-06 齐鲁工业大学 Multi-scenario tasks dispatching method and device in Distributed Calculation based on Spark
CN110502337B (en) * 2019-07-12 2023-02-07 上海交通大学 Optimization system for shuffling stage in Hadoop MapReduce
CN110673794B (en) * 2019-09-18 2021-12-17 中兴通讯股份有限公司 Distributed data equalization processing method and device, computing terminal and storage medium
CN113157413B (en) * 2021-04-16 2022-04-26 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement

Also Published As

Publication number Publication date
CN113407354B (en) 2022-01-21
CN113407354A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
US11249997B1 (en) System-wide query optimization
Khorasani et al. Scalable simd-efficient graph processing on gpus
CN107679192B (en) Multi-cluster cooperative data processing method, system, storage medium and equipment
CN107111653B (en) Query optimization of system memory load for parallel database systems
US7499960B2 (en) Adaptive memory allocation
US6601058B2 (en) Data exploration system and method
US6505187B1 (en) Computing multiple order-based functions in a parallel processing database system
US10204140B2 (en) Massively parallel and in-memory execution of grouping and aggregation in a heterogeneous system
US8001109B2 (en) System and method for automating data partitioning in a parallel database
CN101359333B (en) Parallel data processing method based on latent dirichlet allocation model
JPH06214843A (en) Data base management system and processing method for inquiry
WO2000020995A1 (en) Improved optimizations in a data exploration system and method
CN110874271B (en) Method and system for rapidly calculating mass building pattern spot characteristics
Wang et al. Elastic pipelining in an in-memory database cluster
CN104778077A (en) High-speed extranuclear graph processing method and system based on random and continuous disk access
CN115114294A (en) Self-adaption method and device of database storage mode and computer equipment
CN116756150B (en) Mpp database large table association acceleration method
CN108108242B (en) Storage layer intelligent distribution control method based on big data
Wang et al. Adaptive time, monetary cost aware query optimization on cloud database systems
CN112000845B (en) Hyperspatial hash indexing method based on GPU acceleration
CN113407354B (en) Distributed job adjustment method, master node, system, physical machine, and storage medium
CN116089414B (en) Time sequence database writing performance optimization method and device based on mass data scene
CN110928648B (en) Heuristic and intelligent computing-fused cloud workflow segmentation online scheduling optimization method
Azez et al. JOUM: an indexing methodology for improving join in hive star schema
CN114063888A (en) Data storage system, data processing method, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination