CN109857535B - Spark JDBC-oriented task priority control implementation method and device

Info

Publication number: CN109857535B
Application number: CN201910122390.2A
Authority: CN
Inventors: 刘欣然; 张鸿; 惠榛; 吕雁飞; 马秉楠; 李斌斌; 王振宇; 黄航; 王树鹏
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2019-02-18
Filing date: 2019-02-18
Publication date: 2021-06-11
Anticipated expiration: 2039-02-18
Also published as: CN109857535A

Abstract

The invention discloses a method and a device for implementing task priority control for Spark JDBC. The method comprises the following steps: when the Spark JDBC service starts, establishing a plurality of task priority queues according to a pre-written XML file describing the priority queues; receiving a priority queue command issued by a user through the JDBC interface, and completing the priority setting at the JDBC session level; receiving retrieval SQL submitted by the user, generating a Spark Task set after the SQL statement passes through several parsing and planning stages, and adding the Spark Task set to the target priority queue with the corresponding name; and having the resource scheduler schedule and allocate hardware resources according to the resource allocation policy among the priority queues and the resource allocation policy within each queue, and dispatching the Spark Tasks to the Task executors on the computing nodes for execution.

Description

Spark JDBC-oriented task priority control implementation method and device

Technical Field

The invention relates to the field of big data processing, in particular to a Spark JDBC-oriented task priority control implementation method and device.

Background

With the continuous development of computer technology and increasing informatization, data volumes are growing rapidly, and storage and applications for massive data have developed quickly as well. For massive data retrieval, Spark JDBC, a distributed retrieval framework from the Apache Foundation, provides a HiveQL interface like that of Hive, offers high efficiency and usability, and is widely used in the field.

After a user submits an SQL retrieval request to Spark JDBC, the SQL statement is parsed into an execution plan, from which Spark RDDs are generated; the RDDs are converted into a DAG, from which Spark Stages are generated; finally, the Stages produce a Spark Task set. A Spark Task is a task structure generated in Spark that can be executed in a distributed, concurrent manner, and it is the smallest scheduling unit of task execution in Spark. In native Spark, the execution of retrieval SQL cannot be scheduled precisely, and the Spark Tasks generated by retrieval SQL can only be executed in order. This cannot meet the JDBC platform's need for quota control and priority control over each service user.

In summary, as the level of informatization continues to rise, big data is applied more and more widely. For example, in network security, big data technology is used to analyze network attack behavior; in e-commerce, it is used to analyze users' shopping preferences or most-favored goods. Big data technology plays a positive role in building a conservation-oriented society, improving production efficiency, and so on, and Spark JDBC is widely used as an excellent big data retrieval method. However, as data volumes keep growing and big data technology keeps developing, the native architecture of Spark JDBC cannot schedule resources flexibly and cannot apply priority control to retrieval SQL, which directly affects business applications.

Disclosure of Invention

The embodiment of the invention provides a method and a device for implementing task priority control for Spark JDBC, which are used to solve the above problems in the prior art.

The embodiment of the invention provides a method for implementing task priority control for Spark JDBC, which comprises the following steps:

when the Spark JDBC service starts, establishing a plurality of task priority queues according to a pre-written XML file describing the priority queues;

receiving a priority queue command issued by a user through the JDBC interface, and completing the priority setting at the JDBC session level;

receiving retrieval SQL submitted by the user, generating a Spark Task set after the SQL statement passes through several parsing and planning stages, and adding the Spark Task set to the target priority queue with the corresponding name;

and having the resource scheduler schedule and allocate hardware resources according to the resource allocation policy among the priority queues and the resource allocation policy within each queue, and dispatching the Spark Tasks to the Task executors on the computing nodes for execution.

Preferably, each task priority queue has the following attributes: name, priority level, weight, and internal resource scheduling mode.

Preferably, after the user submits the retrieval SQL, the method further comprises:

registering the retrieval SQL task with the corresponding priority queue in the task scheduler, obtaining the number of SQL statements currently running in that priority queue and its task quota, judging whether the target task priority queue exceeds its quota, and proceeding only after confirming that the quota is not exceeded; and if the target task priority queue exceeds its quota, returning to the user a message that the retrieval is refused because the concurrency limit is exceeded.

Preferably, the scheduling and allocation of hardware resources by the resource scheduler according to the resource allocation policy among the priority queues specifically comprises:

step 1, counting the available computing resources in the Spark system, including the number of CPU cores and the amount of memory;

step 2, reading the attributes of each priority queue, including its priority level and weight value;

step 3, selecting the highest level among the priority levels of the priority queues;

step 4, calculating the weight ratio of the queues whose priority level is the current highest priority level;

step 5, allocating the computing resources not yet scheduled and allocated to those priority queues according to the calculated ratio;

step 6, judging whether a priority queue with a lower priority level exists, and if so, proceeding to step 7; otherwise, ending the flow;

step 7, judging whether computing resources not yet scheduled and allocated remain, and if so, proceeding to step 4; otherwise, ending the flow;

step 8, selecting the next lower priority level.

Preferably, the scheduling and allocation of hardware resources by the resource scheduler according to the resource allocation policy within a queue specifically comprises:

step 1, reading the internal scheduling mode of the current queue, and proceeding to step 2 if the mode is first-in first-out; otherwise, proceeding to step 4;

step 2, selecting the Spark Task with the earliest creation time from the Spark Task set of the priority queue, allocating a certain amount of resources to the Task, and executing the Task;

step 3, judging whether resources not yet scheduled and allocated remain and whether Spark Tasks without allocated resources remain, and if so, proceeding to step 2; otherwise, ending the flow;

step 4, grouping the Spark Tasks in the priority queue by their original retrieval SQL;

step 5, counting the number of Tasks being executed in each Task group;

step 6, allocating a certain amount of resources to one Spark Task in the group that, according to the count in step 5, has the fewest Spark Tasks being executed;

step 7, judging whether resources not yet scheduled and allocated remain and whether Spark Tasks without allocated resources remain, and if so, proceeding to step 5; otherwise, ending the flow.

An embodiment of the present invention further provides a device for implementing task priority control for Spark JDBC, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program implementing the steps of the above method when executed by the processor.

By adopting the embodiment of the invention, the problem that Spark JDBC cannot perform precise and flexible resource scheduling during SQL retrieval is solved, and retrieval SQL can be controlled according to its priority. The method flexibly applies resource quota control and execution priority control to SQL retrieval, effectively improves the usability of the system, meets the practical requirements of current big data retrieval, and has broad application prospects.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a block diagram of an overall process architecture in an embodiment of the present invention;

FIG. 2 is a flow chart of adding retrieval SQL to a priority queue in an embodiment of the invention;

FIG. 3 is a flow chart of resource scheduling and distribution among priority queues according to an embodiment of the present invention;

FIG. 4 is a flow chart of resource scheduling and distribution inside a priority queue according to an embodiment of the present invention.

Detailed Description

The invention provides a method for implementing task priority control for Spark JDBC. The method comprises: establishing a plurality of task priority queues in Spark JDBC; mapping the retrieval SQL submitted by a user to a task priority queue in Spark JDBC, where it waits to run; setting an execution quota for each task priority queue in Spark JDBC and rejecting retrieval SQL that exceeds the quota; scheduling hardware resources among the task priority queues according to preset priorities and weights; and, within a single task priority queue, scheduling hardware resources using a first-in first-out policy or a fair policy. The embodiment of the invention makes it possible to control the use of hardware resources flexibly through the JDBC interface in actual service use, and meets the service's requirements on the execution order of urgent tasks, general tasks, and low-priority tasks.

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The embodiment of the invention provides a Spark JDBC-oriented task priority control implementation method, which controls the execution order of concurrent retrieval SQL in Spark JDBC by priority through the JDBC interface and flexibly completes scheduling control over the underlying hardware resources.

To achieve the above object, the first part of the present invention provides a method for establishing a plurality of task priority queues in Spark JDBC. The method describes each priority queue to be established in an XML file. When the Spark JDBC system runs, it reads the XML description file written in advance according to service requirements and generates a task priority scheduler within the Spark JDBC framework. Each generated task priority queue has several attributes, such as a queue name, a priority level, an execution weight, an intra-queue task scheduling mode, and a queue quota.
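
As a minimal sketch of how such a description file might be read, the Python snippet below parses a hypothetical XML layout into queue objects. The element and attribute names (`priorityQueues`, `queue`, `priority`, `weight`, `schedulingMode`, `quota`), the mode names `FIFO` and `FAIR`, and the convention that a larger number means a higher priority are all assumptions made for illustration; the patent does not specify the file schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict

# Hypothetical queue description file; the schema is an assumption,
# not something defined by the patent.
QUEUE_XML = """
<priorityQueues>
  <queue name="urgent">
    <priority>3</priority>
    <weight>2</weight>
    <schedulingMode>FIFO</schedulingMode>
    <quota>4</quota>
  </queue>
  <queue name="general">
    <priority>2</priority>
    <weight>1</weight>
    <schedulingMode>FAIR</schedulingMode>
    <quota>8</quota>
  </queue>
</priorityQueues>
"""


@dataclass
class PriorityQueueSpec:
    name: str
    priority: int          # assumed: larger value = higher priority
    weight: int            # share among queues of equal priority
    scheduling_mode: str   # "FIFO" or "FAIR" inside the queue
    quota: int             # maximum concurrently running SQL statements


def load_queue_specs(xml_text: str) -> Dict[str, PriorityQueueSpec]:
    """Parse the description file into one spec per queue, keyed by name."""
    root = ET.fromstring(xml_text.strip())
    specs: Dict[str, PriorityQueueSpec] = {}
    for node in root.findall("queue"):
        mode = node.findtext("schedulingMode", default="FIFO").strip().upper()
        if mode not in ("FIFO", "FAIR"):
            raise ValueError(f"unknown scheduling mode: {mode}")
        spec = PriorityQueueSpec(
            name=node.get("name"),
            priority=int(node.findtext("priority")),
            weight=int(node.findtext("weight")),
            scheduling_mode=mode,
            quota=int(node.findtext("quota")),
        )
        specs[spec.name] = spec
    return specs


queues = load_queue_specs(QUEUE_XML)
```

Keying the result by queue name mirrors the later steps, in which a session refers to its target queue by name.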

The second part of the invention provides a method for mapping the retrieval SQL submitted by the user to a task priority queue in Spark JDBC, where it waits to run. Under this method, the user must invoke a set-queue method through the JDBC interface before submitting a retrieval task through that interface. Executing this method sets, at the JDBC session level, the name of the target queue in which retrieval SQL will be executed. When the JDBC session subsequently receives retrieval SQL, the SQL is parsed and planned, a Spark Task set is finally generated, and all tasks in that set are placed in the task priority queue of the corresponding name created by the first part of the invention. This completes the mapping from the issued retrieval SQL to the task priority queue.

The third part of the invention provides a method for setting an execution quota for each task priority queue in Spark JDBC and rejecting retrieval SQL that exceeds the quota. The method maintains the real-time number of retrieval SQL statements being executed in each task queue. After JDBC receives retrieval SQL issued by a user, the method registers it in this real-time count, reads the queue quota attribute of the corresponding task priority queue created by the first part of the invention, judges whether the execution quota is exceeded, and rejects retrieval SQL that exceeds the quota. In this way, the number of concurrent tasks in each priority queue can be controlled effectively, so that tasks in high-priority queues can be executed quickly.

The fourth part of the invention provides a method for scheduling hardware resources among task priority queues according to preset priorities and weights. The method first collects the number of distributed CPU cores and the amount of memory owned by the Spark system, and traverses the attributes of each priority queue described in the first part of the invention. When scheduling resources, the task priority scheduler allocates resources first to the priority queue with the higher priority level, and the queue that receives the resources distributes them internally. If resources remain after that queue's internal distribution, the task priority scheduler allocates the remainder to the priority queue of the next priority level, and so on level by level until the resources are fully allocated. If two priority queues have the same priority level, the proportion of their weight attributes is calculated, and resources are scheduled and allocated to the two queues according to that proportion. In this way, hardware resources are allocated effectively to each priority queue according to its preset priority level.
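
The proportional split among queues of the same priority level can be written as a short formula. The symbols are introduced here for illustration and do not appear in the patent: $R$ is the amount of a resource not yet allocated, $Q_p$ is the set of queues at the current priority level $p$, $w_i$ is the weight of queue $i$, and $r_i$ is the share offered to queue $i$.

```latex
r_i = R \cdot \frac{w_i}{\sum_{j \in Q_p} w_j}, \qquad i \in Q_p
```

For example, with $R = 12$ CPU cores and two queues of weights 2 and 1 at the same level, the shares are $12 \cdot 2/3 = 8$ and $12 \cdot 1/3 = 4$ cores. How fractional shares are rounded is not stated in the patent.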

The fifth part of the invention provides a method for scheduling hardware resources within a single task priority queue using a first-in first-out policy or a fair policy. The method first reads the intra-queue scheduling mode from the priority queue attributes described in the first part of the invention. After the task priority scheduler has allocated resources to the queue according to the fourth part of the invention, the first-in first-out policy allocates resources to Spark Tasks in order of their generation time in the priority queue, the earliest-generated Spark Task first and then the next; the fair policy groups the Spark Tasks in the priority queue by retrieval SQL and allocates resources to the Spark Tasks of each group in turn, ensuring that the Spark Tasks of all groups proceed at the same time. In this way, the order of task execution can be controlled flexibly according to actual service requirements.

The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following preferred embodiments are described in detail with reference to the accompanying drawings.

Fig. 1 shows the architecture of the overall process.

In one embodiment, the specific architecture is as follows:

When the Spark JDBC service starts, a plurality of task priority queues are established according to a pre-written XML file describing the priority queues, and each queue has several attributes such as a name, a priority level, a weight, and an internal resource scheduling mode. Before submitting retrieval SQL, the user sets the session-level target priority queue name through the JDBC interface. After the user submits retrieval SQL, it is first registered, and the system judges whether the corresponding target queue exceeds its quota. After several parsing and planning stages, the SQL statement yields a Spark Task set, which is added to the priority queue of the corresponding name. The resource scheduler then schedules and allocates hardware resources according to the resource allocation policy among the priority queues and the resource allocation policy within each queue, and dispatches the Spark Tasks to the Task executors on the computing nodes for execution.

Fig. 2 shows the flow of adding retrieval SQL to a priority queue.

In one embodiment, the specific architecture is as follows:

Step 201: Receive the command specifying a priority queue, issued by the user through the JDBC interface, and complete the priority setting at the JDBC session level.

Step 202: Receive the retrieval SQL statement issued by the user through the JDBC interface.

Step 203: Register the retrieval SQL task with the corresponding priority queue in the task scheduler, and obtain the number of SQL statements currently running in that priority queue and its task quota.

Step 204: Judge whether the number of retrieval SQL statements being executed in the corresponding priority queue exceeds the quota. If not, proceed to step 205; otherwise, proceed to step 207.

Step 205: Pass the received SQL statement through the Spark parsing and planning process to generate a Spark Task set.

Step 206: Add all elements of the generated Spark Task set to the priority queue specified in step 201.

Step 207: Return to the user a message that the retrieval is refused because the concurrency limit is exceeded.
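
Steps 201 to 207 can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the patent's implementation: the `Session`, `TaskQueue`, `QuotaExceededError`, and `plan_tasks` names are hypothetical, `plan_tasks` merely stands in for Spark's parsing and planning, and a statement is rejected when the running count has already reached the quota (the patent says only that statements exceeding the quota are refused). Decrementing the count when a statement finishes is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class QuotaExceededError(Exception):
    """Raised when a queue already runs as many SQL statements as its quota."""


@dataclass
class TaskQueue:
    name: str
    quota: int                 # maximum concurrently running SQL statements
    running_sql: int = 0       # real-time count of running SQL statements
    tasks: List[str] = field(default_factory=list)


def plan_tasks(sql: str, num_tasks: int = 3) -> List[str]:
    """Placeholder for Spark's parsing and planning; returns fake task ids."""
    return [f"{sql}#task{i}" for i in range(num_tasks)]


class Session:
    """Hypothetical JDBC session holding the session-level target queue name."""

    def __init__(self, queues: Dict[str, TaskQueue]) -> None:
        self.queues = queues
        self.queue_name: Optional[str] = None

    def set_queue(self, name: str) -> None:
        # Step 201: bind this session to a named priority queue.
        if name not in self.queues:
            raise KeyError(f"unknown priority queue: {name}")
        self.queue_name = name

    def submit(self, sql: str) -> List[str]:
        # Step 202: receive the retrieval SQL for this session.
        if self.queue_name is None:
            raise RuntimeError("no priority queue set for this session")
        queue = self.queues[self.queue_name]
        # Steps 203-204: compare the running count with the quota.
        if queue.running_sql >= queue.quota:
            # Step 207: refuse the retrieval.
            raise QuotaExceededError(f"queue {queue.name} is at its quota")
        queue.running_sql += 1
        # Step 205: generate the Spark Task set (stubbed here).
        tasks = plan_tasks(sql)
        # Step 206: add every task to the session's target queue.
        queue.tasks.extend(tasks)
        return tasks


queues = {"urgent": TaskQueue("urgent", quota=1)}
session = Session(queues)
session.set_queue("urgent")
first = session.submit("SELECT 1")
try:
    session.submit("SELECT 2")
    rejected = False
except QuotaExceededError:
    rejected = True
```

With a quota of 1, the first statement is admitted and the second is refused while the first is still counted as running.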

Fig. 3 shows a flow of resource scheduling distribution among priority queues.

In one embodiment, the specific architecture is as follows:

Step 301: Count the available computing resources in the Spark system, including the number of CPU cores and the amount of memory.

Step 302: Read the attributes of each priority queue, including its priority level and weight value.

Step 303: Select the highest level among the priority levels of the priority queues.

Step 304: Calculate the weight ratio of the queues whose priority level is the current highest priority level.

Step 305: Allocate the computing resources not yet scheduled and allocated to those priority queues according to the ratio calculated in step 304.

Step 306: Judge whether a priority queue with a lower priority level exists. If so, proceed to step 307; otherwise, end the flow.

Step 307: Judge whether computing resources not yet scheduled and allocated remain. If so, proceed to step 304; otherwise, end the flow.

Step 308: Select the next lower priority level.
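
The inter-queue flow can be sketched as below. Several simplifications are assumptions of this sketch rather than statements of the patent: only CPU cores are modeled (memory is omitted), each queue carries a hypothetical `demand` standing in for what its pending Spark Tasks can use, fractional shares are rounded down, a larger `priority` number means a higher priority, and the next lower level is selected (step 308) before the weight ratio is recomputed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueueState:
    name: str
    priority: int   # assumed: larger value = higher priority
    weight: int     # assumed positive
    demand: int     # cores the queue's pending tasks could use (hypothetical)


def allocate_between_queues(total_cores: int,
                            queues: List[QueueState]) -> Dict[str, int]:
    """Give cores to queues level by level, split by weight within a level."""
    allocation = {q.name: 0 for q in queues}
    remaining = total_cores                                   # step 301
    # Steps 302-303: read the attributes and order levels, highest first.
    levels = sorted({q.priority for q in queues}, reverse=True)
    for level in levels:                                      # step 308
        if remaining <= 0:                                    # step 307
            break
        peers = [q for q in queues if q.priority == level]
        total_weight = sum(q.weight for q in peers)           # step 304
        budget = remaining
        for q in peers:                                       # step 305
            share = budget * q.weight // total_weight
            granted = min(share, q.demand)   # unused share flows downward
            allocation[q.name] = granted
            remaining -= granted
    return allocation


result = allocate_between_queues(15, [
    QueueState("urgent_a", priority=3, weight=2, demand=4),
    QueueState("urgent_b", priority=3, weight=1, demand=10),
    QueueState("general", priority=2, weight=1, demand=10),
    QueueState("low", priority=1, weight=1, demand=10),
])
```

In this example the two level-3 queues are offered 10 and 5 of the 15 cores; `urgent_a` needs only 4, so 6 cores pass down to `general`, and nothing is left for `low`.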

Fig. 4 shows a flow of scheduling and distributing resources inside the priority queue.

In one embodiment, the specific architecture is as follows:

Step 401: Read the internal scheduling mode of the current queue. If the mode is "first in first out", proceed to step 402; otherwise, proceed to step 404.

Step 402: Select the Spark Task with the earliest creation time from the Spark Task set of the priority queue, allocate a certain amount of resources to the Task, and execute the Task.

Step 403: Judge whether resources not yet scheduled and allocated remain and whether Spark Tasks without allocated resources remain. If so, proceed to step 402; otherwise, end the flow.

Step 404: Group the Spark Tasks in the priority queue by their original retrieval SQL.

Step 405: Count the number of Tasks being executed in each Task group formed in step 404.

Step 406: Allocate a certain amount of resources to one Spark Task in the group that, according to the count in step 405, has the fewest Spark Tasks being executed.

Step 407: Judge whether resources not yet scheduled and allocated remain and whether Spark Tasks without allocated resources remain. If so, proceed to step 405; otherwise, end the flow.
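
The two intra-queue modes can be sketched as follows. The `SparkTaskStub` type, the mode names `FIFO` and `FAIR`, and the modeling of "a certain amount of resources" as one slot per task are assumptions of this sketch; ties between groups are broken in favor of the group whose first task was created earliest, which the patent does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SparkTaskStub:
    task_id: str
    sql_id: str       # the retrieval SQL that generated this task
    created_at: int   # creation time (any monotonically increasing value)


def schedule_within_queue(mode: str,
                          pending: List[SparkTaskStub],
                          slots: int,
                          running: Optional[Dict[str, int]] = None) -> List[str]:
    """Return the ids of the tasks to launch, in launch order."""
    ordered = sorted(pending, key=lambda t: t.created_at)
    launched: List[str] = []
    if mode == "FIFO":                                   # step 401
        # Steps 402-403: earliest-created task first, while slots remain.
        for task in ordered:
            if slots == 0:
                break
            launched.append(task.task_id)
            slots -= 1
        return launched
    # Step 404: group the tasks by their originating retrieval SQL.
    groups: Dict[str, List[SparkTaskStub]] = {}
    for task in ordered:
        groups.setdefault(task.sql_id, []).append(task)
    counts = dict(running or {})
    while slots > 0 and any(groups.values()):            # step 407
        # Step 405: find the group with the fewest executing tasks.
        sql_id = min((s for s in groups if groups[s]),
                     key=lambda s: counts.get(s, 0))
        # Step 406: give one slot to one task of that group.
        task = groups[sql_id].pop(0)
        launched.append(task.task_id)
        counts[sql_id] = counts.get(sql_id, 0) + 1
        slots -= 1
    return launched


pending = [
    SparkTaskStub("a1", "sqlA", 1),
    SparkTaskStub("a2", "sqlA", 2),
    SparkTaskStub("a3", "sqlA", 3),
    SparkTaskStub("b1", "sqlB", 4),
]
fifo_order = schedule_within_queue("FIFO", pending, slots=3)
fair_order = schedule_within_queue("FAIR", pending, slots=3)
```

With three slots, first-in first-out launches only tasks of the first SQL statement, while the fair mode also gives the later statement a slot, which illustrates why the fair policy keeps every group progressing.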

In summary, the embodiment of the present invention conveniently and effectively controls the execution priority of retrieval SQL through the JDBC interface, flexibly controls the allocation and scheduling of the underlying hardware resources, and supports task quota control for the execution queues. It effectively improves the usability of the Spark JDBC framework for retrieval, is highly practical and broadly applicable in the field of big data processing, and has wide application prospects.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for implementing task priority control for Spark JDBC, characterized by comprising the following steps:

2. The method of claim 1, wherein each task priority queue comprises: a name, a priority level, a weight, and an internal resource scheduling mode.

3. The method of claim 1, wherein after a user submits retrieval SQL, the method further comprises:

4. The method of claim 1, wherein the scheduling and allocation of hardware resources by the resource scheduler according to the resource allocation policy among the priority queues specifically comprises:

step 6, judging whether a priority queue with a lower priority level exists, and if so, proceeding to step 7; otherwise, ending the flow;

step 8, selecting the next lower priority level.

5. The method of claim 1, wherein the scheduling and allocation of hardware resources by the resource scheduler according to the resource allocation policy within a queue specifically comprises:

step 5, counting the number of Tasks being executed in each Task group;

step 6, allocating a certain amount of resources to one Spark Task in the group that, according to the count in step 5, has the fewest Spark Tasks being executed;

6. A device for implementing task priority control for Spark JDBC, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program implementing the steps of the method of any one of claims 1 to 5 when executed by the processor.