CN112256418B - Big data task scheduling method

Info

Publication number: CN112256418B
Application number: CN202011157921.0A
Authority: CN
Inventors: 胡亚军; 邵若梅; 孙树清
Original assignee: Shenzhen International Graduate School of Tsinghua University
Current assignee: Shenzhen International Graduate School of Tsinghua University
Priority date: 2020-10-26
Filing date: 2020-10-26
Publication date: 2023-10-24
Anticipated expiration: 2040-10-26
Also published as: CN112256418A

Abstract

The invention discloses a big data task scheduling method comprising the following steps: S1, dividing a plurality of big data analysis tasks into a plurality of priorities, placing the big data analysis tasks with the same priority into the same task group, and determining the complexity of each big data analysis task in each task group; S2, constructing, in a Hadoop computing cluster, a task scheduling subprogram based on a neural network using a cyclic scheduling learning algorithm, the task scheduling subprogram allocating computing resources of the Hadoop computing cluster to each big data analysis task according to priority and complexity. The invention enables the computing cluster to reach an optimal running state during big data analysis, solves the problem of computing tasks excessively preempting one another's resources, and ensures that computing resources are fully utilized by reclaiming the computing resources of the Hadoop cluster in a timely manner.

Description

Big data task scheduling method

Technical Field

The invention relates to the field of big data intelligent processing methods, in particular to a big data task scheduling method.

Background

As the world enters the 5G era, data is increasingly becoming a gold mine for enterprises. Extracting value from this gold mine requires big data analysis technology: the computing power of a server cluster is used to produce various data reports, through which the related business can be understood intuitively and clearly. As data volume grows from GB to TB and even PB, a very large big data cluster is required to meet the analysis demand, and the number of analysis requirements also grows from a few to tens and then hundreds.

Currently, in the field of big data analysis, non-sensitive user behavior data is collected where legally permitted, and big data technology is used to analyze and learn from this TB-level or even PB-level data, which calls for big data analysis technology from the Hadoop ecosystem. Because the business requires big data analysis along every dimension every day, most analysis tasks have a time dimension, such as month, week, day, or hour. The larger the time dimension, the more data must be analyzed at one time, and the more computing resources are needed to obtain the analysis result within a given time.

In the prior art, a Hadoop computing cluster is started, and Presto is used to trigger each computing task after a specific time point each day. This approach has several disadvantages. On the one hand, computing tasks preempt one another's resources, so that some analysis tasks ultimately fail to produce results because of insufficient computing resources. On the other hand, because a cluster of fixed size is started, and analysis tasks generally start running in the early morning with results needed in the morning, the cluster runs at almost full load for one period but sits idle for more than half of the time, which wastes resources. Moreover, hundreds of tasks each need resources to run. If the tasks are not differentiated, some relatively important tasks cannot produce their analysis results within the expected time, while less important or less urgent tasks obtain more resources and finish quickly, which causes great trouble and inconvenience for big data analysis.

Disclosure of Invention

The invention aims to provide a big data task scheduling method that solves the prior-art problem of poor utilization of computing resources during big data analysis tasks and completes the most big data analysis work with the fewest machines.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

A big data task scheduling method comprises the following steps:

S1, dividing a plurality of big data analysis tasks into a plurality of priorities according to the importance of each task, placing the big data analysis tasks with the same priority into the same group to obtain a plurality of task groups, and determining the complexity of each big data analysis task in each task group;

S2, constructing, in a Hadoop computing cluster, a task scheduling subprogram based on a neural network using a cyclic scheduling learning algorithm, the task scheduling subprogram allocating computing resources of the Hadoop computing cluster to each big data analysis task for task analysis, wherein the allocation process of the task scheduling subprogram is as follows:

allocating computing resources to the task groups according to priority, wherein the amount of computing resources allocated decreases in order of priority from high to low;

in each task group, according to the complexity of each big data analysis task, each big data analysis task whose complexity is greater than a preset threshold is analyzed with exclusive use of the computing resources allocated to the group, and after the analysis of these tasks is completed, the remaining big data analysis tasks are analyzed using the allocated computing resources.

Optionally, in some embodiments:

In the big data task scheduling method, in step S1, the priorities of the big data analysis tasks are assigned according to their importance to the business: the higher the importance, the higher the priority.

In the big data task scheduling method, in step S1, the priorities of the big data analysis tasks are assigned according to the importance of their analysis conclusions: the higher the importance, the higher the priority.

In the big data task scheduling method, in step S1, the complexity of each big data analysis task is determined according to the amount of computing resources that the task would theoretically need to occupy to complete its analysis within the same time period: the more computing resources occupied, the higher the complexity.

In the big data task scheduling method, in step S1, the complexity of each big data analysis task is determined according to the time complexity and space complexity of the code required to complete the analysis of the task and the total amount of data that the code must access.

In the big data task scheduling method, in step S2, within each task group, the big data analysis tasks whose complexity is greater than the preset threshold are analyzed one after another in serial order, each with exclusive use of the allocated computing resources; after the analysis of these tasks is completed, the remaining big data analysis tasks are analyzed in serial or in parallel using the allocated computing resources.

In the big data task scheduling method, after all big data tasks in each task group are analyzed in step S2, the corresponding allocated computing resources are released and used for analyzing other big data analysis tasks.

In the big data task scheduling method, in step S2, after analysis of each big data analysis task is completed by using computing resources in the Hadoop computing cluster, the task execution condition is fed back to the task scheduling sub-program, and the task scheduling sub-program performs self-learning according to the task execution condition, so that a new task scheduling sub-program is obtained for subsequent computing resource allocation.

In the field of mobile phones, a correct understanding of users needs to be obtained in an effective way. Non-sensitive user behavior data is therefore collected where legally permitted, and big data technology is used to analyze and learn from this TB-level or even PB-level data, which calls for big data analysis technology from the Hadoop ecosystem. Because big data analysis along every dimension is performed almost every day, most analysis tasks have a time dimension, such as month, week, day, or hour. The larger the time dimension, the more data must be analyzed at one time, and the more computing resources are needed if the result must be obtained within a given time.

In the past, a Hadoop computing cluster was started, and Presto was used to trigger each computing task after a specific time point each day. This approach has several disadvantages. On the one hand, computing tasks preempt one another's resources, so that some analysis tasks ultimately fail to produce results because of insufficient computing resources. On the other hand, because a cluster of fixed size is started, and analysis tasks generally start running in the early morning with results needed by 8 a.m., the cluster runs at almost full load for one period but sits idle for more than half of the time, which wastes resources. Moreover, hundreds of tasks each need resources to run, and in the past the tasks were not differentiated. Some relatively important tasks therefore could not produce their analysis results within the expected time, while less important or less urgent tasks obtained more resources and finished quickly, which caused great trouble and inconvenience for those who review and analyze daily operating conditions.

In the invention, a priority is determined for every big data analysis task, and the task scheduling subprogram schedules the big data analysis tasks systematically and effectively, so that high-priority tasks obtain computing resources first and their resources are not preempted by low-priority tasks. In addition, through intelligent self-learning, the task scheduling subprogram analyzes how the cluster and each big data analysis task behave at run time and automatically moves tasks into and out of the serial or parallel queues according to task priority and resource demand. This ensures that the analysis tasks run effectively and that the big data cluster resources are used as efficiently as possible, so that the computing cluster reaches an optimal running state during big data analysis.

According to the method, the big data analysis work is decomposed into tasks that run independently, in batches, in the Hadoop computing cluster. A priority is defined for each big data analysis task, and groups with higher priority are allocated more computing resources in the Hadoop computing cluster. Both serial and parallel execution are supported: big data analysis tasks that are important, highly complex, and in need of more resources use the computing resources exclusively (the serial mode), and after they finish, the other big data analysis tasks run in series or in parallel using the same computing resources. After all big data analysis tasks of a group have finished, the allocated computing resources are reclaimed for other computing services. This solves the problem of computing tasks excessively preempting resources and, through timely reclamation of the Hadoop cluster's computing resources, ensures that the computing resources are fully utilized.
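The priority-ordered allocation across task groups can be sketched as follows. This is only an illustration under stated assumptions, not the scheduler of the invention: the function name `allocate_by_priority`, the notion of a "slot" as the unit of computing resource, and the geometric `decay` factor are all hypothetical. The text requires only that higher-priority groups receive more resources than lower-priority ones.

```python
def allocate_by_priority(total_slots, priorities, decay=0.5):
    """Split total_slots among priority groups, highest priority first.

    priorities: iterable of priority values (larger = more important, assumed).
    decay: hypothetical ratio between the shares of consecutive groups.
    Returns {priority: slots}; shares shrink as priority drops.
    """
    ordered = sorted(set(priorities), reverse=True)
    weights = [decay ** rank for rank in range(len(ordered))]
    total_weight = sum(weights)
    shares = {
        p: int(total_slots * w / total_weight)
        for p, w in zip(ordered, weights)
    }
    # Hand any slots lost to rounding to the highest-priority group.
    if ordered:
        shares[ordered[0]] += total_slots - sum(shares.values())
    return shares


allocation = allocate_by_priority(70, [3, 1, 2])
print(allocation)
```

With 70 slots and three groups, the highest-priority group receives the largest share and the total always equals the number of slots available.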

Drawings

FIG. 1 is a flow chart of a method according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of the core algorithm of the task scheduling service according to an embodiment of the present invention.

FIG. 3 is a chart comparing the success rates of the new and old task scheduling services according to an embodiment of the present invention.

FIG. 4 is a chart comparing the costs of the new and old task scheduling services according to an embodiment of the present invention.

Detailed Description

The invention will be further described with reference to the drawings and examples.

As shown in FIG. 1, a big data task scheduling method includes the following steps:

S1, dividing a plurality of big data analysis tasks into a plurality of priorities according to the importance of each task, so that each big data analysis task has its own priority, and placing the big data analysis tasks with the same priority into the same group to obtain a plurality of task groups.

In step S1, the priorities of the big data analysis tasks may be assigned according to their importance to the business: the higher the importance, the higher the priority.

In step S1, the priorities of the big data analysis tasks may also be assigned according to their analysis conclusions: the more important a conclusion is for guiding the business, the higher the priority.
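Step S1 amounts to bucketing tasks by their priority value. A minimal sketch follows; the `AnalysisTask` type, the task names, and the convention that a larger integer means a higher priority are assumptions made for illustration and do not come from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisTask:
    name: str
    priority: int  # larger value = more important (assumed convention)


def group_by_priority(tasks):
    """Return {priority: [tasks]} ordered from highest to lowest priority."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task.priority].append(task)
    return {p: groups[p] for p in sorted(groups, reverse=True)}


tasks = [
    AnalysisTask("daily_active_users", 3),
    AnalysisTask("monthly_retention", 2),
    AnalysisTask("daily_revenue", 3),
]
task_groups = group_by_priority(tasks)
for priority, members in task_groups.items():
    print(priority, [t.name for t in members])
```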

The remaining steps are explained as follows:

S2, determining the complexity of each big data analysis task in each task group.

In step S2, the complexity of each big data analysis task may be determined according to the amount of computing resources that the task would theoretically need to occupy to complete its analysis within the same time period, for example the memory, CPU, storage, and number of computers required: the more computing resources occupied, the higher the complexity.

In step S2, the complexity of each big data analysis task may also be determined according to the time complexity and space complexity of the code required to complete the analysis of the task and the total amount of data that the code must access.

S3, constructing, in the Hadoop computing cluster, a task scheduling subprogram based on a neural network using a cyclic scheduling learning algorithm, the task scheduling subprogram allocating computing resources of the Hadoop computing cluster to each big data analysis task for task analysis.
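For the resource-based way of determining complexity, one possible scoring rule is sketched below. The patent does not give a formula, so everything here is an assumption: the `ResourceEstimate` fields mirror the memory, CPU, storage, and machine count mentioned in the text, the `REFERENCE` capacities are invented for normalization, and the equal-weight average is just one simple choice that preserves the stated rule that more resources means higher complexity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceEstimate:
    memory_gb: float
    cpu_cores: float
    storage_gb: float
    machines: int


# Hypothetical reference capacities, used only to put each dimension
# on a comparable scale before averaging.
REFERENCE = ResourceEstimate(memory_gb=256, cpu_cores=64, storage_gb=10_000, machines=10)


def complexity_score(est, ref=REFERENCE):
    """Average of per-dimension demand ratios; more resources gives a higher score."""
    ratios = [
        est.memory_gb / ref.memory_gb,
        est.cpu_cores / ref.cpu_cores,
        est.storage_gb / ref.storage_gb,
        est.machines / ref.machines,
    ]
    return sum(ratios) / len(ratios)


light = complexity_score(ResourceEstimate(32, 8, 500, 1))
heavy = complexity_score(ResourceEstimate(192, 48, 8_000, 8))
print(light < heavy)
```

A score of this kind can then be compared against the preset threshold used when scheduling within a task group.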

As shown in FIG. 2, in the invention the task scheduling subprogram is constructed based on a neural network using a cyclic scheduling learning (CSL) algorithm. The task scheduling subprogram performs intelligent grouping adjustment and resource allocation for the big data analysis tasks according to the predefined task priorities, learns from how the tasks execute, and then tries and adjusts its own scheduling accordingly. The next time the tasks run, the latest task scheduling subprogram is used to adjust the grouping and allocate resources again. Through this cycle of learning and adjustment, resource usage is eventually maximized while the big data analysis tasks still produce the required results as expected. On the one hand, this improves the efficiency of big data analysis (as shown in FIG. 3, the time for big data analysis is reduced and efficiency is improved); on the other hand, it saves computing cost (as shown in FIG. 4).

In the invention, a data warehouse and a task warehouse are also built in the Hadoop computing cluster. The data warehouse, also called a data lake, collects the various data of the current business system, including structured data (such as MySQL databases) and unstructured data (such as pictures, videos, and log files). The task warehouse records the detailed information of all current data analysis tasks and can be stored in a common relational database (such as MySQL or SQL Server); an administrator can add, delete, modify, and query entries in the task warehouse at any time.

In the invention, the Hadoop computing cluster supports various offline technical frameworks, such as Hive, Presto, and Impala.

The task scheduling subprogram is the core of the scheduling. It mainly uses its AI capability to group big data analysis tasks intelligently and allocate resources to them according to the tasks' characteristics and metrics, thereby ensuring that tasks run correctly and resources are used reasonably.

The allocation process of the task scheduling subprogram of the invention is as follows:

In the invention, within each task group, the big data analysis tasks whose complexity is greater than a preset threshold are analyzed one after another in serial order, each with exclusive use of the allocated computing resources; after the analysis of these tasks is completed, the remaining big data analysis tasks are analyzed in serial or in parallel using the allocated computing resources.
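The ordering rule inside one task group can be sketched as a list of execution stages. The function name `plan_group`, the task names, and the complexity values are hypothetical; the sketch shows the parallel variant for the remaining tasks, while the text also permits running them serially.

```python
def plan_group(tasks, threshold):
    """Build an ordered list of execution stages for one task group.

    tasks: list of (name, complexity) pairs.
    Each stage is a list of task names that run together. Tasks above the
    threshold each get a stage of their own, so they hold the group's
    resources exclusively.
    """
    heavy = [name for name, complexity in tasks if complexity > threshold]
    rest = [name for name, complexity in tasks if complexity <= threshold]
    stages = [[name] for name in heavy]  # serial, exclusive stages
    if rest:
        stages.append(rest)  # remaining tasks share the resources
    return stages


stages = plan_group(
    [
        ("weekly_report", 0.9),
        ("hourly_counts", 0.2),
        ("monthly_rollup", 0.8),
        ("daily_counts", 0.3),
    ],
    threshold=0.5,
)
print(stages)
```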

In the invention, after all big data tasks in each task group have been analyzed, the corresponding allocated computing resources are released and used for analyzing other big data analysis tasks. Releasing computing resources is made possible mainly by separating the computing framework from the data: the Hive tables of the Hadoop computing cluster are defined as external tables, and the data is stored outside the Hadoop computing cluster, so the data is not affected when the cluster's computing resources are released.
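As an illustration of the external-table approach, the helper below renders a HiveQL `CREATE EXTERNAL TABLE` statement. The table name, columns, file format, and storage path are hypothetical examples, and the patent does not specify where the external data lives. In Hive, dropping an external table removes only the table metadata and leaves the files at `LOCATION` in place, which is what allows compute nodes to be released without touching the data.

```python
def external_table_ddl(table, columns, location, file_format="ORC"):
    """Render a HiveQL CREATE EXTERNAL TABLE statement.

    columns: list of (column_name, hive_type) pairs.
    location: storage path that lives outside the compute cluster.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{location}'"
    )


ddl = external_table_ddl(
    "user_events",
    [("user_id", "BIGINT"), ("event", "STRING"), ("ts", "TIMESTAMP")],
    "s3a://example-bucket/warehouse/user_events",  # hypothetical path
)
print(ddl)
```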

In the invention, after analysis of each big data analysis task is completed by using computing resources in the Hadoop computing cluster, the task execution condition is fed back to the task scheduling sub-program, and the task scheduling sub-program carries out self-learning according to the task execution condition, so that a new task scheduling sub-program is obtained for subsequent computing resource allocation.
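The feedback cycle can be illustrated with a deliberately simple rule. Note that this is a stand-in and not the neural network of the invention, whose structure the text does not disclose: a fixed integer step rule replaces the learned model, and the names `next_allocation`, `used_fraction`, and `low_usage` are hypothetical. The sketch shows only the loop shape, in which each run's outcome changes the allocation used for the next run.

```python
def next_allocation(slots, succeeded, used_fraction, low_usage=0.5, min_slots=1):
    """Return the number of slots to give a task on its next run.

    succeeded: whether the run finished in time with its allocation.
    used_fraction: peak share of the allocation the run actually used (0..1).
    """
    if not succeeded:
        # The run was starved: grow the allocation by half (at least one slot).
        return slots + max(1, slots // 2)
    if used_fraction < low_usage:
        # The run was over-provisioned: reclaim a tenth (at least one slot).
        return max(min_slots, slots - max(1, slots // 10))
    return slots


estimate = 10
history = [(False, 1.0), (True, 0.4), (True, 0.8)]  # (succeeded, used_fraction)
for succeeded, used in history:
    estimate = next_allocation(estimate, succeeded, used)
print(estimate)
```

In this run the allocation grows after the failure, shrinks slightly once the task is seen to underuse it, and then stays stable.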

The invention provides a task scheduling system that takes into account both the reliability and effectiveness of scheduling and the room for resource optimization and cost savings that resource scheduling brings. When a big data analysis task is created, its initial attributes are set, as shown in Table 1.

Table 1. Initial attribute settings of a big data task

The task scheduling subprogram predicts the initial resources to allocate, and the results of running the big data analysis tasks are fed into the neural network of the task scheduling subprogram. Because whether a big data analysis task runs successfully is also affected by its code, the logs of failed tasks must be analyzed so that failures caused by errors in the task code are filtered out. The neural network of the task scheduling subprogram continuously learns and adjusts itself, improving the effectiveness of resource allocation and task scheduling. The neural network is not fixed; it is a dynamic, self-learning process in which the layers and parameters of the network adjust themselves as data is continuously fed in, so that the adaptability of the task scheduling subprogram keeps improving. As a result, the correct running rate of big data analysis tasks reaches 99.9% (excluding failures caused by errors in the task code), and, through timely reclamation of computing resources, the computing cost is reduced by 50% compared with the prior art while supporting the same big data analysis tasks.
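The filtering step and the way the correct running rate excludes code errors can be sketched as follows. The `RunRecord` type and the `failure_cause` labels are assumptions; the sketch takes the cause of each failure as already known and does not attempt to infer it from logs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    task: str
    succeeded: bool
    # "code" or "resources"; assumed to be labeled before this step.
    failure_cause: Optional[str] = None


def training_records(records):
    """Drop failures caused by task code so they do not distort resource learning."""
    return [r for r in records if r.succeeded or r.failure_cause != "code"]


def correct_run_rate(records):
    """Success rate over all runs except failures caused by task code."""
    kept = training_records(records)
    return sum(r.succeeded for r in kept) / len(kept) if kept else 0.0


runs = [
    RunRecord("a", True),
    RunRecord("b", False, "code"),
    RunRecord("c", False, "resources"),
    RunRecord("d", True),
]
rate = correct_run_rate(runs)
print(f"{rate:.3f}")
```

Only the resource-related failure counts against the rate here, so two successes out of three retained runs are reported.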

The proposed cyclic scheduling learning (CSL) algorithm greatly improves task execution efficiency and halves the cluster size needed to support analysis tasks of the same scale, which greatly reduces computing cost. In addition, the added dimension of task weight ensures that core analysis tasks obtain resources first when scheduled, so that important analysis results are output quickly. The system still has shortcomings. For example, when an analysis task fails to run normally because of a code error and the system cannot accurately identify this cause from the error log, the model adjustment of the CSL algorithm is disturbed and the accuracy of the task scheduling system suffers. At present, the cause of an abnormal run is determined manually, and failures not caused by cluster computing resources are filtered out. Furthermore, the analysis time of a big data analysis task is affected not only by cluster resource allocation but also by changes in data volume, the quality of the analysis program code, and other conditions, so more influencing factors need to be considered. These are the problems to be optimized and solved next.

The embodiments described above are merely preferred embodiments of the invention and are not intended to limit its spirit and scope. Various modifications and improvements made by those skilled in the art to the technical solutions of the invention fall within the protection scope of the invention, and the technical content for which protection is sought is set out in full in the claims.

Claims

1. A big data task scheduling method, characterized in that the method comprises the following steps:

in each task group, according to the complexity of each big data analysis task, the big data analysis tasks whose complexity is greater than a preset threshold are each analyzed with exclusive use of the corresponding allocated computing resources, and after the analysis of these tasks is completed, the remaining big data analysis tasks are analyzed using the corresponding allocated computing resources, wherein,

in each task group, the big data analysis tasks whose complexity is greater than the preset threshold are analyzed one after another in serial order, each with exclusive use of the corresponding allocated computing resources, and after the analysis of these tasks is completed, the remaining big data analysis tasks are analyzed in serial or in parallel using the corresponding allocated computing resources.

2. The big data task scheduling method according to claim 1, wherein: in step S1, the priorities of the big data analysis tasks are assigned according to their importance to the business, and the higher the importance, the higher the priority.

3. The big data task scheduling method according to claim 1, wherein: in step S1, the priorities of the big data analysis tasks are assigned according to the importance of their analysis conclusions, and the higher the importance, the higher the priority.

4. The big data task scheduling method according to claim 1, wherein: in step S1, the complexity of each big data analysis task is determined according to the amount of computing resources that need to be occupied in theory when the analysis of each big data analysis task is completed within the same time period, and the greater the amount of computing resources occupied, the higher the complexity.

5. The big data task scheduling method according to claim 1, wherein: in step S1, the complexity of each big data analysis task is determined according to the time complexity and space complexity of the code required to complete the analysis of the task and the total amount of data that the code must access.

6. The big data task scheduling method according to claim 1, wherein: in step S2, after all big data tasks in each task group are analyzed, the corresponding allocated computing resources are released and used for analyzing other big data analysis tasks.

7. The big data task scheduling method according to claim 1, wherein: in step S2, after analysis of each big data analysis task is completed by using computing resources in the Hadoop computing cluster, the task execution condition is fed back to the task scheduling sub-program, and the task scheduling sub-program performs self-learning according to the task execution condition, so that a new task scheduling sub-program is obtained for subsequent computing resource allocation.