CN110659123B

CN110659123B - Distributed task distribution scheduling method and device based on message

Info

Publication number: CN110659123B
Application number: CN201911196296.8A
Authority: CN
Inventors: 施凡; 李阳; 宁剑; 王岩; 李振汉; 胡淼
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-03-20
Anticipated expiration: 2039-11-29
Also published as: CN110659123A

Abstract

The invention provides a message-based distributed task distribution and scheduling method and device. The method comprises the following steps: extracting the attributes of a task and dividing the task into a plurality of atomic tasks; distributing the atomic tasks to corresponding working nodes for execution; and, according to the execution status of the atomic tasks, retrying the distribution of failed atomic tasks a limited number of times. The technical scheme of the invention effectively solves the problems that prior-art task distribution scheduling lacks a task segmentation function and that its distribution scheduling control is inflexible.

Description

Distributed task distribution scheduling method and device based on message

Technical Field

The invention belongs to the field of internet and distributed systems, and relates to a distributed task distribution scheduling method based on messages.

Background

Currently, distributed detection technology is generally adopted in the field of internet information detection: working nodes are deployed in a distributed manner, and a distribution scheduling server, taking into account the number of tasks and the computing and network resources of the working nodes, schedules the working nodes to execute internet information detection tasks. Task distribution scheduling covers strategies such as system load balancing, task priority and distribution rate control, and a failed-task retry mechanism. System load balancing means adjusting the task distribution scheduling strategy according to system load, such as the resource usage of the working nodes, the number of system sessions, and the occupation of system network resources. Task priority and distribution rate control aims to achieve fine-grained task distribution and scheduling through parameters such as task priority and distribution rate. The failed-task retry mechanism retries a task a limited number of times after a specific working node fails to execute it, and is used to enhance the fault tolerance of the system. In a distributed internet information detection system, the quality of the task distribution scheduling mechanism directly affects how well the functions and performance of the system are realized.

However, existing task distribution scheduling methods generally have the following problems. First, they lack an accurate task segmentation function. Because the input parameters of internet information detection plug-ins are varied and existing methods do not divide tasks, the function logic of the detection plug-ins becomes complex and the functions of different detection plug-ins overlap, which reduces the overall efficiency of the system. Second, task distribution scheduling control is inflexible. Although existing methods adjust the distribution scheduling strategy according to the resource usage of the working nodes, they do not distinguish distribution scheduling strategies for tasks of different priorities, and they lack a management and control mechanism for failed tasks. As a result, high-priority or urgent tasks cannot be executed preferentially, and the results of failed tasks are lost or the tasks become zombie tasks that cannot end normally, which reduces task scheduling reliability and harms the user experience.

Disclosure of Invention

In order to solve the technical problem, the invention provides a distributed task distribution scheduling method based on messages, which comprises the following steps:

step 1: acquiring an overall attribute value of a task and determining the category of the task according to the overall attribute value; based on the category of the task, using a detection plug-in to detect and acquire the attributes of the task and their corresponding attribute values; extracting features from the attributes and attribute values of the task and of the historical tasks of the category to which the task belongs, so as to obtain feature vector representations of the task and of the historical tasks of the category; dividing the historical tasks of the category into at least one subclass according to the different modes used for dividing their subtasks; comparing the feature vector of the task with the mean feature vector of the historical tasks in each subclass to obtain a deviation value, and determining the subclass with the minimum deviation value; comparing that deviation value with a preset relevance threshold; if the deviation value is smaller than or equal to the preset relevance threshold, the task belongs to the subclass with the minimum deviation value, the task is segmented according to the subtask division mode of the historical tasks in that subclass, that is, segmented into a plurality of atomic tasks, and the task is marked as a historical task; if the deviation value is larger than the preset relevance threshold, a new subclass is established, the task is added to the new subclass, semantic analysis is performed on the task, the task is divided into subtasks according to the result of the semantic analysis, that is, divided into a plurality of atomic tasks, and the task is marked as a historical task;

step 2: dividing the task into a plurality of atomic tasks and synchronously distributing and scheduling the atomic tasks; message queues are established between a message server and a plurality of working nodes; the message server distributes the atomic tasks to the working nodes through the message queue, and the working nodes execute the corresponding atomic tasks;

step 3: performing a limited number of distribution retries for atomic tasks that failed to execute, according to the execution status of the atomic tasks.

Further, on the basis of the above technical solution, the category at least includes one of: uniform resource locator URL class, site class, internet protocol address IP class.

Further, on the basis of the above technical solution, the URL class includes attributes: url, method, data, referrer and html, wherein the url attribute identifies a website address, the method attribute identifies the HTTP method adopted for website access, the data attribute identifies a website input parameter, the referrer attribute identifies the referrer (HTTP Referer) field in an access message, and the html attribute identifies the html data returned by the website;

the site class includes attributes: host, host_main, IP, port, header, html, title, html_encode and metadata, wherein the host attribute identifies the site domain name, the host_main attribute identifies the site main domain name, the IP attribute identifies the site IP address, the port attribute identifies the site service port, the header attribute identifies the site access message header information, the title attribute identifies the site title, the html_encode attribute identifies the site html document encoding, and the metadata attribute identifies the site html document metadata;

the IP class includes attributes: IP and IP_info, wherein the IP attribute identifies the IP address and the IP_info attribute identifies the geographical location information of the IP address.
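As an illustration only, the three attribute sets can be pictured as plain records. In the sketch below, the attribute names come from the text above (lower-cased for Python), while the class names, the field types, and the helper `missing_attributes` are hypothetical and not part of the disclosed method.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class UrlTask:
    """URL-class input; only `url` is required as the initial input (assumption)."""
    url: str
    method: Optional[str] = None
    data: Optional[str] = None
    referrer: Optional[str] = None
    html: Optional[str] = None


@dataclass
class SiteTask:
    """Site-class input; the field types are illustrative guesses."""
    host: str
    host_main: Optional[str] = None
    ip: Optional[str] = None
    port: Optional[int] = None
    header: Optional[dict] = None
    html: Optional[str] = None
    title: Optional[str] = None
    html_encode: Optional[str] = None
    metadata: Optional[dict] = None


@dataclass
class IpTask:
    """IP-class input."""
    ip: str
    ip_info: Optional[str] = None


def missing_attributes(task) -> List[str]:
    """Names of attributes not yet filled in, i.e. still to be detected."""
    return [f.name for f in fields(task) if getattr(task, f.name) is None]
```

A record created from an initial input starts with most attributes empty; the unfilled names are the ones a detection plug-in would still need to supply.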

Further, on the basis of the above technical solution, the categories may also be customized based on a combination of the attributes.

Further, on the basis of the above technical solution, the allocating to the plurality of working nodes includes the following steps:

step 2.1: judging whether an idle working node exists or not; if yes, selecting at least one from the plurality of atomic tasks to be allocated to the idle working node;

step 2.2: judging whether there is a working node whose load is lower than a specific threshold; if so, selecting at least one of the plurality of atomic tasks and allocating it to the working node whose load is lower than the specific threshold;

step 2.3: judging whether the plurality of atomic tasks have all been distributed; if not, returning to step 2.1; otherwise, exiting the distribution operation.
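Steps 2.1 to 2.3 can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Worker` class, the fixed `cost_per_task` load increment, the default threshold of 0.8, and the early exit when no node is eligible are all hypothetical choices made so the example terminates.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    load: float = 0.0                      # 0.0 means idle (assumption)
    assigned: list = field(default_factory=list)


def _assign(worker, task, cost):
    worker.assigned.append(task)
    worker.load += cost
    return 1


def distribute(tasks, workers, load_threshold=0.8, cost_per_task=0.25):
    """Return the atomic tasks that could not be placed on any node."""
    pending = deque(tasks)
    while pending:                          # step 2.3: repeat until all are distributed
        placed = 0
        for worker in workers:              # step 2.1: idle nodes first
            if pending and worker.load == 0.0:
                placed += _assign(worker, pending.popleft(), cost_per_task)
        for worker in workers:              # step 2.2: nodes below the load threshold
            if pending and worker.load < load_threshold:
                placed += _assign(worker, pending.popleft(), cost_per_task)
        if placed == 0:                     # assumption: stop instead of spinning
            break
    return list(pending)
```

In a running system the loads would fall again as nodes finish work, so the early exit would more plausibly be a wait; it is kept here only to make the sketch self-contained.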

Further, on the basis of the above technical solution, the step of determining whether there is a working node with a task load lower than a specific threshold includes:

detecting resources consumed by the plug-ins on the working nodes;

detecting resources held by the working nodes;

calculating the load of the working node according to the detected resources consumed by the plug-ins and the resources held by the working node;

and comparing the load with the specific threshold to determine whether the task load of the working node is lower than the specific threshold.
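The text does not give a formula for combining consumed and held resources into a load value. One plausible sketch, assuming the load is the highest utilisation ratio across resource types, is shown below; the function names, the resource names, and the 0.8 default threshold are hypothetical.

```python
def node_load(consumed: dict, held: dict) -> float:
    """Highest consumed/held ratio over all resource types the node holds.

    Assumption: the most saturated resource determines the load; the source
    only says the load is calculated from consumed and held resources.
    """
    ratios = [
        consumed.get(name, 0.0) / total
        for name, total in held.items()
        if total > 0
    ]
    return max(ratios, default=0.0)


def is_below_threshold(consumed: dict, held: dict, threshold: float = 0.8) -> bool:
    """Compare the computed load with the specific threshold."""
    return node_load(consumed, held) < threshold
```

For example, a node holding 8 CPU cores, 16 GB of memory and 100 sessions, with plug-ins consuming 2 cores, 4 GB and 10 sessions, has a load of 0.25 under this definition and would be eligible for more atomic tasks.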

Further, on the basis of the above technical solution, the selecting at least one of the plurality of atomic tasks in step 2.1 and step 2.2 includes the following steps:

determining the priority of the atomic task according to the priority of the task to which the atomic task belongs;

and selecting, according to the priorities of the atomic tasks, at least one atomic task with the highest priority.
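The selection rule above, in which each atomic task inherits the priority of its parent task and the highest priority is served first, can be sketched with a heap. The class name, the convention that a larger integer means a higher priority, and first-in-first-out ordering among equal priorities are assumptions of this example.

```python
import heapq
import itertools


class PriorityTaskQueue:
    """Atomic tasks ordered by the priority of the task they belong to."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()    # tie-breaker keeps FIFO order

    def add_task(self, parent_priority: int, atomic_tasks):
        # Every atomic task inherits the priority of its parent task.
        for atomic in atomic_tasks:
            # heapq is a min-heap, so the priority is negated.
            heapq.heappush(self._heap, (-parent_priority, next(self._order), atomic))

    def pop_highest(self):
        """Remove and return one atomic task with the highest priority."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Steps 2.1 and 2.2 would then call `pop_highest()` whenever they need an atomic task to hand to a working node.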

Further, on the basis of the above technical solution, the step 3 includes:

step 3.1: setting an initial value of an execution counter for each distributed atomic task, and distributing the initial value to a working node along with the atomic task, wherein the initial value is a positive number;

step 3.2: the working node that receives the distributed atomic task executes it; if the working node fails to execute the atomic task, the execution counter value is decremented by 1 and the atomic task is sent back to the message server;

step 3.3: the message server receives the counter value and checks whether the value of the execution counter is equal to 0;

if not, the atomic task corresponding to the counter value is placed in the to-be-distributed task message queue to queue for redistribution to the working nodes, and the method returns to step 3.2;

if yes, go to step 3.4;

step 3.4: stopping distributing the atomic task.
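Steps 3.1 to 3.4 amount to a bounded retry loop driven by a per-task counter. The sketch below simulates it in a single process: an in-memory deque stands in for the message queue, and the `execute` callback stands in for a working node. The function name and the default of three attempts are assumptions.

```python
from collections import deque


def run_with_retries(atomic_tasks, execute, max_attempts=3):
    """Return (succeeded, abandoned) lists of atomic tasks.

    `execute(task)` returns True on success and False on failure.
    """
    # Step 3.1: each atomic task travels with its own execution counter.
    queue = deque((task, max_attempts) for task in atomic_tasks)
    succeeded, abandoned = [], []
    while queue:
        task, counter = queue.popleft()
        if execute(task):
            succeeded.append(task)
            continue
        counter -= 1                       # step 3.2: decrement on failure
        if counter == 0:                   # step 3.3: counter exhausted?
            abandoned.append(task)         # step 3.4: stop distributing it
        else:
            queue.append((task, counter))  # requeue for redistribution
    return succeeded, abandoned
```

Because the counter strictly decreases on every failure, each atomic task is attempted at most `max_attempts` times, so a permanently failing task cannot circulate indefinitely.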

Further, on the basis of the technical scheme, different atomic task distribution rates are set for the tasks with different priorities, and the atomic task distribution rate of the task with the high priority is higher than that of the task with the low priority.
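One simple way to picture priority-dependent distribution rates is a per-round quota for each priority level. The sketch below is only an illustration: the three numeric levels, the quota values in `RATES`, and the round-based dispatch are hypothetical; the text only requires that higher-priority tasks have a higher distribution rate.

```python
from collections import deque

# Hypothetical quotas: atomic tasks released per dispatch round, by priority.
RATES = {3: 4, 2: 2, 1: 1}


def dispatch_round(queues, rates=RATES):
    """Release up to rates[p] atomic tasks from each priority queue p."""
    batch = []
    for priority in sorted(queues, reverse=True):   # highest priority first
        for _ in range(rates[priority]):
            if not queues[priority]:
                break
            batch.append(queues[priority].popleft())
    return batch
```

With these quotas a backlog of high-priority atomic tasks drains four times as fast as a low-priority backlog, while low-priority tasks still make progress in every round.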

On the other hand, the invention also provides a message-based distributed task scheduling device, which comprises a processor and a memory, wherein the memory is provided with a medium stored with program codes, and when the processor reads the program codes stored in the medium, the device can execute the method of the technical scheme.

By adopting the distributed task distribution scheduling method and device based on the message, the problems that the task distribution scheduling technology in the prior art lacks a task segmentation function and the task distribution scheduling control is inflexible are solved, and the following technical effects are realized: (1) the method can segment and extract the parameter attributes of the detection tasks, and different types of tasks freely define task attribute combinations, so that the task segmentation is realized, the coupling between detection plug-ins is effectively reduced, and the system distribution scheduling efficiency is improved; (2) the method can set different priorities for different tasks, and set different distribution scheduling strategies according to the priorities, thereby enhancing the task scheduling flexibility; (3) the method can retry the tasks which fail to be executed for a limited number of times, thereby improving the task scheduling reliability of the system and enhancing the user experience.

Drawings

Fig. 1 is a schematic flow chart of the message-based distributed task distribution scheduling method according to the present invention;

Fig. 2 is a schematic diagram illustrating task classification based on task attributes in the message-based distributed task distribution scheduling method according to the present invention;

Fig. 3 is a schematic diagram of load balancing in the message-based distributed task distribution scheduling method according to the present invention;

Fig. 4 is a schematic diagram of automatic retry distribution of a task in the message-based distributed task distribution scheduling method according to the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings.

Existing distributed task distribution scheduling has two problems. First, because a task segmentation function is lacking, the function logic of the detection plug-ins is complex and the functions of different detection plug-ins overlap, which reduces the overall efficiency of the system. Second, different task priorities are not distinguished and a management and control mechanism for failed tasks is lacking, so that high-priority or urgent tasks cannot be executed preferentially, and the results of failed tasks are lost or the tasks become zombie tasks that cannot end normally, which reduces task scheduling reliability and harms the user experience.

In view of these problems, the message-based distributed task distribution scheduling method provided by the invention segments tasks by extracting their atomic attributes, and customizes tasks of specific types by combining those atomic attributes; schedules high-priority tasks preferentially according to task priority while making full use of the detection resources of the system during idle time, thereby improving system performance; and retries failed tasks a limited number of times according to the task execution status, so as to avoid tasks failing to complete normally because of node failure or line failure.

The present invention is further described below in order to facilitate understanding of its inventive concept and technical solutions. The following examples are typical but non-limiting embodiments of the invention. The embodiments listed in this description are exemplary embodiments given only for convenience of description; they should not be construed as the only correct embodiments of the invention or as limiting its scope.

Referring to fig. 1, the distributed task distribution scheduling method based on messages provided by the present invention mainly includes:

step 1: acquiring an overall attribute value of a task and determining the category of the task according to the overall attribute value; based on the category of the task, using a detection plug-in to detect and acquire the attributes of the task and their corresponding attribute values; extracting features from the attributes and attribute values of the task and of the historical tasks of the category to which the task belongs, so as to obtain feature vector representations of the task and of the historical tasks of the category; dividing the historical tasks of the category into at least one subclass according to the different modes used for dividing their subtasks; comparing the feature vector of the task with the mean feature vector of the historical tasks in each subclass to obtain a deviation value, and determining the subclass with the minimum deviation value; comparing that deviation value with a preset relevance threshold; if the deviation value is smaller than or equal to the preset relevance threshold, the task belongs to the subclass with the minimum deviation value, the task is segmented according to the subtask division mode of the historical tasks in that subclass, that is, segmented into a plurality of atomic tasks, and the task is marked as a historical task; if the deviation value is larger than the preset relevance threshold, a new subclass is established, the task is added to the new subclass, semantic analysis is performed on the task, the task is divided into subtasks according to the result of the semantic analysis, that is, divided into a plurality of atomic tasks, and the task is marked as a historical task.

According to one embodiment of the invention, historical data of the various tasks is stored: the tasks of each category are divided into at least one subclass according to their different task division modes, and the historical data of each task is stored in the corresponding subclass.

The overall attribute value of a new task is then obtained by treating the task as a whole and performing a hash calculation on all the information in the task. For example, if the new task is a specific URL address, the value obtained by hashing the URL address is the overall attribute value of the new task, and the task is determined to belong to the URL class according to that value. Each attribute of the task and its corresponding attribute value are then acquired; based on the category of the task, the detection plug-in can detect and acquire these attributes and attribute values. Next, the previously stored historical data of processed URL tasks is retrieved, the attributes and corresponding attribute values of each historical task are obtained from it, and the new task is clustered against the subclasses of the category to determine the subclass closest to the new task. Feature vector representations of the new task and of the historical tasks are obtained by extracting features from their attributes and attribute values. For each subclass, the mean feature vector of its historical tasks is calculated from the feature vectors of all the historical tasks it contains. The deviation between the feature vector of the new task and the mean feature vector of each subclass is computed, and the subclass with the minimum deviation value is determined. That deviation value is compared with a preset relevance threshold. If the deviation value is less than or equal to the preset relevance threshold, the new task belongs to the subclass with the minimum deviation value; the new task is then segmented according to the subtask division mode of the historical tasks in that subclass and is marked as a historical task. If the deviation value is larger than the preset relevance threshold, a new subclass is established, the new task is added to it, semantic analysis is performed on the task, the new task is divided into subtasks according to the result of the semantic analysis, and the new task is marked as a historical task.
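The subclass-matching procedure described above can be sketched as a nearest-mean comparison. The text does not specify the deviation measure, so Euclidean distance is assumed here; the function names and the naming scheme for newly created subclasses are likewise hypothetical, and feature extraction and semantic analysis are left out.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equally sized feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def assign_subclass(task_vec, subclasses, threshold):
    """Place a task in the closest subclass or create a new one.

    `subclasses` maps a subclass name to the feature vectors of its
    historical tasks. Returns (subclass_name, created_new).
    """
    best_name, best_dev = None, math.inf
    for name, history in subclasses.items():
        # Assumption: deviation is the Euclidean distance to the subclass mean.
        dev = math.dist(task_vec, mean_vector(history))
        if dev < best_dev:
            best_name, best_dev = name, dev
    if best_name is not None and best_dev <= threshold:
        subclasses[best_name].append(task_vec)   # task becomes a historical task
        return best_name, False
    new_name = f"subclass-{len(subclasses) + 1}"  # hypothetical naming scheme
    subclasses[new_name] = [task_vec]
    return new_name, True
```

When `created_new` is false, the caller would reuse the subtask division mode of the matched subclass; when it is true, the caller would fall back to semantic analysis to divide the task.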

A more preferred embodiment of step 1 of the present invention abstracts the task input into three categories: the URL class, the site class, and the IP class. Task attributes are extracted separately for each of the three input types. In addition, new types of tasks can be constructed by combining the input attributes of the three classes. Task segmentation and task distribution scheduling are performed synchronously: given an initial input, a specific detection plug-in is scheduled to detect and obtain each attribute of the task, and once the task attributes are obtained, the detection plug-ins that take those attributes as input are scheduled for detection.

Referring to Fig. 2, a preferred embodiment abstracts task inputs into three classes: the URL class, the site class, and the IP class. Task attributes are extracted separately for each of the three input types. The URL class input attribute is a tuple formed by { url, method, data, referrer, html } and the like, wherein the url attribute identifies a website address, the method attribute identifies the HTTP method adopted for website access, the data attribute identifies a website input parameter, the referrer attribute identifies the referrer (HTTP Referer) field in an access message, and the html attribute identifies the html data returned by the website. The site class input attribute is a tuple formed by { host, host_main, IP, port, header, html, title, html_encode, metadata } and the like, wherein the host attribute identifies the site domain name, the host_main attribute identifies the site main domain name, the IP attribute identifies the site IP address, the port attribute identifies the site service port, the header attribute identifies the site access message header information, the title attribute identifies the site title, the html_encode attribute identifies the site html document encoding, and the metadata attribute identifies the site html document metadata. The IP class input attribute is a 2-tuple formed by { IP, IP_info } and the like, wherein the IP attribute identifies an IP address and the IP_info attribute identifies the geographical location information of the IP address. As shown in Fig. 1. Different tasks are constructed by combining the input attributes of the three classes. Task segmentation and task distribution scheduling are performed synchronously: given an initial input, a specific detection plug-in is scheduled to detect and obtain each attribute of the task, and once the task attributes are obtained, the detection plug-ins that take those attributes as input are scheduled for detection.
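The attribute-driven chaining described above, where a plug-in runs as soon as the attributes it needs are available and its outputs may unlock further plug-ins, can be sketched as follows. The plug-in tuples, the function `run_plugins`, and the placeholder values returned by the example plug-ins are all hypothetical; real detection plug-ins would perform network lookups.

```python
from urllib.parse import urlsplit


def run_plugins(initial, plugins):
    """Run each plug-in once its required input attributes are available.

    `plugins` is a list of (name, required_inputs, function); each function
    takes the current attribute dict and returns newly detected attributes.
    Returns (attributes, execution_order).
    """
    attrs = dict(initial)
    remaining = list(plugins)
    executed = []
    progress = True
    while remaining and progress:
        progress = False
        for plugin in list(remaining):
            name, inputs, func = plugin
            if all(key in attrs for key in inputs):
                attrs.update(func(attrs))      # new attributes may unlock others
                executed.append(name)
                remaining.remove(plugin)
                progress = True
    return attrs, executed


# Illustrative plug-ins with placeholder outputs (no real lookups are made).
EXAMPLE_PLUGINS = [
    ("geo", ["ip"], lambda a: {"ip_info": "unknown"}),
    ("resolve", ["host"], lambda a: {"ip": "192.0.2.10"}),
    ("split_url", ["url"], lambda a: {"host": urlsplit(a["url"]).hostname}),
]
```

Starting from a single `url` attribute, the example plug-ins run in the order `split_url`, `resolve`, `geo`, regardless of the order in which they are listed.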

A more preferred embodiment of step 2 of the present invention implements task load balancing based on a message queue, with a message server responsible for distributing tasks to the working nodes. When a task is started, it is divided into a plurality of atomic tasks by the task segmentation method described above, and the atomic tasks are distributed to the working nodes. The message server preferentially allocates tasks to idle working nodes, and calculates the task amount and load of each working node according to the resources consumed by the detection plug-ins and the resources held by the working node, where the resources include computing resources, storage resources, network resources, thread resources, session resources and the like. In addition, task priority is set to three levels; the message server sets a different distribution rate threshold for the tasks of each level according to task priority, and high-priority tasks are preferentially distributed to working nodes with low workloads so that they are executed more quickly.

Referring to Fig. 3, task load balancing is implemented based on a message queue, with a message server responsible for distributing tasks to the working nodes. When an original task is started, it is divided into a plurality of atomic tasks by the task segmentation method described above, and the atomic tasks are distributed to the working nodes. The load balancing method is as follows. First, the message server preferentially distributes tasks to idle working nodes. Second, the message server calculates the load of each working node according to the resources consumed by the detection plug-ins and the resources held by the working node, where the resources include computing resources, storage resources, network resources, thread resources, session resources and the like. In addition, the priority of original tasks is set to three levels; the message server sets a different distribution rate threshold for the original tasks of each level according to their priority, and the atomic tasks belonging to high-priority original tasks are preferentially distributed to working nodes with lower workloads so that they are executed more quickly.

A more preferred embodiment of step 3 of the present invention implements task retry using a task execution counter. Initially, the execution counter of each task is assigned a specific threshold, which is distributed to the working node along with the task. If the working node fails to execute the task merely because of network fluctuation or some other problem, the task execution counter is decremented by 1 and the task returns to the to-be-distributed task message queue of the message server. The message server distributes the task again, until either the task returns correct data or the task execution counter reaches 0, at which point distribution stops automatically.

Referring to Fig. 4, task retry after a task failure can be implemented at the atomic task level. Task retry is implemented using a task execution counter and proceeds as follows. Initially, the execution counter of each atomic task is assigned a specific threshold, which is distributed to the working node along with the task. If the working node fails to execute the atomic task merely because of network fluctuation or some other problem, the working node decrements the current task execution counter by 1 and sends it to the message server, and the message server places the returned atomic task in the to-be-distributed task queue to queue for redistribution. The message server distributes the atomic task again, until either the atomic task returns correct data or the task execution counter reaches 0, at which point distribution stops automatically.

It will be evident to those skilled in the art that the embodiments of the present invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention are capable of being embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. Several units, modules or means recited in the system, apparatus or terminal claims may also be implemented by one and the same unit, module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention and not for limiting, and although the embodiments of the present invention are described in detail with reference to the above preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the embodiments of the present invention without departing from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A distributed task distribution scheduling method based on messages is characterized by comprising the following steps:

step 2: dividing the task into a plurality of atomic tasks and synchronously distributing and scheduling the atomic tasks; message queues are established between a message server and a plurality of working nodes; setting task priorities to three levels, setting different distribution rate thresholds for tasks of each level by the message server according to the task priorities, distributing the tasks with high priorities to the working nodes by the message queue preferentially, and executing the corresponding atomic tasks by the working nodes;

2. The method of claim 1, wherein the categories include at least one of: uniform resource locator URL class, site class, internet protocol address IP class.

3. The method of claim 2, wherein:

the URL class includes attributes: url, method, data, referrer and html, wherein the url attribute identifies a website address, the method attribute identifies an HTTP method adopted by website access, the data attribute identifies a website input parameter, the referrer attribute identifies a recommender field in an access message, and the html attribute identifies html data returned by the website;

4. The method of claim 3, wherein the categories are further customizable based on a combination of the attributes.

5. The method of claim 3, wherein assigning to the plurality of worker nodes comprises the steps of:

6. The method of claim 5, wherein determining whether there is a working node with a task load below a specific threshold comprises the steps of:

detecting resources consumed by the plug-ins on the working nodes;

detecting resources held by the working nodes;

7. The method of claim 6, wherein the selecting at least one of the plurality of atomic tasks in step 2.1 and step 2.2 comprises the steps of:

8. The method of claim 7, wherein said step 3 comprises:

if yes, go to step 3.4;

step 3.4: stopping distributing the atomic task.

9. The method of claim 8, wherein different atomic task distribution rates are set for tasks of different priorities, the atomic task distribution rate for a task with a high priority being higher than the atomic task distribution rate for a task with a low priority.

10. A message-based distributed task scheduling apparatus comprising a processor and a memory, the memory having a medium with program code stored therein, the apparatus being capable of performing the method of any of claims 1-9 when the processor reads the program code stored in the medium.