CN117234733A - Distributed system task allocation method, system, storage medium and equipment - Google Patents

Distributed system task allocation method, system, storage medium and equipment

Info

Publication number
CN117234733A
Authority
CN
China
Prior art keywords
nodes
resource
ros2
service
tasks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311270711.6A
Other languages
Chinese (zh)
Inventor
刘宏刚
张清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202311270711.6A
Publication of CN117234733A
Legal status: Pending


Abstract

The invention provides a distributed system task allocation method, system, storage medium and equipment. The method comprises the following steps: evaluating the resource usage of each service according to tests of the equipment and of each service of the distributed system task, so as to quantify the computing requirements of each service; clustering the ROS2 nodes in the services, dividing the ROS2 nodes into a plurality of different categories according to their communication relationships, and, based on the resource usage and quantified computing requirements of each service, determining node correlation and judging how resource-intensive each category is; placing related nodes of the different categories in a ROS2 container and/or on the same device, thereby reducing communication delay between nodes; performing CPU core binding for the resource-intensive categories and partitioning the resources of the equipment so that each resource partition is used only for planning tasks or only for control tasks; and assigning thread priorities to the execution threads of the different categories according to the importance and real-time requirements of the distributed system tasks.

Description

Distributed system task allocation method, system, storage medium and equipment
Technical Field
The invention relates to the technical field of robots, in particular to the technical field of distributed system task allocation of a robot operating system, and particularly relates to a distributed system task allocation method, a system, a storage medium and equipment based on ROS 2.
Background
ROS2 (ROS stands for Robot Operating System) is a software development kit that helps developers build robot application software. It is widely adopted in industry and academia because of its strong real-time performance, free and open-source licensing, and support for distributed computing and communication, and it is now used in fields such as industrial robots, service robots and autonomous driving.
In ROS2-based distributed computing systems, load balancing means distributing the workload evenly among multiple computing nodes to improve the reliability and real-time performance of the system. With load balancing, each computing device can use its resources efficiently, avoiding both idle resources and overload.
Current ROS2-based distributed system task allocation methods mainly have the following defects:
1. Coarse-grained, task-level allocation suffers from resource preemption, so the real-time performance and stability of key tasks cannot be guaranteed, which creates system risk.
2. Unreasonable task allocation leads to high overall task delay in the distributed system, so the real-time requirements of the service cannot be met.
Therefore, an optimized task allocation method for distributed systems is needed. It should solve the resource preemption problem of coarse-grained task-level allocation, so that the real-time performance and stability of key tasks are guaranteed, and it should reduce the overall task delay caused by unreasonable task allocation, so that the real-time requirements of the service can be met.
Disclosure of Invention
It is therefore an object of the present invention to provide an improved method, system, storage medium and device for distributed system task allocation, in particular based on ROS2, which solves the above-described problems of the prior art.
Based on the above object, in one aspect, the present invention provides a distributed system task allocation method, wherein the method includes the following steps:
evaluating the resource use condition of each service according to the test condition of each service of the equipment and the distributed system task so as to quantify the calculation requirement of each service;
clustering the ROS2 nodes in the services, dividing the ROS2 nodes into a plurality of different categories according to their communication relationships, and, based on the resource usage and quantified computing requirements of each service, determining node correlation and judging how resource-intensive each category is;
placing related nodes in different categories in a ROS2 container and/or in the same device, thereby reducing communication delay between nodes;
performing CPU core binding for resource intensive categories in different categories and dividing the resources of the equipment so that each part of the resources are only used for planning tasks or control tasks;
thread priorities are allocated to the execution threads of different categories according to the importance and real-time requirements of the distributed system tasks.
In some embodiments of the distributed system task allocation method according to the present invention, evaluating the resource usage of each service according to the test condition of each service of the device and the distributed system task to quantify the computing requirements of each service further includes:
unit testing of different tasks is carried out on the computing equipment layer of the distributed system, and the CPU utilization rate, the memory usage amount and the bandwidth use requirement of each task are defined according to the testing result.
In some embodiments of the distributed system task allocation method according to the present invention, clustering each ROS2 node in a service, classifying the ROS2 nodes into a plurality of different classes according to a communication relationship, and determining node correlation and determining resource densities of the different classes based on resource usage and quantized computational requirements of each service further includes:
based on different application tasks of the ROS2, a computational graph is formed, the computational graph describes communication and data flow among ROS2 nodes, the nodes are clustered aiming at the computational graph, the clustered types are the same as the number of the computational nodes of the distributed system, and the ROS2 nodes in different types after the clustering are distributed on different computing devices.
In some embodiments of the distributed system task allocation method according to the present invention, placing related nodes in different categories in the ROS2 container and/or in the same device, thereby reducing inter-node communication latency further comprises:
nodes in different processes whose inter-node communication delay is above the node latency threshold are combined using a ROS2 container, so that their inter-node communication is configured as intra-process communication.
In some embodiments of the distributed system task allocation method according to the present invention, placing related nodes in different categories in the ROS2 container and/or in the same device, thereby reducing inter-node communication latency further comprises:
for nodes in different processes on different devices whose communication delay is above the device latency threshold, the nodes are allocated to the same device, so that their inter-node communication is configured as communication between nodes on the same device.
In some embodiments of the distributed system task allocation method according to the present invention, performing CPU core binding and partitioning resources of the device for resource intensive categories in different categories such that each portion of the resources is used only for planning tasks or control tasks further comprises:
and binding the first resource-intensive category with a first part of cores of the CPU and binding the second resource-intensive category with a second part of cores of the CPU according to the resource-intensive degree.
In some embodiments of the distributed system task allocation method according to the present invention, performing CPU core binding and partitioning resources of the device for resource intensive categories in different categories such that each portion of the resources is used only for planning tasks or control tasks further comprises:
the CPU and the memory of the device are partitioned by control groups into two resource parts, wherein the first resource part is configured for planning-class tasks and the second resource part is configured for control-class tasks.
In some embodiments of the distributed system task allocation method according to the present invention, an execution duration is preset for an execution thread;
monitoring the execution condition of an execution thread and acquiring a preset execution time length corresponding to the execution thread;
if the returned confirmation information of the execution completion of the execution thread is not received at the end of the execution duration corresponding to the preset execution duration, sending overtime alarm information;
the overtime alarm information is used for prompting a user whether to terminate the target working thread or not; and
acquiring an abnormal alarm log sent by the distributed equipment and counting the alarm probability of each statistical period according to the abnormal alarm log;
and generating an alarm probability curve according to the alarm probability of each statistical period, wherein the alarm probability of a statistical period is the ratio between the total number of abnormal alarms in the statistical period and the duration corresponding to the statistical period.
In another aspect of the present invention, there is also provided a distributed system task allocation system, including:
the equipment layer module is configured to evaluate the resource use condition of each service according to the test condition of each service of equipment and distributed system tasks so as to quantify the calculation requirements of each service;
the task layer module is configured to cluster the ROS2 nodes in the services, divide the ROS2 nodes into a plurality of different categories according to their communication relationships, and, based on the resource usage and quantified computing requirements of each service, determine node correlation and judge how resource-intensive each category is;
A node layer module configured to place related nodes in different categories in a ROS2 container and/or in the same device, thereby reducing inter-node communication latency;
the process layer module is configured to bind CPU cores aiming at resource-intensive categories in different categories and divide the resources of the equipment so that each part of the resources are only used for planning tasks or controlling tasks;
and the thread layer module is configured to allocate thread priorities for execution threads of different categories according to the importance and real-time requirements of the distributed system tasks.
In yet another aspect of the present invention, there is also provided a computer readable storage medium storing computer program instructions that when executed implement any of the above methods for distributed system task allocation according to the present invention.
In yet another aspect of the present invention, there is also provided a computer device including a memory and a processor, the memory storing a computer program which, when executed by the processor, performs any of the above methods of distributed system task allocation according to the present invention.
The invention has at least the following beneficial technical effects: the method is different from the existing distributed task allocation method, the two-stage task division with coarse granularity and fine granularity is used, meanwhile, the factors such as communication delay and calculation delay are considered in the division process, and by the method, the load balance in the distributed computing system can be effectively ensured, the instantaneity and the reliability of key tasks are ensured, and meanwhile, the computing resources can be fully and effectively utilized. The method can be effectively applied to the field of the distributed computing system based on the ROS2, and has higher practical value in the edge application of automatic driving, robots and the like.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other embodiments may be obtained according to these drawings without inventive effort for a person skilled in the art.
In the figure:
FIG. 1 shows a schematic block diagram of an embodiment of a distributed system task allocation method according to the present invention;
FIG. 2 shows a schematic block diagram of an embodiment of a distributed system task allocation method according to the present invention;
FIG. 3 illustrates an exemplary business operational relationship diagram of an embodiment of a distributed system task allocation method according to the present invention;
FIG. 4 illustrates a schematic block diagram of an embodiment of a distributed system task allocation system in accordance with the present invention;
FIG. 5 illustrates a schematic diagram of an embodiment of a computer readable storage medium implementing a distributed system task allocation method in accordance with the present invention;
FIG. 6 shows a schematic hardware architecture diagram of an embodiment of a computer device implementing a distributed system task allocation method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
It should be noted that, in the embodiments of the present invention, the expressions "first" and "second" are used to distinguish two different entities or parameters that share the same name. They are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "comprise" and "have," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system or article that comprises a list of steps or units may also include other steps or units.
The invention aims to provide a distributed system task allocation calculation method and system based on ROS2, which are used for solving the technical problems mentioned in the background art and improving the use efficiency in the development and management of a robot system.
In order to achieve the above purpose, the present invention mainly introduces a task allocation method in a distributed system in detail from five layers of a device layer, a task layer, a node layer, a process layer and a thread layer, and specifically includes the following steps:
1. Device layer: the computing device is the computing carrier for task allocation, and the CPU, memory, storage and network bandwidth of the device are the key metrics for task computation and allocation. Designing task allocation around hardware-software co-optimization is the first step: unit tests of the different tasks are run at the computing device layer, and the CPU, memory, bandwidth and other requirements of each task are determined from the test results.
2. Task layer: the task layer mainly considers the balance of the computing power of the device and the computing requirement of the task, and aims to perform first coarse-grained allocation on different distributed devices. The different application tasks of the ROS2 form a calculation graph, the calculation graph describes communication and data flow among the ROS2 nodes, the nodes are clustered aiming at the calculation graph, and the clustering type is the same as the number of the calculation nodes of the distributed system. And distributing the clustered ROS2 nodes of different categories on different computing devices, so as to realize the first coarse-grained task distribution.
3. Node layer: ROS2 nodes are basic execution units of tasks, and the communication time between ROS2 nodes is strongly related to the data volume of node communication, so that the communication delay between nodes is an important consideration. The communication delay between nodes is divided into three types, namely, the communication between nodes in the same process, the communication between nodes in different processes of the same equipment and the communication between nodes in different processes of different equipment. Aiming at specific actual service demands and task allocation demands, the following optimization method is mainly considered based on the allocation result of a task layer: 1) For nodes with higher communication delay of nodes in different processes, a ROS2 container mode is used to combine the nodes into a container, so that the communication mode is changed into intra-process communication; 2) For nodes with higher communication delay of different processes of different devices, different nodes are distributed to the same device, so that the communication mode of the nodes is changed into node communication in the same device. Through the two modes, the communication time delay of the nodes can be optimized, so that the real-time performance of the system is improved.
4. Process layer: the process is the minimum unit for executing task units, and the reliability and real-time operation of the process can be realized through the resource isolation and the CPU core binding of the process. Firstly, grouping resources in the equipment, considering the resource use and preemption condition of the process in the equipment, and allocating resources by using a control group (cgroup) for important nodes so as to control and guarantee the resource use of the related process. On this basis, the CPU core binding is used to bind the process to the specific CPU core, so as to reduce the scheduling and switching of the process between the CPU cores. By comprehensively using the two modes, the process resource utilization on a single device can be ensured, the resource allocation and scheduling of the process level in the device are realized, and the real-time requirement of the system is ensured.
5. Thread layer: the scheduling order of multiple threads within a single process can be controlled by assigning thread priorities. Because the ROS2 nodes of a task differ in functional and computational attributes, their threads can be given different priorities. Priorities are configured according to the importance and real-time requirements of each thread, on the premise that thread safety and stability are ensured. By using the rclcpp library provided by ROS2, the priority of threads can be set.
Based on this, a first aspect of the present invention provides a distributed system task allocation method 100. FIG. 2 shows a schematic block diagram of an embodiment of a distributed system task allocation method according to the present invention. In the embodiment shown in FIG. 2, the method comprises:
step S110: evaluating the resource use condition of each service according to the test condition of each service of the equipment and the distributed system task so as to quantify the calculation requirement of each service;
step S120: clustering the ROS2 nodes in the services, dividing the ROS2 nodes into a plurality of different categories according to their communication relationships, and, based on the resource usage and quantified computing requirements of each service, determining node correlation and judging how resource-intensive each category is;
step S130: placing related nodes in different categories in a ROS2 container and/or in the same device, thereby reducing communication delay between nodes;
step S140: performing CPU core binding for resource intensive categories in different categories and dividing the resources of the equipment so that each part of the resources are only used for planning tasks or control tasks;
step S150: thread priorities are allocated to the execution threads of different categories according to the importance and real-time requirements of the distributed system tasks.
In general, to address the above problems in the prior art, the present invention optimizes at the node layer, the process layer and the thread layer respectively. These three layers of optimization first require corresponding preparation. Therefore, in step S110, at the device layer, which serves as the computing carrier, unit tests of the different tasks are run at the computing device layer for the purpose of hardware-software co-optimization; that is, the resource usage of each service is evaluated according to tests of the device and of each service of the distributed system task, so as to quantify the computing requirements of each service.
Subsequently, at the task layer, considering the balance between the computing power of the devices and the computing requirements of the tasks, the ROS2 nodes in the services are clustered in step S120 and divided into a plurality of different categories according to their communication relationships. The goal of step S120 is to perform a first, coarse-grained allocation of tasks across the different distributed devices. While this first coarse-grained allocation is completed, node correlation is determined and the resource intensity of each category is judged based on the resource usage and quantified computing requirements of each service, which provides the basis for further optimization at the node layer, the process layer and the thread layer.
At the node level, each node is a basic execution unit of a task. On the basis of realizing the first coarse-grained task allocation, the communication time and the communication data volume of the nodes are strongly related in consideration of the communication between the nodes, so that the communication time delay between the nodes becomes an important factor affecting the task allocation. Thus, for node layers, particularly inter-node communications of different processes of the same device and inter-node communications of different processes of different devices, relevant nodes in different categories are placed in the ROS2 container and/or in the same device in step S130, thereby reducing inter-node communication latency.
While at the process level, a process is the smallest unit of execution task units. On the basis of realizing first coarse-grained task allocation, the conditions of scheduling and switching among cores of processes in the equipment and the process in the running process are considered, and the reliability and real-time running of the processes are further optimized through the resource isolation of the processes and the CPU core binding. For this, in step S140, CPU core binding is performed for resource intensive categories among different categories and the resources of the device are divided such that each part of the resources is used only for planning tasks or control tasks.
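The resource division described above can be illustrated with a small planning helper. This is a minimal sketch, not the patented implementation: it assumes Linux cgroup v2, whose interface files `cpuset.cpus` and `memory.max` hold a CPU list and a memory limit in bytes, and it only computes the values to write rather than touching `/sys/fs/cgroup`. The function name `cgroup_plan`, the group names and the `planning_share` parameter are hypothetical.

```python
def cgroup_plan(total_cores, total_memory_bytes, planning_share):
    """Split cores and memory into a planning part and a control part.

    Returns the values that could be written to the cgroup v2 interface
    files `cpuset.cpus` and `memory.max` of two (hypothetical) groups.
    """
    if total_cores < 2:
        raise ValueError("need at least two cores to form two partitions")
    if not 0 < planning_share < 1:
        raise ValueError("planning_share must be strictly between 0 and 1")

    # Give each partition at least one core.
    n_planning = max(1, min(total_cores - 1, round(total_cores * planning_share)))
    planning_mem = int(total_memory_bytes * planning_share)

    def cpu_range(first, last):
        return str(first) if first == last else f"{first}-{last}"

    return {
        "planning": {
            "cpuset.cpus": cpu_range(0, n_planning - 1),
            "memory.max": str(planning_mem),
        },
        "control": {
            "cpuset.cpus": cpu_range(n_planning, total_cores - 1),
            "memory.max": str(total_memory_bytes - planning_mem),
        },
    }


# Example: 8 cores and 16 GiB, with three quarters reserved for planning.
plan = cgroup_plan(8, 16 * 1024**3, 0.75)
```

Writing these values into real cgroup directories requires root privileges and an enabled `cpuset` controller, which is why the sketch stops at computing the plan.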
Finally, at a thread layer, a plurality of threads in a single process can realize thread scheduling sequence and priority by distributing priority, and the priority configuration is carried out on the threads on the premise of ensuring the thread safety and stability according to the importance and real-time requirements of the threads. Thus, in step S150, thread priorities are assigned to each of the different classes of execution threads based on the importance and real-time requirements of the distributed system tasks.
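One way to turn importance and real-time requirements into concrete priority values is sketched below. The ranking rule (importance first, then tighter deadline) and the names `ThreadSpec` and `assign_priorities` are assumptions made for illustration; the source does not specify a formula. The default range 1 to 99 matches the priority range of the Linux `SCHED_FIFO` policy, which is also an assumption about the target platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadSpec:
    name: str
    importance: int     # larger means more important (hypothetical scale)
    deadline_ms: float  # tighter deadline means stricter real-time requirement


def assign_priorities(threads, lowest=1, highest=99):
    """Rank threads by importance, then by tighter deadline, and spread
    them evenly over [lowest, highest]; the top-ranked thread gets `highest`."""
    ranked = sorted(threads, key=lambda t: (-t.importance, t.deadline_ms))
    if len(ranked) == 1:
        return {ranked[0].name: highest}
    step = (highest - lowest) / (len(ranked) - 1)
    return {t.name: round(highest - i * step) for i, t in enumerate(ranked)}


specs = [
    ThreadSpec("planner", importance=2, deadline_ms=50.0),
    ThreadSpec("control_loop", importance=3, deadline_ms=5.0),
    ThreadSpec("logger", importance=1, deadline_ms=1000.0),
]
priorities = assign_priorities(specs)
```

The resulting mapping would still have to be applied to real threads by the platform's scheduling interface, which typically needs elevated privileges for real-time policies.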
On the basis of distributing thread priority to different types of execution threads, the execution duration can be preset for the execution threads; then monitoring the execution condition of the execution thread and acquiring the preset execution time length corresponding to the execution thread; if the returned confirmation information of the execution completion of the execution thread is not received when the execution duration corresponding to the preset execution duration is over, sending overtime alarm information; the overtime alarm information is used for prompting a user whether to terminate the target working thread or not; obtaining an abnormal alarm log sent by the distributed equipment and counting the alarm probability of each counting period according to the abnormal alarm log; and generating an alarm probability curve according to the alarm probability of each statistical period, wherein the alarm probability curve is the ratio between the total number of abnormal alarms in the statistical period and the duration corresponding to the statistical period.
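The timeout check in this paragraph can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions: the helper name `run_with_timeout` and the alarm message format are hypothetical, and a completed future stands in for the "confirmation information" returned by the execution thread.

```python
import concurrent.futures
import time


def run_with_timeout(task, preset_seconds):
    """Run `task` in a worker thread and wait up to `preset_seconds`.

    Returns (result, alarm). `alarm` is None when the task confirmed
    completion in time, otherwise a timeout alarm message.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(task)
    try:
        return future.result(timeout=preset_seconds), None
    except concurrent.futures.TimeoutError:
        return None, f"timeout: no completion confirmation within {preset_seconds}s"
    finally:
        # Do not block here; the worker keeps running until the user decides.
        pool.shutdown(wait=False)


ok_result, ok_alarm = run_with_timeout(lambda: 42, 1.0)
_, late_alarm = run_with_timeout(lambda: time.sleep(0.3), 0.05)
```

Python cannot forcibly stop a running thread, so the sketch only raises the alarm, which matches the text: the alarm prompts the user to decide whether to terminate the worker.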
The server can count the total number of abnormal alarm logs in each statistical period according to the creation time of the logs. The statistical period can be measured in hours and set according to actual conditions. In one embodiment, 1 hour may be taken as 1 statistical period: the server determines the statistical period of each abnormal alarm from the creation time of its log, and calculates the alarm probability of each statistical period from the total number of abnormal alarms in that period and the duration of the period. The alarm probability of a statistical period is the ratio between the total number of abnormal alarms in the statistical period and the duration corresponding to the statistical period. The alarm probability curve records the alarm probability of each statistical period and intuitively displays the peak periods of abnormal alarms. The contemporaneous alarm probability, that is, the alarm probability of the same time period, can also be predicted from the alarm probability curve.
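The per-period calculation can be written out in a few lines. This is a minimal sketch that assumes a 1-hour statistical period, as in the embodiment above, and assumes alarm creation times are available as `datetime` objects; the function name `alarm_probability_curve` is hypothetical.

```python
from collections import Counter
from datetime import datetime

PERIOD_HOURS = 1.0  # one statistical period is one hour in this sketch


def alarm_probability_curve(alarm_times):
    """Map each hourly period start to alarms per hour in that period."""
    counts = Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in alarm_times
    )
    return {start: counts[start] / PERIOD_HOURS for start in sorted(counts)}


log_times = [
    datetime(2023, 9, 28, 10, 5),
    datetime(2023, 9, 28, 10, 40),
    datetime(2023, 9, 28, 11, 15),
]
curve = alarm_probability_curve(log_times)
```

Periods with no alarms are simply absent from the result; a plotting step could fill them with zero to draw a continuous curve.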
In some embodiments of the distributed system task allocation method 100 according to the present invention, step S110 of evaluating the resource usage of each service according to tests of the device and of each service of the distributed system task, so as to quantify the computing requirements of each service, further includes: running unit tests of the different tasks at the computing device layer of the distributed system, and determining the CPU utilization, memory usage and bandwidth requirement of each task from the test results. Specifically, the device is the computing carrier for task allocation, and its CPU, memory, storage and network bandwidth are the key metrics for task computation and allocation. For the purpose of hardware-software co-optimization, unit tests of the different tasks are run at the computing device layer, and the CPU utilization, memory usage and bandwidth requirement of each task are determined from the test results. This quantifies the computing requirements of each service, so that the node layer, the process layer and the thread layer can be optimized in a targeted manner. Preferably, benchmark tests are run for the device and each service to evaluate the CPU utilization, memory usage, bandwidth requirement and similar metrics of each service.
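A benchmark of this kind could be approximated as follows. This is a rough sketch using only the Python standard library: it measures CPU time, wall time and peak Python heap allocation for a callable, and does not measure network bandwidth, which would need platform-specific counters. The names `ResourceProfile` and `profile_task` are hypothetical, and the workload is a stand-in for a real service.

```python
import time
import tracemalloc
from dataclasses import dataclass


@dataclass
class ResourceProfile:
    cpu_seconds: float
    wall_seconds: float
    peak_memory_bytes: int

    @property
    def cpu_utilization(self):
        """CPU time divided by wall time for the measured run."""
        return self.cpu_seconds / self.wall_seconds if self.wall_seconds > 0 else 0.0


def profile_task(task):
    """Run `task` once and record its CPU time, wall time and peak memory."""
    tracemalloc.start()
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    task()
    cpu_end, wall_end = time.process_time(), time.perf_counter()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return ResourceProfile(cpu_end - cpu_start, wall_end - wall_start, peak)


# Stand-in workload; a real test would invoke one service of the system.
profile = profile_task(lambda: sum(i * i for i in range(200_000)))
```

Repeating the measurement and averaging would give more stable figures than a single run.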
In some embodiments of the distributed system task allocation method 100 according to the present invention, step S120 of clustering the ROS2 nodes in the services, dividing the ROS2 nodes into a plurality of different categories according to their communication relationships, and determining node correlation and judging the resource intensity of each category based on the resource usage and quantified computing requirements of each service further comprises: forming a computation graph from the different application tasks of ROS2, where the computation graph describes the communication and data flow among ROS2 nodes; clustering the nodes of the computation graph so that the number of clusters equals the number of computing nodes of the distributed system; and distributing the ROS2 nodes of the different clusters onto different computing devices. Specifically, the task layer mainly considers the balance between the computing power of the devices and the computing requirements of the tasks, and aims to perform a first coarse-grained allocation across the different distributed devices. Distributing the clustered ROS2 nodes of each category onto a different computing device realizes this first coarse-grained task allocation.
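The source does not name a clustering algorithm, so the following is only one possible sketch: a greedy agglomerative heuristic that repeatedly merges the two groups joined by the heaviest remaining communication edge until the number of clusters equals the number of devices. The function name `cluster_nodes`, the node names and the traffic weights are hypothetical, and the heuristic deliberately ignores load balance to stay short.

```python
def cluster_nodes(nodes, edges, k):
    """Group graph nodes into at most k clusters.

    nodes: iterable of node names
    edges: dict mapping (a, b) to a communication weight
    k:     number of computing devices
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    count = len(parent)
    # Merge across the heaviest edges first so chatty nodes stay together.
    for (a, b), _weight in sorted(edges.items(), key=lambda e: -e[1]):
        if count <= k:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            count -= 1

    groups = {}
    for n in parent:
        groups.setdefault(find(n), set()).add(n)
    clusters = sorted(groups.values(), key=len)

    # If the graph was disconnected, fold the smallest clusters together.
    while len(clusters) > k:
        smallest = clusters.pop(0)
        clusters[0] |= smallest
        clusters.sort(key=len)
    return clusters


graph_edges = {
    ("camera", "detector"): 30.0,
    ("detector", "planner"): 2.0,
    ("planner", "controller"): 10.0,
}
clusters = cluster_nodes(
    ["camera", "detector", "planner", "controller"], graph_edges, k=2
)
```

A production version would also weigh each cluster's quantified CPU and memory demand against device capacity before accepting a merge.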
In some embodiments of the distributed system task allocation method 100 according to the present invention, step S130 of placing related nodes of the different categories in a ROS2 container and/or on the same device, thereby reducing inter-node communication delay, further comprises: combining nodes in different processes whose inter-node communication delay is above the node latency threshold using a ROS2 container, so that their inter-node communication is configured as intra-process communication. In addition, step S130 further comprises: for nodes in different processes on different devices whose communication delay is above the device latency threshold, allocating the nodes to the same device, so that their inter-node communication is configured as communication between nodes on the same device.
Specifically, ROS2 nodes are the basic execution units of tasks, and the communication time between ROS2 nodes is strongly correlated with the volume of data they exchange, so inter-node communication delay is an important factor to consider. Inter-node communication falls into three types: communication between nodes in the same process, communication between nodes in different processes on the same device, and communication between nodes in different processes on different devices. For the latter two types, and building on the allocation result of the task layer, the following optimizations are mainly considered: 1) nodes in different processes on the same device whose communication delay is high are combined into one ROS2 container, so that their communication becomes intra-process communication; 2) nodes in different processes on different devices whose communication delay is high are allocated to the same device, so that their communication becomes communication between nodes within the same device. These two measures reduce inter-node communication delay and thereby improve the real-time performance of the system.
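The two optimizations can be read as a per-link decision rule. The following Python sketch is one possible reading, not the patented implementation: the `Link` record, the action strings, and the millisecond thresholds are all hypothetical names and units introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """A communication link between two ROS2 nodes (hypothetical record)."""
    src: str
    dst: str
    same_device: bool
    same_process: bool
    delay_ms: float


def plan_link(link, node_threshold_ms, device_threshold_ms):
    """Return the placement action suggested for one link."""
    if link.same_process:
        # Already intra-process communication; nothing to optimize.
        return "keep"
    if link.same_device:
        # Different processes on one device: compose into one ROS2 container.
        if link.delay_ms > node_threshold_ms:
            return "compose_in_container"
        return "keep"
    # Different processes on different devices: co-locate on one device.
    if link.delay_ms > device_threshold_ms:
        return "move_to_same_device"
    return "keep"


if __name__ == "__main__":
    link = Link("lidar_driver", "detector", same_device=True,
                same_process=False, delay_ms=12.0)
    print(plan_link(link, node_threshold_ms=5.0, device_threshold_ms=20.0))
```

A link that is moved to the same device can then be evaluated again against the node threshold, so the two rules compose naturally.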
Further, in some embodiments of the distributed system task allocation method 100 according to the present invention, step S140 of performing CPU core binding for the resource-intensive categories among the different categories and dividing the resources of the device so that each portion of the resources is used only for planning tasks or only for control tasks further comprises step S141: according to the degree of resource intensity, binding a first resource-intensive category to a first subset of the CPU cores and binding a second resource-intensive category to a second subset of the CPU cores. Furthermore, in some embodiments, step S140 further comprises step S142: partitioning the CPU and the memory of the device by means of a control group into two resource portions, wherein the first resource portion is configured for planning-class tasks and the second resource portion is configured for control-class tasks.
Specifically, optimization at the process layer mainly consists of resource isolation and CPU core binding of processes, so that processes run reliably and in real time. On the one hand, CPU core binding pins a process to specific CPU cores, which reduces the scheduling and migration of the process between cores. Thus, in step S141, a first resource-intensive category is bound to a first subset of the CPU cores and a second resource-intensive category is bound to a second subset of the CPU cores, according to the degree of resource intensity. For example, the first resource-intensive category is bound to cores 0 to n of the CPU, and all or part of the second resource-intensive category is bound to cores n+1 to m of the CPU (n and m being positive integers less than or equal to the total number of CPU cores), which avoids resource preemption between processes and ensures the stability and real-time performance of the system. On the other hand, resource isolation of processes allocates resources to important nodes, so that the resource usage of the related processes is controlled and guaranteed. Therefore, in step S142, the CPU and the memory of the device are partitioned by a control group into two resource portions, the first configured for planning-class tasks and the second configured for control-class tasks. This ensures that control-class tasks run stably and in real time.
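As a rough illustration of the core binding in step S141, the sketch below splits the core indices into the two ranges described above and pins a process to one of them. The helper names `split_cores` and `bind_process` are hypothetical. `os.sched_setaffinity` is a Linux-only call in the Python standard library, so the binding helper does nothing on other platforms, and it only uses cores that the calling process is itself allowed to run on.

```python
import os


def split_cores(n, m):
    """Return two core-index sets: cores 0..n and cores n+1..m (inclusive)."""
    if not 0 < n < m:
        raise ValueError("expected 0 < n < m")
    return set(range(0, n + 1)), set(range(n + 1, m + 1))


def bind_process(pid, cores):
    """Pin process `pid` to `cores`; returns the cores actually used.

    Linux only. Cores outside the caller's own affinity mask are dropped,
    and an empty set is returned if nothing could be applied.
    """
    if not hasattr(os, "sched_setaffinity"):
        return set()
    usable = set(cores) & os.sched_getaffinity(0)
    if usable:
        os.sched_setaffinity(pid, usable)
    return usable


if __name__ == "__main__":
    first, second = split_cores(4, 11)
    print(sorted(first), sorted(second))
```

Calling `bind_process(pid, first)` for each process of the first resource-intensive category, and `bind_process(pid, second)` for the second, would then keep the two categories on disjoint cores.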
According to the ROS2-based distributed system task allocation method of the present invention, multiple application tasks are allocated within an ROS2-based distributed system, load balancing is achieved, and the reliability and real-time performance of system tasks are guaranteed. Compared with manually allocating tasks in a distributed system, the method is more efficient and easier to use and allocates tasks at a finer granularity; it also allocates and optimizes tasks at the process and thread levels, achieving resource isolation and load balancing and ensuring that key tasks run in real time.
To better describe the method and principles of the present invention, the method is described in further detail below in connection with an embodiment based on an autonomous driving framework.
The autonomous driving framework Autoware.Universe is an open-source software framework for autonomous driving. It comprises autonomous driving service modules such as control, perception, planning, and positioning, and provides a simulation test platform that developers and researchers can use to develop and test autonomous driving software. Autoware.Universe runs on a heterogeneous distributed computing platform, and the framework's multiple services and tasks need to be distributed across the computing devices to realize an efficient and stable distributed autonomous driving computing system. In this embodiment, the heterogeneous distributed computing platform mainly consists of two parts: one x86-architecture server and four ARM-architecture edge computing devices. The specific information of the devices is shown in Table 1 below:
Table 1:
The implementation process of this embodiment is as follows:
The communication relationships between the services of Autoware.Universe are shown in fig. 3; the services are mainly divided into sensor, perception, control, positioning, planning, and so on. First, benchmark tests are run on the devices and on each service to evaluate each service's CPU utilization, memory usage, network bandwidth usage, and the like, thereby quantifying the computing demand of each service.
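The benchmarking step can be sketched in Python as follows. This is only a stand-in: it profiles an in-process Python callable using the standard library, whereas a real ROS2 service would run as its own process and would be measured with system-level tools. The function name `profile_service` and the returned field names are assumptions, and network bandwidth is not measured here.

```python
import time
import tracemalloc


def profile_service(workload, *args):
    """Run `workload` once and return a rough resource profile."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()

    workload(*args)

    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        # CPU seconds consumed per wall-clock second of the run.
        "cpu_utilization": cpu_used / wall_used if wall_used > 0 else 0.0,
        # Peak Python-level allocation observed during the run.
        "peak_memory_bytes": peak_bytes,
        "wall_seconds": wall_used,
    }


if __name__ == "__main__":
    profile = profile_service(lambda: sum(i * i for i in range(200_000)))
    print(profile)
```

Collecting one such profile per service gives the quantified computing demand that the later clustering and binding steps rely on.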
Second, the ROS2 nodes of the Autoware.Universe services are clustered, and the ROS2 service nodes are divided into five categories according to their communication relationships: system, perception, sensor, positioning, and planning and control. This realizes the coarse-grained task division.
Coarse-grained partitioning classifies the services effectively, but considerable room for optimization remains. The perception and sensor nodes exchange large volumes of data; if these ROS2 nodes run as separate processes, communication delay increases greatly. Placing the related nodes in an ROS2 container turns their communication into intra-process communication and greatly reduces that delay. Likewise, if the sensor nodes and the perception nodes are allocated to two different Orin devices, the large data volume and the cross-device communication increase communication delay; allocating part of the perception nodes to the device where the sensor nodes reside effectively reduces it.
The process layer mainly involves two methods: resource isolation and CPU core binding. Because the sensor module and the perception module need to process lidar data, they occupy a large amount of CPU resources and are time-consuming; to ensure the real-time performance of the system, both resource isolation and CPU core binding are applied to them. For CPU core binding, the sensor category is bound to cores 0-4 of the CPU and part of the perception category is bound to cores 5-11, which avoids resource preemption between processes and ensures the stability and real-time performance of the system. For resource isolation, the CPU and memory of the Orin device are partitioned into two parts by means of cgroups, one part used for planning-class tasks and the other for control-class tasks.
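The cgroup partition can be sketched as a plan of values for the cgroup v2 interface files `cpuset.cpus` and `memory.max`. The split ratio is not given in this embodiment, so the `planning_share` parameter, the group names, and the function name `cgroup_plan` are assumptions. The sketch only computes the values; actually writing them under the cgroup filesystem requires root privileges and is left out.

```python
def cgroup_plan(total_cores, total_mem_bytes, planning_share=0.5):
    """Split cores and memory into a planning group and a control group.

    Returns, per group, the strings that would be written to the
    cgroup v2 files `cpuset.cpus` and `memory.max`.
    """
    if total_cores < 2:
        raise ValueError("need at least two cores to form two partitions")
    # Number of cores for planning, clamped so both groups get at least one.
    split = max(1, min(total_cores - 1, round(total_cores * planning_share)))
    planning_mem = int(total_mem_bytes * planning_share)
    return {
        "planning": {
            "cpuset.cpus": f"0-{split - 1}",
            "memory.max": str(planning_mem),
        },
        "control": {
            "cpuset.cpus": f"{split}-{total_cores - 1}",
            "memory.max": str(total_mem_bytes - planning_mem),
        },
    }


if __name__ == "__main__":
    for group, settings in cgroup_plan(12, 32 * 1024**3).items():
        print(group, settings)
```

Keeping the two groups on disjoint cores and with separate memory limits is what prevents planning-class load from starving the control-class tasks.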
After the coarse-grained task allocation is completed, thread priorities need to be set according to the importance of the tasks in the system, so that the more important tasks are executed first and run stably. For the Autoware.Universe autonomous driving framework, normal execution of the control module is the baseline guarantee of the whole system, so a higher thread priority is set for the execution threads of the control category to ensure the stability of the system.
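One way to turn task importance into thread priorities is sketched below in Python. The importance scores, the priority range of 10 to 90, and the helper names are assumptions made for illustration; the description only requires that control-category threads receive a higher priority. `os.sched_setscheduler` with `SCHED_FIFO` is available on Linux and normally needs elevated privileges, so the helper reports failure instead of raising.

```python
import os


def assign_priorities(importance, low=10, high=90):
    """Map importance scores to priorities spread evenly over [low, high].

    importance: dict of category -> score (higher means more important).
    """
    levels = sorted(set(importance.values()))
    if len(levels) == 1:
        return {category: high for category in importance}
    step = (high - low) / (len(levels) - 1)
    return {category: low + round(levels.index(score) * step)
            for category, score in importance.items()}


def apply_fifo_priority(priority):
    """Try to switch the caller to SCHED_FIFO; return True on success."""
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except (AttributeError, OSError):
        # Not on Linux, or insufficient privileges.
        return False


if __name__ == "__main__":
    scores = {"control": 3, "planning": 2, "perception": 1, "sensor": 1}
    print(assign_priorities(scores))
```

With the sample scores, the control category receives the highest priority in the range, and categories with equal importance share the same priority.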
Following the above steps, the task allocation results for the Autoware.Universe services in the distributed computing system are shown in Table 2 below. This allocation scheme greatly reduces the computing delay of the system while ensuring that key tasks run reliably and with priority, which guarantees the real-time performance and stability of the system.
Table 2:
It should be noted that the above example is intended to illustrate a specific implementation of the method of the invention and should not be understood to mean that the method can be used only for autonomous driving. Rather, the method according to the invention may be applied in any other suitable scenario, for example not only in conventional ROS2-based distributed computing systems but also in edge applications such as autonomous driving and robotics.
In a second aspect of the present invention, there is also provided a distributed system task allocation system 200. Fig. 4 shows a schematic block diagram of an embodiment of a distributed system task allocation system 200 according to the present invention. As shown in fig. 4, the system includes:
a device layer module 210, the device layer module 210 being configured to evaluate the resource usage of each service according to test results for the devices and for each service of the distributed system tasks, so as to quantify the computing demand of each service;
a task layer module 220, the task layer module 220 being configured to cluster the ROS2 nodes in the services, divide the ROS2 nodes into a plurality of different categories according to their communication relationships, determine node correlation based on the resource usage and quantified computing demand of each service, and judge the resource intensity of the different categories;
a node layer module 230, the node layer module 230 configured to place related nodes in different categories in a ROS2 container and/or in the same device, thereby reducing inter-node communication latency;
a process layer module 240, the process layer module 240 configured to perform CPU core binding for resource intensive categories among the different categories and divide the resources of the device such that each portion of the resources is used only for planning tasks or control tasks;
a thread layer module 250, the thread layer module 250 being configured to assign thread priorities to the various classes of execution threads according to importance and real-time requirements of the distributed system tasks.
In a third aspect of the embodiment of the present invention, a computer readable storage medium is provided, and fig. 5 is a schematic diagram of a computer readable storage medium of a distributed system task allocation method according to an embodiment of the present invention. As shown in fig. 5, the computer-readable storage medium 300 stores computer program instructions 310, the computer program instructions 310 being executable by a processor. The computer program instructions 310, when executed, implement the method of any of the embodiments described above.
It should be appreciated that all of the embodiments, features and advantages set forth above for the distributed system task allocation method according to the present invention equally apply to the distributed system task allocation system and storage medium according to the present invention without conflict.
In a fourth aspect of the embodiments of the present invention, there is also provided a computer device 400 comprising a memory 420 and a processor 410, the memory having stored therein a computer program which, when executed by the processor, implements the method of any of the embodiments described above.
Fig. 6 shows a schematic hardware structure of an embodiment of a computer device for performing the distributed system task allocation method according to the present invention. Taking the computer device 400 shown in fig. 6 as an example, the computer device includes a processor 410 and a memory 420, and may further include an input device 430 and an output device 440. The processor 410, memory 420, input device 430, and output device 440 may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 6. The input device 430 may receive input numeric or character information and generate signal inputs related to distributed system task allocation. The output device 440 may include a display device such as a display screen.
The memory 420, as a non-volatile computer-readable storage medium, is used for storing non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the distributed system task allocation method in the embodiments of the present application. The memory 420 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and at least one application program required for a function, and the data storage area may store data created by use of the distributed system task allocation method, and the like. In addition, the memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 420 may optionally include memory located remotely from the processor 410, which may be connected to the local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor 410 executes various functional applications of the server and data processing, i.e., implements the methods of the method embodiments described above, by running non-volatile software programs, instructions, and modules stored in the memory 420.
Finally, it should be noted that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of example, and not limitation, RAM is available in a variety of forms such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP and/or any other such configuration.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The serial numbers of the foregoing embodiments of the present invention are for description only and do not represent the advantages or disadvantages of the embodiments.
Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the invention, and many other variations of the different aspects of the embodiments of the invention as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims (10)

1. A distributed system task allocation method, comprising the steps of:
Evaluating the resource use condition of each service according to the test condition of each service of the equipment and the distributed system task so as to quantify the calculation requirement of each service;
clustering each ROS2 node in the service, dividing the ROS2 nodes into a plurality of different categories according to the communication relation, determining the node correlation based on the resource use condition of each service and the quantized calculation demand, and judging the resource density degree of the different categories;
placing related nodes in different categories in a ROS2 container and/or in the same device, thereby reducing communication delay between nodes;
performing CPU core binding for resource intensive categories in different categories and dividing the resources of the equipment so that each part of the resources are only used for planning tasks or control tasks;
thread priorities are allocated to the execution threads of different categories according to the importance and real-time requirements of the distributed system tasks.
2. The method of claim 1, wherein evaluating the resource usage of each service based on the testing of each service for the device and distributed system tasks to quantify the computing requirements of each service further comprises:
unit testing of different tasks is carried out on the computing equipment layer of the distributed system, and the CPU utilization rate, the memory usage amount and the bandwidth use requirement of each task are defined according to the testing result.
3. The method of claim 1 or 2, wherein clustering the ROS2 nodes in the traffic, classifying the ROS2 nodes into a plurality of different classes according to the communication relationship, and determining the node relevance and determining the resource density of the different classes based on the resource usage and the quantified computational demand of the respective traffic further comprises:
based on different application tasks of the ROS2, a computational graph is formed, the computational graph describes communication and data flow among ROS2 nodes, the nodes are clustered aiming at the computational graph, the clustered types are the same as the number of the computational nodes of the distributed system, and the ROS2 nodes in different types after the clustering are distributed on different computing devices.
4. The method of claim 1 or 2, wherein said placing related nodes in different categories in a ROS2 container and/or in the same device, thereby reducing inter-node communication latency further comprises:
combining nodes with different node communication delays higher than a node aging threshold by using an ROS2 container, so that the inter-node communication is configured as in-process communication; and
for nodes with communication delays of different processes of different devices higher than the device aging threshold, different nodes are distributed to the same device, so that the communication among the nodes is configured as the communication among the nodes in the same device.
5. The method of claim 1 or 2, wherein the CPU core binding and partitioning the resources of the device for resource intensive categories of the different categories such that each portion of the resources is used only for planning tasks or control tasks further comprises:
according to the degree of resource intensity, binding a first resource-intensive category to a first subset of the CPU cores and binding a second resource-intensive category to a second subset of the CPU cores.
6. The method of claim 1 or 2, wherein the CPU core binding and partitioning the resources of the device for resource intensive categories of the different categories such that each portion of the resources is used only for planning tasks or control tasks further comprises:
partitioning the CPU and the memory of the device by means of a control group into two resource portions, wherein the first resource portion is configured for planning-class tasks and the second resource portion is configured for control-class tasks.
7. The method according to claim 1, wherein the method further comprises:
presetting execution time length for the execution thread;
monitoring the execution condition of the execution thread and acquiring a preset execution duration corresponding to the execution thread;
if confirmation that the execution thread has finished executing is not received by the end of the preset execution duration, sending timeout alarm information;
the timeout alarm information being used to prompt a user whether to terminate the execution thread; and
acquiring an abnormal alarm log sent by the distributed devices and computing the alarm probability of each statistical period from the abnormal alarm log;
and generating an alarm probability curve from the alarm probability of each statistical period, wherein the alarm probability of a statistical period is the ratio between the total number of abnormal alarms in that period and the duration of that period.
8. A distributed system task allocation system, comprising:
the equipment layer module is configured to evaluate the resource use condition of each service according to the test condition of each service of equipment and distributed system tasks so as to quantify the calculation requirements of each service;
the task layer module is configured to cluster each ROS2 node in the service, divide the ROS2 nodes into a plurality of different categories according to the communication relation, determine node correlation based on the resource use condition of each service and the quantized calculation demand, and judge the resource density degree of the different categories;
A node layer module configured to place related nodes in different categories in a ROS2 container and/or in the same device, thereby reducing inter-node communication latency;
the process layer module is configured to bind CPU cores aiming at resource-intensive categories in different categories and divide the resources of the equipment so that each part of the resources are only used for planning tasks or controlling tasks;
and the thread layer module is configured to allocate thread priorities for execution threads of different categories according to the importance and real-time requirements of the distributed system tasks.
9. A computer readable storage medium, storing computer program instructions which, when executed, implement a distributed system task allocation method according to any one of claims 1-7.
10. A computer device comprising a memory and a processor, wherein the memory has stored therein a computer program which, when executed by the processor, performs the distributed system task allocation method of any of claims 1-7.
CN202311270711.6A 2023-09-28 2023-09-28 Distributed system task allocation method, system, storage medium and equipment Pending CN117234733A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311270711.6A CN117234733A (en) 2023-09-28 2023-09-28 Distributed system task allocation method, system, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311270711.6A CN117234733A (en) 2023-09-28 2023-09-28 Distributed system task allocation method, system, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN117234733A true CN117234733A (en) 2023-12-15

Family

ID=89087728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311270711.6A Pending CN117234733A (en) 2023-09-28 2023-09-28 Distributed system task allocation method, system, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN117234733A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117472595A (en) * 2023-12-27 2024-01-30 苏州元脑智能科技有限公司 Resource allocation method, device, vehicle, electronic equipment and storage medium
CN117472595B (en) * 2023-12-27 2024-03-22 苏州元脑智能科技有限公司 Resource allocation method, device, vehicle, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107291545B (en) Task scheduling method and device for multiple users in computing cluster
CN108632365B (en) Service resource adjusting method, related device and equipment
CN111427681A (en) Real-time task matching scheduling system and method based on resource monitoring in edge computing
CN111625331B (en) Task scheduling method, device, platform, server and storage medium
CN111464659A (en) Node scheduling method, node pre-selection processing method, device, equipment and medium
US20160203026A1 (en) Processing a hybrid flow associated with a service class
CN117234733A (en) Distributed system task allocation method, system, storage medium and equipment
WO2024021489A1 (en) Task scheduling method and apparatus, and kubernetes scheduler
US20190280945A1 (en) Method and apparatus for determining primary scheduler from cloud computing system
US20220138012A1 (en) Computing Resource Scheduling Method, Scheduler, Internet of Things System, and Computer Readable Medium
CN115543577B (en) Covariate-based Kubernetes resource scheduling optimization method, storage medium and device
CN112015549B (en) Method and system for selectively preempting scheduling nodes based on server cluster
CN113168344A (en) Distributed resource management by improving cluster diversity
CN114625500A (en) Method and application for scheduling micro-service application based on topology perception in cloud environment
CN110196773B (en) Multi-time-scale security check system and method for unified scheduling computing resources
CN109614210B (en) Storm big data energy-saving scheduling method based on energy consumption perception
Shukla et al. Fault tolerance based load balancing approach for web resources in cloud environment.
Beltrán et al. How to balance the load on heterogeneous clusters
CN114844791B (en) Cloud service automatic management and distribution method and system based on big data and storage medium
CN114443293A (en) Deployment system and method for big data platform
CN114090201A (en) Resource scheduling method, device, equipment and storage medium
Liu et al. Scheduling tasks with Markov-chain based constraints
CN113391928B (en) Hardware resource allocation method and device, electronic equipment and storage medium
Guo et al. Reliable Scheduling Method for Sensitive Power Business Based on Deep Reinforcement Learning.
US11789773B2 (en) Computing device for handling tasks in a multi-core processor, and method for operating computing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination