CN112711448A - Agent technology-based parallel component assembling and performance optimizing method - Google Patents


Info

Publication number
CN112711448A
Authority
CN
China
Prior art keywords
component
agent
computing node
load
adaptive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011608335.3A
Other languages
Chinese (zh)
Inventor
彭云峰
石聪明
刘家磊
高国伟
刘海
汪加楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anyang Normal University
Original Assignee
Anyang Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anyang Normal University filed Critical Anyang Normal University
Priority to CN202011608335.3A
Publication of CN112711448A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/445 Program loading or initiating
    • G06F 9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G06F 9/44526 Plug-ins; Add-ons
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F 9/5022 Mechanisms to release resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G06F 9/5088 Techniques for rebalancing the load in a distributed system involving task migration

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The invention provides a parallel component assembling and performance optimizing method based on Agent technology, which comprises the following steps: Agents with different functions are generated by an Agent management system; the component connection Agent is responsible for bonding component interfaces and for data redistribution, and the component execution Agent and the resource management Agent cooperate to deploy components on computing nodes that meet their resource requirements; four component adaptive strategies are defined, and the different component adaptive Agents cooperate with the component execution Agents and resource management Agents to complete the adaptive process of the components according to the current state of the platform's computing resources, improving component performance. During load balancing, the resource management Agent, the load detection Agent and the component execution Agent cooperate to complete the load balancing work, improving the performance and throughput of the whole computing platform. The Agent-based method is flexible to use and offers performance advantages.

Description

Agent technology-based parallel component assembling and performance optimizing method
Technical Field
The invention relates to the technical field of computers, and in particular to a parallel component assembling and performance optimizing method based on Agent technology.
Background
The basic building blocks of a parallel component program are parallel components. A parallel component internally encapsulates program code for parallel computing and is generally deployed and executed on heterogeneous computing platforms. Under the support of a runtime framework, these components interact with each other through shared memory, inter-process communication or network communication. The components themselves may have different degrees of parallelism: they can execute in parallel as multiple processes, as multiple threads, or even in a mixed manner. The difference in component execution modes leads to two parallel component programming models, SCMD and MCMD, and parallel component programs tend to mix the two. Components are generally implemented in programming languages such as C, C++, Java and Fortran, together with parallelization directives such as MPI and OpenMP.
When building a component program, components written in different languages often need to call and interact with each other. Babel provides interoperability between the programming languages commonly used in high-performance computing, and by using Babel as an intermediary, parallel components written in different languages can interoperate. However, if the two interfaces involved in a call have different definitions, i.e. different interface names, input parameters or return values, they cannot be connected through Babel.
Traditional parallel software performance optimization usually relies on performance prediction, adaptation and load balancing; most commonly, the parallelism of a component is changed, or components on a heavily loaded computing node are migrated to a lightly loaded one, in response to changes in runtime resources. Existing parallel component program performance optimization methods merely solidify these traditional techniques inside the parallel component runtime framework: the framework is responsible for component connection and data interaction, and each framework defines the component interaction modes it can provide. Such framework-defined component interaction and performance optimization mechanisms have inherent limitations; they cannot be changed according to the characteristics and running conditions of a component program, and cannot flexibly optimize the performance of a parallel component program from the perspective of the components.
Agent technology is widely applied in distributed systems. An Agent can generally be regarded as a computing entity and may consist of software and hardware. In a distributed system, Agents are typically autonomous, interactive, reactive and proactive: an Agent can sense changes in the external environment and in its own internal state and react accordingly, Agents can interact and work cooperatively, and an Agent can actively deliver information to a specific target. Since most parallel component programs are distributed by nature, using Agent technology for parallel component connection and operation has natural advantages.
Disclosure of Invention
Aiming at the defects in the background art, the invention provides a parallel component assembling and performance optimizing method based on Agent technology, which solves the technical problems that existing performance optimization mechanisms have inherent limitations and that the performance of parallel component programs cannot be optimized flexibly.
The technical scheme of the invention is realized as follows:
a parallel component assembling and performance optimizing method based on Agent technology comprises the following steps:
Step one: the Agents with different functions are defined as different C++ classes, and the Agent management system generates each of them from its class; the Agents with different functions comprise the component connection Agent, the component execution Agent, the resource management Agent and the adaptive Agent (a minimal sketch of such classes is given after these steps);
Step two: the component connection Agent completes the construction of a component program by calling the Babel tool to bond different component interfaces and redistribute data between them; the Agent management system then destroys the component connection Agent and releases the resources it occupied;
Step three: the component execution Agent and the resource management Agent cooperate to deploy the component program on computing nodes that meet its requirements;
Step four: the adaptive Agent, the component execution Agent and the resource management Agent cooperate to adaptively deploy the parallel component program on computing nodes that meet its requirements;
Step five: the resource management Agent requests the Agent management system to generate a load detection Agent for it, and through the cooperation of the resource management Agent, the load detection Agent and the component execution Agent, component tasks on heavily loaded computing nodes are migrated to lightly loaded computing nodes for execution, improving the throughput and performance of the component programs.
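As an illustration of step one, the following is a minimal C++ sketch of how Agents with different functions could be defined as classes and instantiated on demand by an Agent management system; all class, method and member names here are illustrative assumptions, not identifiers taken from the patent.

#include <algorithm>
#include <memory>
#include <vector>

// Common base class: every Agent can perceive its environment and react to it.
class Agent {
public:
    virtual ~Agent() = default;
    virtual void perceive() = 0;   // sense the external environment and internal state
    virtual void react() = 0;      // respond to the perceived changes
};

// One derived class per Agent function; the other Agents (component execution,
// adaptive) would be defined analogously.
class ComponentConnectionAgent : public Agent {
public:
    void perceive() override {}    // probe the interfaces of the two components
    void react() override {}       // generate glue and data redistribution code
};

class ResourceManagementAgent : public Agent {
public:
    void perceive() override {}    // detect the resource conditions of its computing node
    void react() override {}       // answer resource queries, trigger load balancing
};

// The Agent management system creates Agents on demand and destroys them,
// releasing the resources they occupy, when their work is finished.
class AgentManagementSystem {
public:
    template <typename AgentType>
    std::shared_ptr<AgentType> create() {
        auto a = std::make_shared<AgentType>();
        agents_.push_back(a);
        return a;
    }
    void destroy(const std::shared_ptr<Agent>& a) {
        agents_.erase(std::remove(agents_.begin(), agents_.end(), a), agents_.end());
    }
private:
    std::vector<std::shared_ptr<Agent>> agents_;
};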
In the second step, the specific operation method is as follows:
in the interface bonding process, the component connection Agent of the calling component calls Babel to convert the interface of the calling component into SIDL form, and the component connection Agent of the called component calls Babel to convert the interface of the called component into SIDL form; when the interface of the calling component does not match the interface of the called component, the component connection Agent of the called component changes the SIDL interface name of the called component into the SIDL interface name of the calling component, so that the calling component recognizes the called component that matches it;
for the interface parameters, when the interface parameters of the calling component are passed to the called component, the component connection Agent of the calling component uses glue code to convert them into the number and types of the interface parameters of the called component; when the numbers of interface parameters of the calling component and of the called component are not equal, the surplus parameters are assigned NULL; at the end of the called component's glue code, the component connection Agent of the called component converts the return value type of the called component into the return value type of the calling component's interface;
when a component program is built with the calling component running on M processes and the called component running on N processes, M not equal to N, the component connection Agent of the calling component collects the running results of the M processes of the calling component, and the component connection Agent of the called component distributes them to the N processes of the called component, thereby realizing M×N data redistribution; after the glue code and the data redistribution code have been generated, the Agent management system destroys the component connection Agents of the calling component and of the called component and releases the resources occupied by these two component connection Agents.
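As an illustration of the glue code described above, the following C++ sketch adapts a call through the calling component's interface to a called component whose interface name, parameter list and return type differ. The function and parameter names are hypothetical; only the conversion rules (surplus parameters assigned NULL, return type converted back for the caller) follow the description above.

#include <cstddef>

// Hypothetical interface of the called component: different name, one extra
// parameter, and a different return type (double instead of int).
double processData(bool ready, char mode, void* extra) {
    (void)extra;                                   // stub standing in for the real component
    return ready ? static_cast<double>(mode) : 0.0;
}

// Glue generated for the calling component's interface int compute(bool, char):
// forward the call, assign NULL to the surplus parameter, and convert the
// return value back to the type the caller expects.
int compute(bool isready, char optobject) {
    void* extra = NULL;                            // surplus parameter assigned NULL
    double result = processData(isready, optobject, extra);
    return static_cast<int>(result);               // return value type converted for the caller
}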
In the third step, the specific operation method is as follows:
generating a resource management Agent for each computing node by using an Agent management system;
when resources are allocated to a component, the component execution Agent sends the resource requirements of the component program to be deployed to the resource management Agent of the nearest computing node;
when the computing node of the nearest resource management Agent can meet the resource requirements of the component program to be deployed, the information of that computing node is written into the message and the message is returned to the component execution Agent; when the resources of the nearest computing node cannot meet the resource requirements of the component program to be deployed, the information of that computing node is written into the message and the message is forwarded to the next resource management Agent; when the computing node managed by the next resource management Agent cannot meet the resource requirements either, its information is also written into the message and the message is forwarded again, until the sum of the resources of the registered computing nodes meets the resource requirements of the component program to be deployed;
the last computing node to register its information returns the message to the component execution Agent of the component program to be deployed, and the component execution Agent deploys the component program onto the registered computing nodes that meet the requirements and starts it running.
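A sketch, with assumed data structures, of how the resource query message described in this step could accumulate node registrations as it is forwarded from one resource management Agent to the next, until the registered resources cover the requirement; all names are illustrative.

#include <string>
#include <vector>

struct ResourceRequirement { int cpu_cores; long memory_mb; };
struct NodeInfo { std::string node_id; int free_cpu_cores; long free_memory_mb; };

// The query message carries the requirement plus the nodes registered so far.
struct ResourceQuery {
    ResourceRequirement need;
    std::vector<NodeInfo> registered;

    bool satisfied() const {
        int cores = 0; long mem = 0;
        for (const NodeInfo& n : registered) { cores += n.free_cpu_cores; mem += n.free_memory_mb; }
        return cores >= need.cpu_cores && mem >= need.memory_mb;
    }
};

enum class Action { ReplyToExecutionAgent, ForwardToNextNode };

// Behaviour of one resource management Agent when the query arrives: register
// this node's free resources in the message, then either return the message to
// the component execution Agent or forward it to the next resource management Agent.
Action handleQuery(ResourceQuery& query, const NodeInfo& this_node) {
    query.registered.push_back(this_node);
    return query.satisfied() ? Action::ReplyToExecutionAgent : Action::ForwardToNextNode;
}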
The adaptive Agents comprise the parallelism adaptive Agent, the data partitioning adaptive Agent, the component migration adaptive Agent and the implementation-change adaptive Agent.
The method for adaptively deploying the parallel component on computing nodes that meet the requirements through the cooperation of the parallelism adaptive Agent, the component execution Agent and the resource management Agent comprises:
before running a component, the component program builder sets a load threshold, a time threshold and a number n of computing nodes, and writes them into the parallelism adaptive Agent;
when the component is initially distributed to m CPU cores to run, the parallelism adaptive Agent times the running of the component and at the same time sends a load query message and the component program to the component execution Agent of an adjacent computing node;
the load condition of the adjacent computing node is checked; if it is above the load threshold, the load query message is forwarded to the next computing node, otherwise the resource information of the adjacent computing node is written into the load query message and the message continues to be forwarded to the next computing node, until the total number of CPU cores of the computing nodes below the load threshold recorded in the load query message meets the requirement of the parallelism adaptive Agent; once the running time of the component exceeds the time threshold and the number of available CPU cores meets the requirement of the parallelism adaptive Agent, the parallelism adaptive Agent makes an adaptive decision, suspends the execution of the component, repartitions the unexecuted data, and distributes the component over m + n CPU cores for parallel execution.
The method for adaptively deploying the parallel component program on computing nodes that meet the requirements through the cooperation of the data partitioning adaptive Agent, the component execution Agent and the resource management Agent comprises:
whenever a computing node finishes its assigned task, the data partitioning adaptive Agent checks the ratio of the completed data volume to the total data volume;
when the ratio is smaller than 1/10, the data re-partitioning adaptive strategy is triggered: the data partitioning adaptive Agent suspends the parallel execution of the component program, collects the unfinished data tasks, and partitions and executes them again;
when the ratio is greater than 1/10, the data partitioning adaptive function is switched off for this run of the component program, and no further data-volume checks are performed for the remainder of the run.
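A minimal sketch of the 1/10 decision rule applied by the data partitioning adaptive Agent when a computing node finishes its assigned task; the counters and names are assumptions.

// Decision of the data partitioning adaptive Agent: if less than 1/10 of the
// total data volume has been completed, suspend the run and re-partition the
// unfinished data; otherwise switch data-partitioning adaptation off for the
// remainder of the run.
struct DataProgress { long completed; long total; };

enum class PartitionDecision { Repartition, DisableAdaptation };

PartitionDecision onNodeFinished(const DataProgress& progress) {
    const double ratio = static_cast<double>(progress.completed)
                       / static_cast<double>(progress.total);
    return (ratio < 1.0 / 10.0) ? PartitionDecision::Repartition
                                : PartitionDecision::DisableAdaptation;
}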
The method for adaptively deploying the parallel component program on computing nodes that meet the requirements through the cooperation of the component migration adaptive Agent, the component execution Agent and the resource management Agent comprises:
before component scheduling, the Agent management system allocates a component migration adaptive Agent to the component; a stable computing node is selected as a backup computing node, and the component migration adaptive Agent is called to send a copy of the component to the backup computing node, the copy including copies of the component execution Agent and of the component migration adaptive Agent; during operation, if the computing node fails, the intermediate result is retrieved, the component instance on the backup computing node is started, and the component task is carried on to completion.
The method for adaptively deploying the parallel component program on computing nodes that meet the requirements through the cooperation of the implementation-change adaptive Agent, the component execution Agent and the resource management Agent comprises:
during the running of the component, the implementation-change adaptive Agent actively probes the hardware configuration of each computing node; if a lightly loaded computing node containing a GPU is found, the implementation-change adaptive Agent suspends the execution of the component, saves the intermediate result, and then calls the component execution Agent to deploy the specialized version of the component onto that lightly loaded computing node to continue running.
In the fifth step, the specific operation method is as follows:
the load balancing mechanism of the Agent management system is started, and the builder of the component program defines a load detection period and a load upper limit;
the ratio of the load average reported by the top command to the number of CPU cores is taken as the load condition of the computing node on which the component program is located;
the resource management Agent on each computing node periodically detects the load condition of its computing node; if, during detection, the load of the computing node is found to be greater than the load upper limit, the resource management Agent notifies the component execution Agents of all component programs on the computing node, the execution of the component programs is suspended, and the intermediate results are stored in the component execution Agents;
the resource management Agent requests the Agent management system to generate a load detection Agent for it; the load detection Agent moves autonomously through the computing platform, collects the load information of each computing node, then returns to the source computing node that initiated the load detection and updates the load information stored on it; after the load detection Agent has returned the load information to the source computing node, it is deleted by the Agent management system and the resources it occupied are released.
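A sketch of the periodic load check performed by each resource management Agent, assuming the load is read as the 1-minute load average (as reported by top, or from /proc/loadavg on Linux) divided by the number of CPU cores; the helper names and the callback are illustrative, and the upper limit is the threshold defined by the component program builder (0.7 is mentioned later as a typical value).

#include <chrono>
#include <fstream>
#include <thread>

// Node load = 1-minute load average / number of CPU cores.
double nodeLoad() {
    double one_minute = 0.0;
    std::ifstream f("/proc/loadavg");
    f >> one_minute;
    const unsigned cores = std::thread::hardware_concurrency();
    return cores ? one_minute / cores : one_minute;
}

// Periodic detection loop: when the load exceeds the upper limit, notify the
// component execution Agents so that they suspend their components and store
// the intermediate results.
void detectLoadPeriodically(double upper_limit, std::chrono::seconds period,
                            void (*notifyExecutionAgents)()) {
    for (;;) {
        if (nodeLoad() > upper_limit)
            notifyExecutionAgents();
        std::this_thread::sleep_for(period);
    }
}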
Compared with the prior art, the invention has the following beneficial effects:
1) A number of different Agents are used in the building and running stages of the component program to assist the connection and running of the components and to improve the running performance of the components.
2) In the component program building stage, the component connection Agent bonds parallel components with unmatched interfaces and also supports M×N data redistribution.
3) When a component is first scheduled to run, computing nodes that meet its running requirements are found through the resource management Agents, and the component execution Agent is responsible for deploying the component onto those computing resources.
4) In the running stage of the components, four component adaptive strategies are defined; the different component adaptive Agents, component execution Agents and resource management Agents cooperate with one another to complete the adaptive process of the components according to the current state of the platform's computing resources, improving component performance.
5) During load balancing, the resource management Agents distributed on the computing nodes, the component execution Agents of the components to be migrated and the autonomously acting load detection Agent cooperate to complete the whole load balancing task, improving the performance and throughput of the whole computing platform.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flow chart of the interface bonding and data redistribution of the present invention.
FIG. 2 is a schematic illustration of the deployment process of the components of the present invention.
FIG. 3 is a flow chart of the parallelism adaptive Agent of the present invention.
FIG. 4 shows the results of testing the Background component of the present invention.
FIG. 5 shows the results of experiments performed on the VERinter component according to the present invention.
FIG. 6 shows the performance comparison of Topographic components based on ICENI and the method of the present invention for implementing the adaptive mechanism at different input data scales.
Fig. 7 shows the results of a comparative load balancing experiment for different numbers of component programs according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
The embodiment of the invention provides a parallel component assembling and performance optimizing method based on Agent technology, which uses a number of different Agents in the building and running stages of a component program to assist the connection and running of the components and to improve their running performance. In the component program building stage, the component connection Agent bonds parallel components with unmatched interfaces and also supports M×N data redistribution. When a component is first scheduled to run, computing nodes that meet its running requirements are found through the resource management Agent, and the component execution Agent is responsible for deploying the component onto the computing resources. In the running stage of a component, a component adaptive Agent performs component adaptation according to the resource conditions of the running platform, improving the performance of the component. During load balancing, the load detection Agent collects the load conditions on each computing node, and the component execution Agents complete the load migration process. All of the above Agents are generated and distributed on demand by the Agent management system. Each type of Agent has been defined, according to its function, as a C++ class; when an Agent needs to be generated, the Agent management system creates an instance of the corresponding Agent class. The method comprises the following specific steps:
Step one: the Agents with different functions are defined as different C++ classes, and the Agent management system generates each of them from its class; the Agents with different functions comprise the component connection Agent, the component execution Agent, the resource management Agent and the adaptive Agent;
Step two: the component connection Agent completes the construction of a component program by calling the Babel tool to bond different component interfaces and redistribute data between them; the Agent management system then destroys the component connection Agent and releases the resources it occupied;
when a component program builder connects components, it specifies which components need to be connected two by two through which interfaces and the degree of parallelism with which each component initially runs. For two components which need to be combined and connected, an Agent management system is firstly required to allocate one component connection Agent for each of the two components. After the member connecting Agent is generated, the member connecting Agent can actively detect the interface conditions of two members and judge whether the member connecting Agent can be bonded by using a specific bonding code according to the name and the parameter of the interface. The Babel multi-language interoperation mechanism is used for converting two component interfaces which are connected with each other into an SIDL form, so that the matching connection of the interfaces is realized. However, if the interface names, input parameters, or return result types of the two interfaces are different, they cannot be matched even if they are converted into the SIDL form.
Based on Babel, as shown in FIG. 1, the present invention provides a bonding mechanism for interfaces with different interface names, input parameters and return value types. The precondition of such bonding is that, when connecting the components, the builder of the component program has already determined that the two components have a calling relationship with each other, so that once the interfaces are matched they can certainly run connected. The specific operation method is as follows:
In the interface bonding process, the component connection Agent of the calling component calls Babel to convert the interface of the calling component into SIDL form, and the component connection Agent of the called component calls Babel to convert the interface of the called component into SIDL form; when the interface of the calling component does not match the interface of the called component, the component connection Agent of the called component changes the SIDL interface name of the called component into the SIDL interface name of the calling component, so that the calling component recognizes the called component that matches it. For the interface parameters, when the interface parameters of the calling component are passed to the called component, the component connection Agent of the calling component uses glue code to convert them into the number and types of the interface parameters of the called component, so that the called component can use them directly. When the numbers of interface parameters of the calling component and of the called component are not equal, the surplus parameters are assigned NULL; finally, the component connection Agent of the called component converts the return value type of the called component, through the glue code, into the return value type of the calling component's interface, so that the result is returned correctly to the calling component.
Meanwhile, when the component program is built, different component connection Agents cooperate with each other to realize M×N data redistribution. When the component program is built with the calling component running on M processes and the called component running on N processes, M not equal to N, a redistribution of data between the two components is required. The component connection Agent of the calling component collects the running results of the M processes of the calling component, and the component connection Agent of the called component distributes them to the N processes of the called component, thereby realizing M×N data redistribution. FIG. 1 shows the working process of the component connection Agents. Assume that component A, which initiates the call, is implemented with C + MPI programming. During the execution of A's code, a call to component B is initiated; the interface name of the call is comp, and the input parameters are isready of Boolean type and optobject of character type. At this point, the component connection Agent of A sends a query to the component connection Agent of B about the interface B provides. The called component B is implemented with FORTRAN + MPI programming, and the interface it provides differs from A's in name, parameters and return value type. The component connection Agent of A therefore notifies the component connection Agent of B and starts the bonding process; finally the two components are connected with matched interfaces, and data collection and distribution operations are added into the connection, realizing the M×N data redistribution. The interface bonding of the two components and the generation of the data redistribution code are completed in the component program building stage, and these codes are executed automatically when the components call each other in the running stage. After the glue code and the data redistribution code have been generated, the Agent management system destroys the component connection Agents of the calling component and of the called component and releases the resources occupied by these two component connection Agents.
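The following is a hedged sketch of the kind of data collection and distribution code the component connection Agents could generate for the M×N redistribution described above, using the MPI collectives that also appear in the experiments later (MPI_Gather on the calling component's side, MPI_Scatter on the called component's side). The communicator bridging between the two components and the buffer sizing are simplified assumptions.

#include <cstddef>
#include <mpi.h>
#include <vector>

// Calling component side: collect the partial results of its M processes onto
// the root rank of the caller's communicator (each process is assumed to
// contribute the same number of elements).
void collectFromCaller(const std::vector<double>& part, std::vector<double>& full,
                       MPI_Comm caller_comm) {
    int m = 0, rank = 0;
    MPI_Comm_size(caller_comm, &m);
    MPI_Comm_rank(caller_comm, &rank);
    if (rank == 0) full.resize(part.size() * static_cast<std::size_t>(m));
    MPI_Gather(part.data(), static_cast<int>(part.size()), MPI_DOUBLE,
               full.data(), static_cast<int>(part.size()), MPI_DOUBLE, 0, caller_comm);
}

// Called component side: distribute the collected buffer over its N processes;
// 'chunk' is the number of elements each callee process receives, and 'full'
// only needs to hold data on the root rank of callee_comm.
void distributeToCallee(const std::vector<double>& full, std::vector<double>& part,
                        int chunk, MPI_Comm callee_comm) {
    part.resize(static_cast<std::size_t>(chunk));
    MPI_Scatter(full.data(), chunk, MPI_DOUBLE,
                part.data(), chunk, MPI_DOUBLE, 0, callee_comm);
}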
Step three: the component execution Agent and the resource management Agent cooperate with each other to deploy a component program on a computing node meeting the requirement;
once a component program is assembled, each component thereof is assigned a component execution Agent by the Agent management system. The component program will then be deployed to run on the computing platform. When a component program is executed, its constituent units, i.e., the components constituting the program, may have data dependency and other dependencies. Only those dependent components that have completed execution, have the dependency data available, and the components whose dependencies have resolved satisfy the conditions under which they were executed. The constructor of the component program needs to give the dependency relationship among the components, and then writes the information of other components depended by the current component into a dependency relationship table in the component execution Agent in the generation process of the component execution Agent. When one component finishes running, the component execution Agent of the component sends a message to the component execution agents of other components to declare that the component has finished running. This message will be passed between the various component execution agents, each Agent receiving the message checking whether this finished component is relied upon by the component it manages. If a dependency exists, the Agent that receives the message will delete the information of the finished component from its own dependency table. When all dependencies of a component are resolved, its component execution Agent will begin to find appropriate computing resources for it to run for deployment.
The classical CCA parallel component mode of operation keeps a copy of every component on every compute node. If component A is to be deployed to run on compute node B, the component builder must write this deployment information into the rc startup file; component A then starts its instance code only on compute node B, and the copies of A on the other compute nodes are not started. This makes the deployment of components relatively simple, but when a large number of component programs need to be deployed and run, keeping a copy of all component programs on all computing nodes consumes a large amount of system resources. In order to deploy the components better, a multi-core computing node with better performance in the computing platform (16 cores, 32 GB memory) is selected as the root computing node. All component programs are loaded onto the root computing node before their first run, and the component connection work (including the interface bonding and the generation of the data redistribution code explained above) is completed on the root computing node. After the component connection is completed, each component is assigned a component execution Agent, and once all dependencies of a component are resolved, its component execution Agent starts the deployment and running of this component.
When the component execution Agent deploys and runs a component, computing resources are allocated to the component according to the resource requirements predefined by the component builder. The computing resources mainly comprise the number of CPU cores of the computing nodes, the memory size and the network bandwidth. To manage the computing resources on the computing nodes better, the Agent management system is used to generate a resource management Agent for each computing node; the resource management Agent can autonomously detect the various resource conditions of its computing node.
When resources are allocated to a component, the component execution Agent sends the resource requirements of the component program to be deployed to the resource management Agent of the nearest computing node. When the nearest computing node can meet the resource requirements of the component program to be deployed, its information is written into the message and the message is returned to the component execution Agent; when it cannot, its information is written into the message and the message is forwarded to the resource management Agent of the next computing node; when the next computing node cannot meet the requirements either, its information is also written into the message and the message is forwarded again, until the sum of the resources of the registered computing nodes meets the resource requirements of the component program to be deployed. The last computing node to register its information returns the message to the component execution Agent of the component program to be deployed, and the component execution Agent deploys the component program onto the registered computing nodes that meet the requirements and starts it running. Compared with existing computing resource management methods, this way of passing messages among the Agents to search for suitable computing resources reduces the burden on the computing node where the scheduled component is located and speeds up the search for computing resources that meet the conditions. After a component is deployed, its component execution Agent and the component it manages correspond one to one. For example, if component A runs in parallel as multiple MPI processes, then one instance of A runs in each MPI process, and each instance of A has a corresponding instance of A's component execution Agent on the compute node where that MPI process resides.
Since the component execution Agents operate independently of one another, several components may be deployed onto the same computing node. To prevent some computing nodes from being overburdened and to reduce resource contention among components, the component builder is required to assign a scheduling priority to each component program. The scheduling priority of a component program has three levels: low, medium and high. The scheduling priority of a component is initialized to the scheduling priority of the component program to which it belongs and is stored in its component execution Agent. The resource management Agent of each compute node maintains a queue of component tasks waiting to be executed, which stores the components already deployed to that computing node. When the computing node has free CPU processor resources, a task is taken from the queue and executed. If the number of tasks in the queue of a computing node exceeds 3 times the number of CPU cores of that node, no new components are deployed onto it. The component tasks in the queue are ordered by priority; among components of the same priority, the one with the shorter expected running time is scheduled first, to improve the throughput of the system. The input data sizes of a component and the corresponding historical running times are stored in a component information base. After the component program is built, the historical running time corresponding to the actual input data size is looked up in the information base and stored in the component execution Agent as the expected execution time; if there is no corresponding historical data, the component builder gives an empirical estimate of the running time. When a component enters the waiting queue, its component execution Agent starts a timer: if a low-priority component has not acquired computing resources and been executed after 12 hours, its priority becomes medium, and if a medium-priority component has not acquired computing resources and been executed after 12 hours, its priority becomes high. This way of dynamically changing priority prevents certain low-priority components from remaining unexecuted for long periods. FIG. 2 gives a schematic view of the deployment process of the components.
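A sketch of the per-node waiting queue just described: components are ordered by scheduling priority, ties are broken by shorter expected running time, and a component that has waited 12 hours without resources is promoted by one level; the data structures are illustrative.

#include <algorithm>
#include <chrono>
#include <string>
#include <vector>

enum class Priority { Low = 0, Medium = 1, High = 2 };

struct PendingComponent {
    std::string name;
    Priority priority;
    double expected_runtime_s;                       // looked up in the component information base
    std::chrono::steady_clock::time_point enqueued;  // set when the component enters the queue
};

// Promote components that have waited more than 12 hours without being executed.
void agePriorities(std::vector<PendingComponent>& queue) {
    const auto now = std::chrono::steady_clock::now();
    for (PendingComponent& c : queue) {
        if (c.priority != Priority::High && now - c.enqueued >= std::chrono::hours(12)) {
            c.priority = static_cast<Priority>(static_cast<int>(c.priority) + 1);
            c.enqueued = now;                        // restart the 12-hour timer at the new level
        }
    }
}

// Choose the next component: highest priority first, shorter expected running
// time on ties (returns queue.end() if the queue is empty).
std::vector<PendingComponent>::iterator nextToRun(std::vector<PendingComponent>& queue) {
    return std::min_element(queue.begin(), queue.end(),
        [](const PendingComponent& a, const PendingComponent& b) {
            if (a.priority != b.priority) return a.priority > b.priority;
            return a.expected_runtime_s < b.expected_runtime_s;
        });
}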
Step four: four adaptive Agents are designed, and the parallel component program is adaptively deployed on computing nodes that meet the requirements through the cooperation of these four adaptive Agents, the component execution Agents and the resource management Agents;
adaptive strategies for building blocks include dynamically changing parallelism of building blocks, changing data partitioning, building block migration, and changing implementation. Accordingly, the adaptive agents include a parallelism adaptive Agent, a data partitioning adaptive Agent, a component migration adaptive Agent, and a change realization adaptive Agent. When selecting a building block building program, a building block builder specifies whether a building block needs to incorporate an adaptive strategy. If necessary, the Agent management system generates adaptive agents for the component according to different strategies, and the adaptive agents and the component are deployed on the same computing node when the component is deployed. An adaptation Agent contains two parts of content, namely an event for triggering adaptation and a response of the Agent to the event.
The method for adaptively deploying the parallel component program on computing nodes that meet the requirements through the cooperation of the parallelism adaptive Agent, the component execution Agent and the resource management Agent comprises the following steps:
Before running a component, the component program builder sets a load threshold, a time threshold and a number n of computing nodes, and writes them into the parallelism adaptive Agent. In general, the event that triggers a change in the parallelism of a component is the combination of a long running time and a large amount of available resources (mainly CPU cores). When the component program is initially distributed to m CPU cores to run, the parallelism adaptive Agent times the running of the component and at the same time sends a load query message to the component execution Agent of an adjacent computing node; the load query message contains the information of the computing nodes on which the component is deployed. The resource management Agent of that adjacent computing node checks its load condition; if it is above the load threshold, the load query message is forwarded to the next computing node, otherwise the resource information of the adjacent computing node is written into the load query message, and the message continues to be forwarded to the next computing node, until the total number of CPU cores of the computing nodes below the load threshold recorded in the load query message meets the requirement of the parallelism adaptive Agent. To speed up the parallelism adaptation, each newly added computing node, after writing its own information into the query message, selects a nearby computing node from the information about the deployed component's computing nodes carried in the message and sends the message to the component execution Agent on that node, requesting it to send a copy of the component to the new computing node. On receiving the message, that component execution Agent responds immediately and sends the newly added computing node a copy of the component together with copies of the component execution Agent and the parallelism adaptive Agent. Once the parallelism adaptive Agent finds that the running time of the component program exceeds the time threshold and that the number of available CPU cores meets its requirement, it makes the adaptive decision: the execution of the component program is suspended, the unexecuted data is redistributed, and the component program is distributed over m + n CPU cores for parallel execution. If the component runs in parallel as multiple MPI processes, the change of parallelism is achieved by the MPI_Comm_spawn operation; if the component runs in parallel using OpenMP or similar shared memory, the change of parallelism is implemented by a fork operation. The working process of the parallelism adaptive Agent is shown in FIG. 3.
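A sketch of how the parallelism change could be realised with MPI_Comm_spawn, as mentioned above: the running component spawns n additional worker processes and obtains an inter-communicator to them. The executable name and the absence of extra arguments are assumptions.

#include <mpi.h>

// Spawn n additional instances of the component executable and return the
// inter-communicator linking the original m processes to the new ones.
MPI_Comm growParallelism(const char* component_executable, int n, MPI_Comm current_comm) {
    MPI_Comm intercomm;
    MPI_Comm_spawn(component_executable,
                   MPI_ARGV_NULL,        // no extra command-line arguments
                   n,                    // number of processes to add
                   MPI_INFO_NULL,
                   0,                    // root rank that performs the spawn
                   current_comm,         // communicator of the m running processes
                   &intercomm,
                   MPI_ERRCODES_IGNORE);
    return intercomm;
}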
The method for adaptively deploying the parallel component program on computing nodes that meet the requirements through the cooperation of the data partitioning adaptive Agent, the component execution Agent and the resource management Agent comprises the following steps:
The trigger for the data re-partitioning adaptive strategy is usually the uneven processing speed of the computing nodes executing in parallel: fast, lightly loaded computing nodes tend to complete their assigned data blocks earlier than the others, which at that point are still far from finishing their tasks. Whenever a computing node finishes its assigned task, the data partitioning adaptive Agent checks the ratio of the completed data volume to the total data volume. When the ratio is smaller than 1/10, the data re-partitioning adaptive strategy is triggered: the data partitioning adaptive Agent suspends the parallel execution of the component program, collects the unfinished data tasks, and partitions and executes them again. When the ratio is greater than 1/10, the data partitioning adaptive function is switched off for this run of the component program, and no further data-volume checks are performed for the remainder of the run.
The method for adaptively deploying the parallel component program on computing nodes that meet the requirements through the cooperation of the component migration adaptive Agent, the component execution Agent and the resource management Agent comprises the following steps:
Component migration is usually caused by a failure of the computing node on which the component is running. In this case, if the intermediate results have not been saved in advance, the component has to be deployed to another computing node and executed again from the start. If the performance of the whole computing platform is unstable and computing node failures occur frequently, the Agent management system allocates a component migration adaptive Agent to the component before component scheduling; this Agent contains code that periodically saves the intermediate results to an external file. At the same time, a stable computing node is selected as a backup computing node, and the component migration adaptive Agent is called to send a copy of the component to the backup computing node, the copy including copies of the component execution Agent and of the component migration adaptive Agent. During operation, if the computing node fails, the intermediate results are retrieved, the component instance on the backup computing node is started, and the component task is carried on to completion.
The method for adaptively deploying the parallel component program on computing nodes that meet the requirements through the cooperation of the implementation-change adaptive Agent, the component execution Agent and the resource management Agent comprises the following steps:
The precondition for changing the implementation at run time is that component versions with the same function but different implementations exist. For example, the same mathematical operation may have both a general version suitable for running on an ordinary server and a specialized version that can be accelerated on a GPU. During the running of the component program, the implementation-change adaptive Agent actively probes the hardware configuration of each computing node; if a lightly loaded computing node containing a GPU is found, the implementation-change adaptive Agent suspends the execution of the component, saves the intermediate result, and then calls the component execution Agent to deploy the specialized version of the component onto that lightly loaded computing node to continue running.
Under the control of the adaptive Agents and the resource management Agents on the different computing nodes, the four parallel component adaptive processes are completed. Compared with existing centralized scheduling methods, this distributed adaptive method reduces the burden on the computing node that controls the run. At the same time, completing the adaptive process through the parallel execution of different computing nodes performs better than other existing methods.
Step five: the resource management Agent requests the Agent management system to generate a load detection Agent for it, and through the cooperation of the resource management Agent, the load detection Agent and the component execution Agent, component tasks on heavily loaded computing nodes are migrated to lightly loaded computing nodes for execution, improving the throughput and performance of the whole computing platform.
The load balancing mechanism of the Agent management system is started, and the builder of the component program (a user of the computing platform) defines two parameters: the load detection period and the load upper limit. The ratio of the load average reported by the top command to the number of CPU cores is taken as the load condition of the computing node on which the component program is located. The load upper limit can be customized, but 0.7 is typically used as the threshold. The resource management Agent on each computing node periodically detects the load condition of its computing node; if, during detection, the load of the computing node is found to be greater than the load upper limit, the resource management Agent notifies the component execution Agents of all component programs on the computing node, the execution of the component programs is suspended, and the intermediate results are stored in the component execution Agents. The resource management Agent then requests the Agent management system to generate a load detection Agent for it; the load detection Agent moves autonomously through the computing platform, collects the load information of each computing node, then returns to the source computing node that initiated the load detection and updates the load information stored on it. This way of using an Agent to collect the load information of the different computing nodes on the platform greatly reduces the burden on the source computing node and has higher performance. In general, three resource utilization indexes of a computing node are considered: CPU utilization, memory utilization and network bandwidth. Assuming the platform has n computing nodes, the load detection Agent returns these three index data for each computing node. Table 1 shows the information returned by the load detection Agent for one particular run.
TABLE 1 information returned by the load detection Agents
(Table 1 is reproduced only as an image in the original publication.)
As can be seen from Table 1, the load detection passed through 5 computing nodes in total, and the table records the ID, IP address and the three index data of each of the 5 computing nodes. Since the three indexes are collected over a period of time and are based on different hardware, using them directly to evaluate the load condition of each computing node would not be very accurate. Assuming that the three indexes have equal weight in their influence on the load of a computing node, they are next normalized as follows:
suppose that a certain load probe returns information for n compute nodes, LiCPU represents the original CPU utilization of the ith compute node, PiThe CPU represents the CPU utilization after the i-th compute node is normalized. Xcpu represents the maximum value of CPU utilization among the n compute nodes. Mcpu represents the minimum value of CPU utilization among the n compute nodes. The formula for normalizing the CPU utilization is as follows:
Picpu=(Licpu-Mcpu)/(Xcpu-Mcpu) (1)
Limem represents the original memory utilization, P, of the ith compute nodeiAnd mem represents the memory utilization rate after the i-th computing node is normalized. Xmem represents the maximum value of memory utilization among the n compute nodes. Mmem represents the minimum value of memory utilization among the n compute nodes. The formula for normalizing the memory utilization is as follows:
Pimem=(Limem-Mmem)/(Xmem-Mmem) (2)
Linet represents the original network bandwidth, P, of the ith computing nodeinet represents the normalized network bandwidth of the ith computing node. Xnet represents the maximum value of the network bandwidth among the n computing nodes. Mnet represents the minimum value of the network bandwidth among the n computing nodes. The formula for normalizing the network bandwidth is as follows:
Pinet=(Linet-Mnet)/(Xnet-Mnet) (3)
according to these 3 formulas, the normalized resource load index is used to replace the original data, and table 2 can be obtained.
Table 2 load information normalization
(Table 2 is reproduced only as an image in the original publication.)
On this basis, the comprehensive load index Li of the i-th computing node is defined as the sum of its three normalized resource load indexes, namely:
Li = Pi_cpu + Pi_mem + Pi_net    (4)
The larger the value of Li, the heavier the load of the i-th compute node. From Table 2, L1 to L5 can be obtained, i.e. the final index data for evaluating the load condition of these 5 computing nodes. As shown in Table 3, the most heavily loaded is computing node 3, with a comprehensive load index of 2.5; the most lightly loaded is computing node 4, with a comprehensive load index of 0.19. At this point, the component task on the source computing node can be migrated, under the control of the component execution Agent, to computing node 4, the computing node with the smallest load.
TABLE 3 load integration index
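For illustration, formula (4) and the selection of the migration target can be sketched in C++ as follows; the struct and function names and the sample values are assumptions, and the index returned is simply the position in the probed list.

// Sum the three normalized metrics per node (formula (4)) and pick the node
// with the smallest comprehensive load index as the migration target.
#include <cstddef>
#include <iostream>
#include <vector>

struct NodeLoad { double cpu, mem, net; };            // normalized P_i values for one node

std::size_t lightestNode(const std::vector<NodeLoad>& nodes) {
    std::size_t best = 0;
    double bestL = 1e300;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        double L = nodes[i].cpu + nodes[i].mem + nodes[i].net;   // formula (4)
        if (L < bestL) { bestL = L; best = i; }
    }
    return best;
}

int main() {
    std::vector<NodeLoad> probed = {{0.6, 0.4, 0.3}, {0.9, 0.8, 0.8}, {0.1, 0.05, 0.04}};
    std::cout << "migrate to node index " << lightestNode(probed) << '\n';  // prints 2
    return 0;
}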
In this process, the load detection Agent collects the load information, the resource management Agent is responsible for calculating and comparing the load data, and the component execution Agent is responsible for the actual migration of component tasks. To further streamline the process, after the load detection Agent returns the load information to the source computing node, it is deleted by the Agent management system and the resources it occupies are released, which saves the resources of the Agent management system.
Simulation experiment
To verify the Agent-based parallel component program assembly and performance optimization method provided by the invention, relevant experiments were carried out. The experimental platform is a heterogeneous computer cluster comprising 32 SMP servers (CPU Intel J3060, 8 GB memory), 2 servers with GPUs (GPU Tesla K80, 24 GB memory), 1 multi-core server with 8 cores (CPU Intel i7-9700, 16 GB memory) and 1 multi-core server with 16 cores (CPU Intel E5-2682 v4, 32 GB memory). All servers run the Linux operating system (Fedora 32 Server) and are connected by Ethernet.
To simulate the daily working conditions of a computer cluster, 10 different component application programs developed with the cca-tools parallel component development toolkit were selected as the tested programs. Table 4 gives the basic information of these 10 programs. The 10 programs were assembled, by connection, on the 16-core multi-core server and then deployed to the computer cluster by the component execution Agents and run.
TABLE 4 basic information of the tested program
The first run was a component program assembly experiment. The CCA_SPM component program consists of 3 components. The first component, Preprocessing, is implemented with C+OpenMP (initially allocated 1 process) and preprocesses the input image. The second component, Model_estimate, is implemented with C+MPI+OpenMP (initially allocated 8 processes) and is used for model estimation. The third component, View, is implemented in Python (initially allocated 1 process) and displays the image processing results. During component connection, the 3 generated component connection Agents detect the interface conditions of the components and exchange information with each other, finding that M×N data redistribution needs to be carried out between Preprocessing and Model_estimate, and that interface bonding and M×N data redistribution need to be carried out between Model_estimate and View. When Preprocessing and Model_estimate are connected, the component connection Agent of the Model_estimate component generates data redistribution code that uses MPI_Scatter to distribute the data received from Preprocessing from 1 process to 8 processes. When Model_estimate and View are connected, the interface called by Model_estimate is int Results_View(int MPicture[128][128][1][36]); the MPicture parameter is the processed image array. The interface of the View service is int View(int MRIP[128][1][36], int is_overlay); the MRIP parameter is the input image array, and when the is_overlay parameter is 1 the result image covers the original image, when it is 0 it does not, and by default it does not. According to the interface bonding method provided by the invention, the two interfaces are converted into the SIDL interface int Results_view(int MRIP[128][1][36], int is_overlay = NULL), so that the two components can be connected through their interfaces. The component connection Agent also generates data redistribution code for the Model_estimate component, which uses MPI_Gather to collect the image data processing results from 8 processes onto 1 process. The CCA_SPM component program assembly process thus uses the component connection Agent provided by the invention, and the interface bonding and data redistribution codes generated during component assembly complete the matching and connection of the interfaces and the data redistribution.
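A hand-written illustration of the kind of glue code described above is sketched below; the simplified pointer signatures, the stub View implementation and the kNoOverlay constant are assumptions, since the real generated code adapts the full SIDL interfaces.

// Glue sketch: expose the interface the calling component invokes (Results_View)
// and forward to the called component's View service, supplying the surplus
// is_overlay parameter with the no-overlay default.
#include <iostream>

int View(int* MRIP, int is_overlay) {                 // stub standing in for the View service
    std::cout << "View called, is_overlay=" << is_overlay
              << ", first element=" << MRIP[0] << '\n';
    return 0;
}

int Results_View(int* MPicture) {                     // interface the caller expects
    const int kNoOverlay = 0;                         // default for the extra parameter
    return View(MPicture, kNoOverlay);
}

int main() {
    int image[4] = {1, 2, 3, 4};                      // toy stand-in for the image array
    return Results_View(image);
}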
Next, experiments on the component adaptive mechanisms were performed. To check the effect of the component adaptive mechanism provided by the invention, the Background component was selected from the MM5_Component program and the parallelism adaptive mechanism was added to it. For comparison, a Background component with the adaptation function of Concerto was implemented manually. In the Concerto adaptive mechanism, the parallelism of component operation is not specified by the component program builder; instead, resource detection is performed before the run, the parallelism is determined according to the available resources, and that parallelism is kept throughout the component run. In the parallelism adaptive mechanism provided by the invention, the component program builder first specifies the running parallelism of the component, and then, during the run, the parallelism is changed dynamically according to the actually available resources. Obviously, if the builder specifies a reasonably appropriate parallelism, for example equal or close to the parallelism that Concerto determines through resource detection, the method provided by the invention can also reflect resource changes at run time and achieve better performance than the Concerto method.
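The decision rule behind this parallelism adaptation can be illustrated with the short C++ sketch below; the structure fields, the threshold values and the shouldWiden name are assumptions used only to make the rule concrete.

// Widen from the builder-specified parallelism m to m+n cores once the run has
// lasted beyond the time threshold and enough lightly loaded cores are available.
#include <iostream>

struct AdaptState {
    int m;                      // initial parallelism set by the component program builder
    int n;                      // extra cores the adaptive Agent asks for
    double timeThresholdSec;    // time threshold written into the adaptive Agent
};

bool shouldWiden(const AdaptState& s, double elapsedSec, int freeCores) {
    return elapsedSec > s.timeThresholdSec && freeCores >= s.n;
}

int main() {
    AdaptState s{4, 4, 60.0};
    if (shouldWiden(s, 75.0, 6))                     // 75 s elapsed, 6 free cores reported
        std::cout << "suspend, repartition, restart on " << s.m + s.n << " cores\n";
    return 0;
}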
FIG. 4 shows the results of testing the Background component. In the experiment of FIG. 4, only a certain number of SMP computing nodes were started at the time of component deployment, namely 2, 4, 8 and 16 SMP servers respectively. Then, during the operation of the component, the remaining SMP computing nodes were started. The parallelism obtained by the initial detection of the Concerto version of the component, and the initial parallelism specified by the component program builder in the optimization mechanism provided by the invention, are 2, 4, 8 and 16 respectively. The optimization mechanism provided by the invention can dynamically increase the parallelism as computing nodes are added during operation, thereby obtaining better performance.
The VERinter component in the MM5_Component program was selected and the change-data-partitioning adaptive mechanism was added to it. Existing parallel component program performance optimization methods rarely involve dynamically changing the data partitioning, so VERinter is compared with the original version to which no performance optimization mechanism has been added. Meanwhile, in order to better simulate the operating situation on an actual computer cluster, while the MM5_Component program is run and the performance of VERinter is measured, the other 9 component programs are deployed and run on the computing platform at the same time. Because the load on the whole cluster is not balanced, some computing nodes carry more components and their performance drops; among the different computing nodes executing the VERinter component in parallel, the more lightly loaded nodes finish the data blocks allocated to them earlier than the others during the run, which triggers the change-data-partitioning adaptive mechanism. The unexecuted data is partitioned again, the processing capacity of the faster computing nodes is used effectively, and the performance of the component program is improved. FIG. 5 shows the results of the experiments performed on the VERinter component. As can be seen from FIG. 5, the method provided by the invention improves the performance of the component, and when the volume of input data is large, the performance improvement brought by data repartitioning is more obvious compared with the original version.
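The repartitioning step itself amounts to re-splitting the unexecuted data blocks over the currently available computing nodes, as in the simplified C++ sketch below; the round-robin split and all names are assumptions standing in for the Agent protocol.

// Collect the indices of unfinished data blocks and split them again over the workers.
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

std::vector<std::vector<int>> repartition(const std::vector<int>& unfinished, int workers) {
    std::vector<std::vector<int>> plan(workers);
    for (std::size_t i = 0; i < unfinished.size(); ++i)
        plan[i % workers].push_back(unfinished[i]);   // simple round-robin re-split
    return plan;
}

int main() {
    std::vector<int> unfinished(30);                  // 30 data blocks still pending
    std::iota(unfinished.begin(), unfinished.end(), 0);
    auto plan = repartition(unfinished, 4);           // 4 computing nodes now available
    for (std::size_t w = 0; w < plan.size(); ++w)
        std::cout << "node " << w << " gets " << plan[w].size() << " blocks\n";
    return 0;
}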
The component migration adaptive mechanism was added to 8 components in the MM5_Component program, the MM5_Component program was then run on its own, and a single computing node was shut down manually during the run. After the single computing node fails, the component migration adaptive Agent can start the backup computing node, recover the intermediate results, and complete the processing task of the component program.
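The checkpoint-and-resume idea behind this migration can be illustrated as follows; the in-memory BackupNode, the checkpoint interval and the toy workload are assumptions, since the real mechanism ships the copy to a separate backup computing node.

// Periodically copy the intermediate result to a backup; after a failure the
// backup instance resumes from the last checkpoint and finishes the task.
#include <iostream>
#include <vector>

struct Checkpoint { long nextItem = 0; std::vector<double> partial; };
struct BackupNode { Checkpoint saved; };              // stand-in for the remote backup copy

void runComponent(long total, BackupNode& backup, bool failMidway) {
    Checkpoint cp = backup.saved;                     // resume from the last checkpoint
    for (long i = cp.nextItem; i < total; ++i) {
        cp.partial.push_back(i * 0.5);                // toy intermediate result
        if (i % 100 == 0) { cp.nextItem = i + 1; backup.saved = cp; }   // checkpoint
        if (failMidway && i == 500) { std::cout << "node failed at item 500\n"; return; }
    }
    std::cout << "finished " << cp.partial.size() << " items\n";
}

int main() {
    BackupNode backup;
    runComponent(1000, backup, true);                 // primary run fails midway
    runComponent(1000, backup, false);                // backup resumes and completes
    return 0;
}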
To test the change-implementation adaptive mechanism proposed by the invention, the Topographic component in the MM5_Component program was selected and two different versions were prepared: besides the common version implemented with C+MPI, a GPU-based C+OpenCL version was also implemented. For comparison, a Topographic component with the prediction and adaptation capability of ICENI was implemented manually. ICENI predicts the performance of the different versions on the currently available resources before the component runs, and then selects the version with the better predicted performance to run; once a version is selected, no further version change is made at run time. The change-implementation adaptive mechanism provided by the invention directly specifies the GPU-based C+OpenCL version as the high-speed version and needs no performance prediction. As long as the system has a lightly loaded GPU computing node, this version of the component is scheduled to run on the GPU, which saves the overhead of performance prediction and gives better performance than the ICENI version. Meanwhile, on a real computer cluster platform, hardware with special acceleration functions is often unavailable when a component is first deployed. In this experiment, to simulate this situation, the component XMD_GPU was first scheduled to run on the only GPU computing node, and the deployment and operation of the Topographic component was then started. In the ICENI version of the program, the Topographic component would still be deployed onto the GPU computing node, but it has to wait for XMD_GPU to finish executing before starting its own work. With the change-implementation adaptive mechanism proposed by the invention, when Topographic is first deployed, since the GPU computing node is occupied by XMD_GPU, Topographic selects the common version implemented with C+MPI to run on other computing nodes; at the same time, it periodically probes the availability of the GPU computing node. When the GPU computing node becomes idle, the execution of the common-version Topographic component is suspended, the intermediate result is extracted, and the C+OpenCL version of the Topographic component is scheduled onto the GPU to continue the run. FIG. 6 shows the performance of the ICENI version and of the Topographic component with the change-implementation adaptive mechanism proposed by the invention, at different input data scales, in the case where the GPU computing node is unavailable at the time of the component's first deployment. Obviously, on a real computer cluster, the proposed change-implementation adaptive mechanism gives better performance than the ICENI version.
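The run-time decision just described, probing for an idle GPU node and switching versions once one appears, is sketched below; the stub probe, the chunked CPU loop and all names are assumptions made for the illustration.

// Keep running the common C+MPI version in chunks; when a GPU node becomes idle,
// record the progress and hand the remaining work to the C+OpenCL version.
#include <iostream>

struct Progress { long done = 0, total = 1000; };

bool gpuNodeIdle(int probeRound) { return probeRound >= 3; }   // stub probe result
void runCpuChunk(Progress& p) { p.done += 100; }               // common version, one chunk
void runGpuRest(Progress& p)  { p.done = p.total; }            // high-speed GPU version

int main() {
    Progress p;
    for (int round = 0; p.done < p.total; ++round) {
        if (gpuNodeIdle(round)) {
            std::cout << "GPU node free: migrate at item " << p.done << '\n';
            runGpuRest(p);                                     // resume on the GPU version
        } else {
            runCpuChunk(p);                                    // stay on the common version
        }
    }
    std::cout << "completed " << p.done << " items\n";
    return 0;
}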
To check the effectiveness of the load balancing method provided by the invention, a system using a traditional load balancing mechanism based on centralized control is compared with a running platform on which the load balancing mechanism provided by the invention is enabled. In a centralized load balancing system, load detection, load policy generation and the actual load migration are all completed under the control of a root computing node; the task of the root computing node is heavy and the execution efficiency of the load balancing work is relatively low. For the system using the load balancing mechanism provided by the invention, the load detection period was defined as 20 minutes and the upper limit of the load as 0.7. In these two operating environments, 2, 4, 8 and 10 of the previously selected 10 component programs were assembled and submitted respectively, corresponding to 14, 33, 63 and 73 component tasks. The results of the load balancing comparison experiment are given in FIG. 7. As can be seen from FIG. 7, the load balancing mechanism provided by the invention can reasonably balance the load on each computing node, reduce the total running time of all component tasks, and improve the throughput of the system. Moreover, the effect of load balancing on system performance is more obvious when the number of component tasks is large and the load of the whole system is heavy. The load balancing mechanism provided by the invention relies mainly on the cooperation of the resource management Agents distributed on each computing node, the component execution Agents of the components to be migrated, and the autonomous load detection Agents to complete the whole load balancing work, and has higher performance than a centrally controlled load balancing mechanism. At the same time, the load balancing mechanism provided by the invention also has a certain performance cost: the detection of load and the migration of component tasks require a certain amount of time and computing resources. When the whole system carries 8 and 10 component programs, that is 63 and 73 component tasks, enabling the load balancing function brings a large improvement to the overall performance of the system; the cost of maintaining load balance is relatively small and the benefit of load balancing is relatively large.
On the basis of an analysis of existing parallel component performance optimization techniques, and combining the characteristics and advantages of Agent technology, the invention provides a parallel component program assembly and performance optimization method based on Agent technology. Agents with specific functions are defined in advance as C++ classes, and when an Agent is needed, an instance of the corresponding class is generated through the Agent management system, so that Agents with different functions can be produced. The component connection Agent translates component interfaces written in different languages by calling the Babel tool, and generates the corresponding bonding code to bond interfaces whose names, parameters or return value types do not match; it is also able to generate bonding code that supports M×N data redistribution as needed. The component execution Agent manages the dependency relationships between components; through interaction with the resource management Agents it collects information on computing resources that meet the operating requirements of the components, and deploys the components onto specific computing resources. The component program builder specifies the initial scheduling priority of a component program, the component execution Agent manages the priority of the component, and the resource management Agent of a computing node maintains the task queue of components waiting to be executed; the waiting component tasks are executed in order of priority and expected execution time. To improve the performance of a parallel component program, the invention provides 4 different component adaptive processes, namely dynamically changing the component parallelism, changing the data partitioning, component migration, and changing the implementation, which are completed under the mutual cooperation of the adaptive Agent, the component execution Agent and the resource management Agent. For the load imbalance that may occur on a heterogeneous cluster platform, the invention proposes that, through the joint cooperation of the resource management Agent, the load detection Agent and the component execution Agent, the component tasks on heavily loaded computing nodes can be transferred to lightly loaded computing nodes for execution, improving the throughput and performance of the whole system. Assembly and running experiments with 10 parallel component programs on a heterogeneous computer cluster demonstrate the effectiveness of the proposed method. Compared with traditional performance optimization methods, the method based on Agent technology is flexible to use and has performance advantages.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A parallel component assembling and performance optimizing method based on Agent technology is characterized by comprising the following steps:
step one: respectively defining Agents with different functions as different C++ classes, and generating the Agents with different functions from the different C++ classes through an Agent management system, wherein the Agents with different functions comprise a component connection Agent, a component execution Agent, a resource management Agent and an adaptive Agent;
step two: the component connection Agent completes the construction of a component program by calling the Babel tool to carry out bonding and data redistribution between the different component interfaces; the Agent management system then destroys the component connection Agent and releases the resources occupied by the component connection Agent;
step three: the component execution Agent and the resource management Agent cooperate with each other to deploy a component program on a computing node meeting the requirement;
step four: self-adaptively deploying a parallel component program on a computing node meeting the requirement through mutual cooperation of a self-adaptive Agent, a component execution Agent and a resource management Agent;
step five: the resource management Agent requests the Agent management system to generate a load detection Agent for the resource management Agent, and the component tasks on the high-load computing nodes are transferred to the low-load computing nodes to be executed through the cooperative operation of the resource management Agent, the load detection Agent and the component execution Agent, so that the throughput and the performance of the component programs are improved.
2. The Agent technology-based parallel component assembling and performance optimizing method according to claim 1, wherein in the second step, the specific operation method is as follows:
in the interface bonding process, the component connection Agent of the calling component calls Babel to convert the interface of the calling component into SIDL form, and the component connection Agent of the called component calls Babel to convert the interface of the called component into SIDL form; when the interface of the calling component does not match the interface of the called component, the component connection Agent of the called component changes the SIDL interface name of the called component into the SIDL interface name form of the calling component, so that the calling component identifies the called component that matches it;
for the interface parameters of the calling component and the interface parameters of the called component, when the interface parameters of the calling component are passed to the called component, the component connection Agent of the calling component uses bonding code to convert the interface parameters of the calling component into the number and types of the interface parameters of the called component; when the number of interface parameters of the calling component and of the called component are not equal, the surplus interface parameters are assigned the value NULL; at the end of the operation of the called component's bonding code, the component connection Agent of the called component converts the return value type of the called component into the return value type of the calling component's interface through the bonding code;
when a component program is being assembled, the calling component runs on M processes and the called component runs on N processes, M being unequal to N; the component connection Agent of the calling component collects the running results of the M processes of the calling component, and the running results are distributed to the N processes of the called component through the component connection Agent of the called component, thereby realizing the M×N data redistribution; after the bonding code and the data redistribution code have been generated, the Agent management system destroys the component connection Agent of the calling component and the component connection Agent of the called component, and releases the resources occupied by these two component connection Agents.
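As an illustration of the M×N redistribution referred to in this claim, the following compilable MPI sketch shows the one-to-N half of the exchange; the array sizes, item counts and the use of MPI_Scatter alone are assumptions chosen to keep the example short.

// One root process holds the calling component's output and scatters equal
// shares of it to every process of the called component.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int perProc = 4;                            // items each receiving process gets
    std::vector<int> all;                             // full result, held by the root only
    if (rank == 0)
        for (int i = 0; i < perProc * size; ++i) all.push_back(i);

    std::vector<int> mine(perProc);
    MPI_Scatter(rank == 0 ? all.data() : nullptr, perProc, MPI_INT,
                mine.data(), perProc, MPI_INT, 0, MPI_COMM_WORLD);
    std::printf("rank %d received first item %d\n", rank, mine[0]);

    MPI_Finalize();
    return 0;
}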
3. The Agent technology-based parallel component assembling and performance optimizing method according to claim 1, wherein in the third step, the specific operation method is as follows:
generating a resource management Agent for each computing node by using an Agent management system;
when the component is allocated with resources, the component execution Agent sends the resource requirements of the component program to be deployed to the resource management Agent corresponding to the nearest computing node;
when the computing node corresponding to the nearest resource management Agent can meet the resource requirement of the component program to be deployed, writing the information of the nearest computing node into the message and transmitting the information back to the component execution Agent; when the resource of the nearest computing node can not meet the resource requirement of the component program to be deployed, writing the information of the computing node into a message, and transmitting the message to the next resource management Agent; when the computing node managed by the next resource management Agent cannot meet the resource requirement of the component program to be deployed, writing the information of the computing node managed by the next resource management Agent into a message, and transmitting the message to the next resource management Agent until the resource sum of the computing nodes with registered information meets the resource requirement of the component program to be deployed;
and the last computing node registering the information transmits a message back to the component execution Agent of the component program to be deployed, and the component execution Agent deploys the component program to the registered computing nodes meeting the requirements and starts to run.
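The forwarded-request idea of this claim can be illustrated with the short C++ sketch below; the ring of nodes, the core counts and the data structures are assumptions, and only the accumulate-until-satisfied logic mirrors the claim.

// A resource request hops from node to node; each node with spare capacity is
// registered in the message, and forwarding stops once the registered nodes
// together satisfy the requirement.
#include <iostream>
#include <string>
#include <vector>

struct Node { std::string id; int freeCores; };
struct Request { int coresNeeded; std::vector<Node> registered; };

bool satisfied(const Request& r) {
    int sum = 0;
    for (const auto& n : r.registered) sum += n.freeCores;
    return sum >= r.coresNeeded;
}

int main() {
    std::vector<Node> ring = {{"node1", 2}, {"node2", 0}, {"node3", 4}, {"node4", 8}};
    Request req{8, {}};
    for (const auto& n : ring) {                      // the message visits the nodes in turn
        if (n.freeCores > 0) req.registered.push_back(n);
        if (satisfied(req)) break;                    // the last node returns the message
    }
    for (const auto& n : req.registered)
        std::cout << n.id << " contributes " << n.freeCores << " cores\n";
    return 0;
}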
4. The Agent technology-based parallel component assembly and performance optimization method according to claim 3, wherein the adaptive Agents comprise parallelism adaptive Agents, data partitioning adaptive Agents, component migration adaptive Agents and change-implementation adaptive Agents.
5. The Agent technology-based parallel component assembling and performance optimizing method according to claim 4, wherein the method for deploying the parallel components on the computing nodes meeting the requirements in a self-adaptive manner through mutual cooperation of the parallelism self-adaptive Agent, the component execution Agent and the resource management Agent is as follows:
a component program builder sets a load threshold, a time threshold and the number n of computing nodes before running a component, and writes the load threshold, the time threshold and the number n into a parallelism self-adaptive Agent;
when the component is initially distributed to m CPU processor cores to run, the parallelism adaptive Agent times the running of the component, and simultaneously sends load query information and the component program to the component execution Agents of the adjacent computing nodes;
the load condition of the adjacent computing nodes is checked: if the load is higher than the load threshold, the load query information is sent on to the next computing node; otherwise, the resource information of the adjacent computing node is written into the load query information, and the load query information continues to be passed to the next computing node until the total number of CPU processor cores of the computing nodes recorded in the load query information whose load is below the load threshold meets the requirement of the parallelism adaptive Agent; once the running time of the component exceeds the time threshold and the number of available CPU processor cores meets the requirement of the parallelism adaptive Agent, the parallelism adaptive Agent makes an adaptive decision, suspends the execution of the component, redistributes the unexecuted data, and distributes the component to m+n CPU processor cores for parallel execution.
6. The Agent technology-based parallel component assembling and performance optimizing method according to claim 4, wherein the method for adaptively deploying the parallel component program on the computing node meeting the requirement through mutual cooperation of the data division adaptive Agent, the component execution Agent and the resource management Agent comprises the following steps:
after each computing node finishes the distributed tasks, the data division self-adaptive Agent detects the ratio of the finished data volume to the total data volume;
when the ratio is smaller than 1/10, triggering and changing a self-adaptive strategy of data division, suspending parallel execution of component programs by the self-adaptive Agent of data division, collecting unfinished data tasks, and performing division and execution again;
when the ratio is greater than 1/10, the data partitioning adaptive function of the component program operation is turned off, and the remaining part of the component program operation does not perform the detection of the data amount any more.
7. The Agent technology-based parallel component assembling and performance optimizing method according to claim 4, wherein the method for adaptively deploying the parallel component program on the computing node meeting the requirement through mutual cooperation of the component migration adaptive Agent, the component execution Agent and the resource management Agent comprises the following steps:
before component scheduling, the Agent management system allocates a component migration adaptive Agent to the component; a stable computing node is selected as the backup computing node, and the component migration adaptive Agent is called to send a copy of the component to the backup computing node, the copy comprising copies of the component execution Agent and of the component migration adaptive Agent; during operation, if the computing node fails, the intermediate result is taken out, the component instance on the backup computing node is started, and the component task is carried on to completion.
8. The Agent technology-based parallel component assembling and performance optimizing method according to claim 4, wherein the method for adaptively deploying the parallel component program on a computing node meeting the requirements through mutual cooperation of the change-implementation adaptive Agent, the component execution Agent and the resource management Agent is as follows:
during the running of the component, the change-implementation adaptive Agent actively detects the hardware condition of each computing node; if a lightly loaded computing node containing a GPU is found, the change-implementation adaptive Agent suspends the execution of the component and saves the intermediate result, then calls the component execution Agent to deploy the special version of the component onto the lightly loaded computing node to continue running.
9. The Agent technology-based parallel component assembling and performance optimizing method according to claim 1, wherein in step five, the specific operation method is as follows:
starting a load balancing mechanism of the Agent management system, and defining a load detection period and an upper limit of a load by a constructor of a component program;
taking the ratio of the load average returned by the top command to the number of CPU cores as the load condition of the computing node where the component program is located;
the resource management Agent on each computing node periodically detects the load condition of the computing node, and if the load of the computing node is found to be larger than the upper limit of the load during detection, the resource management Agent informs component execution agents of all component programs on the computing node, suspends the execution of the component programs, and stores intermediate results in the component execution agents;
the resource management Agent requests the Agent management system to generate a load detection Agent for the resource management Agent, the load detection Agent moves autonomously in the computing platform, load information of each computing node is collected, then the load detection Agent returns to a source computing node initiating load detection, and the load information stored on the source computing node is updated; and after the load detection Agent returns load information to the source computing node, the load detection Agent is deleted by the Agent management system, and resources occupied by the load detection Agent are released.
CN202011608335.3A 2020-12-30 2020-12-30 Agent technology-based parallel component assembling and performance optimizing method Pending CN112711448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011608335.3A CN112711448A (en) 2020-12-30 2020-12-30 Agent technology-based parallel component assembling and performance optimizing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011608335.3A CN112711448A (en) 2020-12-30 2020-12-30 Agent technology-based parallel component assembling and performance optimizing method

Publications (1)

Publication Number Publication Date
CN112711448A true CN112711448A (en) 2021-04-27

Family

ID=75547237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011608335.3A Pending CN112711448A (en) 2020-12-30 2020-12-30 Agent technology-based parallel component assembling and performance optimizing method

Country Status (1)

Country Link
CN (1) CN112711448A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014015697A1 (en) * 2012-05-04 2014-01-30 运软网络科技(上海)有限公司 Autonomic management system and method of virtual network
CN107168782A (en) * 2017-04-24 2017-09-15 复旦大学 A kind of concurrent computational system based on Spark and GPU

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PENG YUNFENG: "Research on parallel component assembly and performance optimization method based on Agent technology", Application Research of Computers *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304225A (en) * 2023-03-22 2023-06-23 联洋国融(上海)科技有限公司 Component retrieval and evaluation method based on MAS system
CN116304225B (en) * 2023-03-22 2024-05-17 联洋国融(上海)科技有限公司 Component retrieval and evaluation method based on MAS system

Similar Documents

Publication Publication Date Title
US10871998B2 (en) Usage instrumented workload scheduling
Huedo et al. A framework for adaptive execution in grids
Warneke et al. Exploiting dynamic resource allocation for efficient parallel data processing in the cloud
CN103069389B (en) High-throughput computing method and system in a hybrid computing environment
CN103069390B (en) Method and system for re-scheduling workload in a hybrid computing environment
Vadhiyar et al. A performance oriented migration framework for the grid
US7272820B2 (en) Graphical development of fully executable transactional workflow applications with adaptive high-performance capacity
CN112783649A (en) Cloud computing-oriented interactive perception containerized micro-service resource scheduling method
US8225300B1 (en) Client program executable on multiple heterogeneous server platforms
CN112882828A (en) Upgrade processor management and scheduling method based on SLURM job scheduling system
Du et al. Dynamic scheduling with process migration
Posner et al. Transparent resource elasticity for task-based cluster environments with work stealing
CN112711448A (en) Agent technology-based parallel component assembling and performance optimizing method
Thant et al. Mobile agents based load balancing method for parallel applications
WO2022253451A1 (en) Task-centric job scheduling method and system for heterogeneous clusters
CN113220436A (en) Universal batch operation execution method and device under distributed environment
Peng et al. Parallel Component Composition and Performance Optimization Based on Agent Technology
Jie et al. Dynamic load-balancing using prediction in a parallel object-oriented system
Peng et al. A resource elastic scheduling algorithm of service platform for cloud robotics
Hluchý et al. Hybrid approach to task allocation in distributed systems
Birnbaum et al. Grid workflow software for a high-throughput proteome annotation pipeline
Dufaud et al. Design of data management for multi SPMD workflow programming model
Sandokji et al. Communication and computation aware task scheduling framework toward exascale computing
Cera Providing adaptability to MPI applications on current parallel architectures
Chandra et al. Prediction based dynamic load balancing techniques in heterogeneous clusters

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210427