CN117441161A - Software optimization method and equipment of NUMA architecture - Google Patents


Info

Publication number
CN117441161A
Authority
CN
China
Prior art keywords: variables, functions, memories, core, memory
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280005934.7A
Other languages
Chinese (zh)
Inventor
马可·迪纳塔莱
恩里科·比尼
亚历山德罗·德鲁埃托
安德里亚·格罗索
西尔维奥·巴奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Publication of CN117441161A


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/4881 — Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/5044 — Allocation of resources to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering hardware capabilities
    • G06F 2209/483 — Multiproc
    • G06F 2209/502 — Proximity
    • G06F 2209/506 — Constraint

Abstract

The present invention relates to an apparatus and method for deploying software on a multi-core hardware platform. The platform includes a plurality of cores and a memory. The software includes a plurality of functions that share a plurality of variables. The device is configured to map the variables to the memory and the plurality of functions to the plurality of cores. The device may obtain the mapping of the variables and the mapping of the functions from a binary integer linear programming (BILP) problem. Alternatively, the device may be configured to bind the variables to the functions and allocate each bound variable to the local memory of the corresponding core. In this way, the resources of the platform can be utilized more efficiently according to the joint mapping of the variables and the functions.

Description

Software optimization method and equipment of NUMA architecture
Technical Field
The invention relates to the technical field of computers. For example, the present invention relates to a software optimization apparatus and method for a multi-core system.
Background
Multi-core (or multiprocessor) systems are widely used to control various application scenarios for real-time applications. Non-uniform memory access (NUMA) is a computer memory architecture for multi-core systems. Fig. 6 shows an abstract model of a NUMA architecture, in which each processor, also referred to herein as a core or central processing unit (CPU), is directly coupled to its own local random access memory (LRAM). In addition, multiple processors may typically share at least one global random access memory (GRAM). Unlike uniform memory access architectures, in which each processor has the same memory access time, a processor in a NUMA architecture can access its LRAM faster than non-local memory. It should be noted that LRAM in the present invention is different from a CPU cache, which is located inside a CPU and is not considered a memory accessible to other CPUs. In contrast, LRAM in the present invention can be accessed by all CPUs connected through a bus. The non-local memory may include memory local to another processor, or global memory shared among multiple processors, such as GRAM. The NUMA architecture provides a separate memory allocation for each processor (or group of processors) in a multiprocessor system, thereby avoiding performance degradation when multiple processors attempt to address the same memory.
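The asymmetric access times described above can be sketched with a toy cost model. All names and cycle counts below are illustrative assumptions, not values from the present invention:

```python
# Toy NUMA access-cost model: each core reads its own LRAM fastest, and
# remote LRAMs or the shared GRAM roughly an order of magnitude slower.
# Core/memory names and cycle counts are illustrative assumptions.

ACCESS_CYCLES = {
    # (core, memory) -> cycles per access
    ("core1", "LRAM1"): 1,   # local access
    ("core1", "LRAM2"): 10,  # remote LRAM
    ("core1", "GRAM"): 10,   # shared global memory
    ("core2", "LRAM2"): 1,
    ("core2", "LRAM1"): 10,
    ("core2", "GRAM"): 10,
}

def access_cost(core: str, memory: str) -> int:
    """Cycles for one access from `core` to `memory`."""
    return ACCESS_CYCLES[(core, memory)]

# A core reaches its directly coupled LRAM much faster than non-local memory.
assert access_cost("core1", "LRAM1") < access_cost("core1", "GRAM")
```

This is the asymmetry that makes variable placement matter: the same function runs with fewer stall cycles when its variables sit in the LRAM of the core executing it.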
For example, in the automotive field, an electronic control unit (ECU) of a motor vehicle may typically comprise a plurality of cores. Modern vehicles are capable of supporting many complex functions, such as infotainment, navigation, security, advanced engine emission control, image-detection-based services, and autonomous driving. Therefore, the number of ECUs embedded in a vehicle is steadily increasing and can exceed 100. A standardized architecture called the AUTomotive Open System ARchitecture (AUTOSAR) is widely adopted to standardize the hardware and software architectures used in modern vehicles, with the aim of simplifying the development and integration of vehicle-related software functions. Multi-core support was introduced in AUTOSAR 4.0. For example, an inter OS-application communicator (IOC) is provided as a connection to the runtime environment (RTE) of the software architecture to map cross-core and memory communications. By running complex functions on a multi-core system (e.g., with a NUMA architecture), system performance may be improved. It should be noted that the functions of a plurality of ECUs may be implemented in a single ECU. For example, infotainment and navigation may be implemented separately on different ECUs, or jointly on the same ECU.
Disclosure of Invention
Currently, at the software level, variables (or labels) and functions (or runnables) may be mapped to specific cores for execution according to predictive algorithms or the like. However, the NUMA architecture presents challenges in terms of predictability. Thus, mapping software functions and shared variables to cores is complex and time consuming.
For example, an embedded multiprocessor application may typically include thousands of functions that communicate with each other in a densely connected pattern using shared variables. Mapping the functions and variables to cores while ensuring optimal system performance is not easy.
For example, during the software-to-hardware deployment phase, two related problems must be solved: assigning task functions to cores, and placing shared variables in memories. These decisions have a greater impact on a NUMA platform, because the fastest memory differs from core to core. Improper deployment may negatively impact the performance and correctness of the overall system. The two problems are also highly interdependent, especially on NUMA platforms. Depending on the placement of variables in memories and the associated allocation of functions to cores, memory operations may cause additional stall cycles. Thus, the time required to execute a given task function using the relevant variables may vary widely between different schemes.
A typical software application may include on the order of one thousand functions and ten thousand shared variables. The target platform may include multiple cores, as well as multiple local RAMs and global RAMs. This makes it impossible to obtain a quality-guaranteed software deployment manually. A typical software deployment may include assigning functions to cores. One of the fundamental challenges in designing software for a multi-core system is ensuring efficient use of the available computing, communication, and memory resources. For example, in AUTOSAR, when the IOC operates using a sender-receiver protocol (e.g., by implementing a data memory buffer), finding a proper schedule and the corresponding execution of time-critical data transmissions is a challenging task, e.g., when attempting to achieve a high degree of parallelism across multiple cores. In addition, handling communication between functions running in parallel on different cores becomes more difficult and may require a complex software deployment procedure to ensure safe and reliable operation of the overall system. For example, a vehicle control system may need to ensure that safety functions are not blocked by other, less urgent functions (e.g., navigation).
In view of the above, the present invention is directed to improving the efficiency of automated software deployment. The goal is to ensure efficient and/or optimal utilization of the available resources. It is another object of the present invention to provide a scheme for automatically obtaining a mapping of embedded applications to a multi-core system. It is a further object of the present invention to provide a faster, more robust and predictable software deployment/configuration scheme for real-time applications.
These and other objects are achieved by the solution of the invention described in the independent claims. Advantageous implementations are further defined in the dependent claims.
A first aspect of the invention provides an apparatus for mapping a plurality of functions sharing a plurality of variables to a multi-core computing system. The multi-core computing system includes a plurality of cores and a plurality of memories. Each core is coupled to a memory of the plurality of memories. The apparatus is for assigning each of the variables to one of the plurality of memories according to one or more characteristics of the plurality of memories to obtain an assignment of the plurality of variables to the plurality of memories. The apparatus is also for mapping a plurality of functions to a plurality of cores based on the allocation of the plurality of variables to the plurality of memories.
Software deployment in the present invention may be understood as inferring the function-to-core mapping (or allocation) and the variable-to-memory mapping.
By assigning the shared variables to memories and mapping the plurality of functions to the plurality of cores according to the assignment of the plurality of variables to the plurality of memories, the impact of one or more characteristics of the plurality of memories on the shared variables is taken into account. This has the advantage that the variables are allocated to the memories more efficiently, so that the overall performance of the multi-core system can be improved.
In one implementation of the first aspect, the one or more characteristics of the plurality of memories may include an access time for each memory.
By taking into account the access time of each memory, variables that are used more frequently or have a greater impact on system performance can be assigned to faster memories. In this way, the overall resource utilization of the multi-core system may be optimized.
In an implementation manner of the first aspect, the device may be further configured to divide the variable with respect to the plurality of functions to obtain a binding relationship between the variable and the functions. The device may also be configured to map the plurality of functions to the plurality of cores further according to a binding relationship between the variables and the functions.
Alternatively, the binding relationship between the variables and the functions may include a one-to-one relationship, or a many-to-one relationship. Optionally, at most one function is bound per variable. Multiple variables may be bound to the same function.
Optionally, to map the plurality of functions to the plurality of cores further according to the binding of variables to functions, the device may be used to determine, for each function, the core on which executing the function saves the most execution time. When a function is mapped to a particular core, the device may also be used to map the one or more variables bound to that function in the binding relationship to the memory with the closest or fastest access time relative to that core.
By binding the variables to the function, the allocation of multiple variables to multiple memories may be accomplished along with the allocation of functions to cores. The advantage is that the available computing resources and memory resources of the multiple cores can be utilized more efficiently.
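The implementation above can be sketched as a simple greedy procedure. All function names, core names, memory names, and cycle counts below are illustrative assumptions; the patent leaves the concrete algorithm open:

```python
# Sketch: each variable is bound to at most one function; when a function
# is mapped to a core, its bound variables follow it into that core's
# local memory. The "LRAM_<core>" naming is a hypothetical convention.

def map_functions(functions, bindings, exec_cycles):
    """functions: list of function names.
    bindings: {function: [bound variables]}.
    exec_cycles: {(function, core): execution cycles when the bound
                  variables sit in that core's LRAM}.
    Returns (function -> core, variable -> memory)."""
    cores = sorted({c for (_, c) in exec_cycles})
    func_to_core, var_to_mem = {}, {}
    for f in functions:
        # Pick the core on which the function executes fastest.
        best = min(cores, key=lambda c: exec_cycles[(f, c)])
        func_to_core[f] = best
        # Bound variables land in that core's local memory.
        for v in bindings.get(f, []):
            var_to_mem[v] = "LRAM_" + best
    return func_to_core, var_to_mem

f2c, v2m = map_functions(
    ["f1", "f2"],
    {"f1": ["a"], "f2": ["b", "c"]},
    {("f1", "core1"): 100, ("f1", "core2"): 140,
     ("f2", "core1"): 90,  ("f2", "core2"): 60},
)
# f1 runs fastest on core1 and f2 on core2; their bound variables follow.
```

Note how the variable-to-memory mapping is obtained implicitly from the binding, as described in the text.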
In one implementation of the first aspect, for partitioning variables, the device may be configured to, for each function:
-obtaining a frequency of executing the function;
-associating one or more of the plurality of variables with the function according to the frequency of the function.
Alternatively, the frequency at which the function is performed may be understood as the maximum execution frequency of the function. The maximum execution frequency of each function may be predefined or preconfigured. The maximum execution frequency may be provided as an input to the device, e.g., by the provider of each function.
Alternatively, the device may be configured to calculate, for each function, the number of execution cycles that can be saved in one call under the hypothesis that each variable is mapped to local memory, i.e., the memory closest to the core executing the function. The device may be used to compare the different hypotheses and associate one or more variables with each function in the manner that saves the most total execution time. The advantage is that more resources can be allocated to functions that are executed more frequently. Thus, hardware resources can be utilized efficiently, and overall system performance is improved.
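One plausible reading of this frequency-weighted association is sketched below: each variable is bound to the function for which "execution frequency × cycles saved per call" is largest. The input numbers are illustrative assumptions:

```python
# Hedged sketch of the variable-partitioning heuristic: bind each variable
# to the single function that would save the most total cycles per second
# if the variable were local to that function's core.

def bind_variables(freq, saved):
    """freq: {function: maximum execution frequency (calls/s)}.
    saved: {(variable, function): cycles saved in one call when the
            variable is in the local memory of the executing core}.
    Returns {variable: function}, at most one function per variable."""
    binding = {}
    variables = {v for (v, _) in saved}
    for v in sorted(variables):
        candidates = [f for (u, f) in saved if u == v]
        # Total cycles saved per second decides the binding.
        binding[v] = max(candidates, key=lambda f: freq[f] * saved[(v, f)])
    return binding

b = bind_variables(
    {"f1": 100, "f2": 10},
    {("a", "f1"): 5, ("a", "f2"): 40, ("b", "f2"): 8},
)
# "a" saves 100*5 = 500 cycles/s with f1 vs 10*40 = 400 with f2,
# so it is bound to f1 even though f2 saves more per call.
```

The example shows why frequency matters: a smaller per-call saving can win when the function runs an order of magnitude more often.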
In one implementation of the first aspect, the one or more characteristics of the plurality of memories may include a size of each memory, the device being configured to associate each variable further according to the size of the variable and the size of each memory.
By considering the size of each memory, the variables can be allocated in a manner that efficiently utilizes the total capacity of the memories, according to the size of each variable and the size of each memory. In addition, it can be guaranteed that no memory is overloaded.
In one implementation of the first aspect, the device may be configured to map the function to the plurality of cores further according to a number of cycles per function required to execute the function at each core.
In one implementation of the first aspect, the device may be configured to map the functions to the plurality of cores such that no core is overloaded.
Alternatively, multiple cores may have different performance and power consumption. For example, the plurality of cores may include at least one high performance core and at least one low power core. The device may be used to map functions with more execution cycles to high performance cores and functions with less execution cycles to low power cores.
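The heterogeneous-core policy above can be sketched as follows. The core names and the cycle threshold are assumptions for illustration only:

```python
# Illustrative sketch: functions with more execution cycles go to a
# high-performance core; the rest go to a low-power core. The threshold
# and core names ("perf_core", "lp_core") are hypothetical.

def split_by_load(cycles, threshold=1000):
    """cycles: {function: execution cycles per call}.
    Returns {function: core}."""
    mapping = {}
    for f, c in cycles.items():
        mapping[f] = "perf_core" if c >= threshold else "lp_core"
    return mapping

m = split_by_load({"vision": 50_000, "logger": 200})
# The heavy vision function lands on the high-performance core,
# the lightweight logger on the low-power core.
```

A real implementation would also respect per-core utilization bounds rather than a single threshold; this sketch only shows the direction of the policy.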
In one implementation of the first aspect, the device may also be configured to combine two or more functions of the same core into a task according to a release pattern of the two or more functions of the same core.
Alternatively, two or more combined functions should not suspend each other. That is, one function should not need to wait for the other to complete.
In one implementation of the first aspect, the apparatus may be further configured to assign a priority to each task to minimize resource utilization of each core.
This may ensure that limited and/or better resources (e.g., faster CPU and faster memory) may be allocated to high priority tasks. Therefore, by ensuring the stability of the high priority tasks, the system performance can be further improved.
In one implementation of the first aspect, the apparatus may be configured to determine the priority of each task according to the deadline of the task.
Alternatively, the deadline may be understood as the specific time by which the task needs to be completed.
One benefit of using deadlines is to ensure that the specific timing constraints of each task are met.
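The deadline-based rule above resembles deadline-monotonic assignment (shorter deadline → higher priority). The patent does not name a specific policy; the following is one plausible reading with illustrative task names:

```python
# Sketch of deadline-based priority assignment: tasks with shorter
# deadlines receive higher priorities (0 = highest). This mirrors
# deadline-monotonic scheduling, which the text does not name explicitly.

def assign_priorities(deadlines):
    """deadlines: {task: deadline in ms}. Returns {task: priority}."""
    ordered = sorted(deadlines, key=deadlines.get)
    return {task: prio for prio, task in enumerate(ordered)}

p = assign_priorities({"airbag": 5, "nav": 500, "engine": 20})
# airbag (5 ms) gets priority 0, engine (20 ms) priority 1,
# nav (500 ms) priority 2 -- safety work preempts infotainment.
```

This matches the vehicle example earlier in the text: safety functions must not be blocked by less urgent functions such as navigation.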
In one implementation of the first aspect, the apparatus may be configured to determine the priority of each task further based on interference caused by one or more other tasks.
Alternatively, the device may assign a lower priority to tasks that may cause greater interference. Thus, more disturbing tasks may be executed after other tasks are completed, to ensure smooth system performance. In this way, interference between tasks can be reduced.
Alternatively, the device may use the disturbance as a constraint in determining the allocation of variables and functions.
In an implementation manner of the first aspect, the device may be configured to determine the priority of each task further according to a blocking time when the task is in a waiting state.
Alternatively, the device may use the blocking time as a constraint in determining the allocation of variables and functions.
In one implementation of the first aspect, to map the plurality of functions to the plurality of cores, the apparatus may be configured to: group two or more functions into a single clustered function, wherein the two or more functions share one or more common partition variables; and map the single clustered function to one of the cores.
Alternatively, the device may be used to group two or more runnable programs and their bound labels into a single clustering function. The single clustering function may have a customizable size. In the present invention, this scheme may be referred to as hierarchical clustering.
By using a single clustering function (or simply clustering) as a basic unit for mapping multiple functions to multiple cores, the number of parameters involved in mapping optimization can be reduced. Clustering has the advantage that the operability of software deployment optimization can be ensured.
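The grouping step above can be sketched with a union-find pass that merges runnables sharing a label. The size cap mentioned in the text ("customizable size") is omitted here, and all names are illustrative:

```python
# Sketch of hierarchical clustering: runnable programs that share
# partition variables are merged into one cluster function, which is
# later mapped to a single core as a unit.

def cluster(uses):
    """uses: {function: set of shared variables}.
    Returns a list of clusters (sets of functions)."""
    parent = {f: f for f in uses}

    def find(x):  # path-halving union-find
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Group the functions that use each variable, then merge them.
    by_var = {}
    for f, labels in uses.items():
        for v in labels:
            by_var.setdefault(v, []).append(f)
    for fs in by_var.values():
        for f in fs[1:]:
            parent[find(f)] = find(fs[0])

    groups = {}
    for f in uses:
        groups.setdefault(find(f), set()).add(f)
    return list(groups.values())

cs = cluster({"r1": {"a"}, "r2": {"a", "b"}, "r3": {"c"}})
# r1 and r2 share label "a", so they form one cluster; r3 stands alone.
```

Mapping whole clusters instead of individual runnables is what shrinks the parameter count of the subsequent optimization, as the text argues.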
In one implementation of the first aspect, the plurality of memories may include a global memory (or GRAM) and a plurality of local memories (or LRAMs). Global memory is shared by multiple cores, with each local memory being directly coupled to one of the multiple cores.
Optionally, after binding one or more variables to a function, the device may be used to map the plurality of functions to the plurality of cores. The device may also be configured to map the one or more bound variables to the LRAM of the core. Because of the binding relationship, the mapping of the one or more bound variables may be done implicitly.
Optionally, the device may be further configured to assign one or more common partition variables associated with the clustering function to a local memory directly coupled to the mapped cores.
In one implementation of the first aspect, the plurality of functions may be a plurality of executable programs of the software component in a runtime environment, and the plurality of shared variables may be inputs of the software component.
A second aspect of the invention provides a method for mapping a plurality of functions sharing a plurality of variables to a multi-core computing system. The multi-core computing system includes a plurality of cores and a plurality of memories. Each core is coupled to a memory of the plurality of memories. The method is performed by an apparatus, comprising the steps of:
-the device assigning each of the variables to one of the plurality of memories according to one or more characteristics of the plurality of memories to obtain an assignment of the plurality of variables to the plurality of memories;
-the device maps the plurality of functions to the plurality of cores according to the allocation of the plurality of variables to the plurality of memories.
In one implementation of the second aspect, the one or more characteristics of the plurality of memories may include an access time for each memory.
In one implementation manner of the second aspect, the method may further include:
the device divides the variable relative to the plurality of functions to obtain the binding relation between the variable and the functions;
the device further maps the plurality of functions to the plurality of cores according to a binding relationship between the variables and the functions.
In one implementation manner of the second aspect, the step of dividing the variable may include: for each function:
-the device obtains a frequency of executing the function;
the device associates one or more of the plurality of variables with the function according to the frequency of the function.
In one implementation of the second aspect, the one or more characteristics of the plurality of memories may include a size of each memory, and the method may include the device associating each variable further according to the size of the variable and the size of each memory.
In one implementation of the second aspect, the method may include mapping the function to a plurality of cores further according to a number of cycles per function required to execute the function at each core.
In one implementation of the second aspect, the method may include mapping the functions to the plurality of cores such that no core is overloaded.
In one implementation manner of the second aspect, the method may include: the device combines two or more functions of the same core into a task according to a release pattern of the two or more functions of the same core.
In one implementation manner of the second aspect, the method may further include: the device assigns a priority to each task to minimize the resource utilization of each core.
In one implementation manner of the second aspect, the method may further include: the device determines the priority of each task based on the deadline of the task.
In one implementation of the second aspect, the method may further include the device determining the priority of each task further based on interference caused by one or more other tasks.
In one implementation of the second aspect, the method may include the device further determining the priority of each task based on a blocking time that the task is in a waiting state.
In one implementation manner of the second aspect, the step of mapping the plurality of functions to the plurality of cores may include: the device groups two or more functions into a single cluster function; a single cluster function is mapped to one of the cores. Two or more functions may share one or more common partition variables.
In one implementation of the second aspect, the plurality of memories may include a global memory and a plurality of local memories. Global memory is shared by multiple cores, with each local memory being directly coupled to one of the multiple cores. The method may further include assigning one or more common partition variables associated with the clustering function to a local memory directly coupled to the mapped core.
In one implementation of the second aspect, the plurality of functions may be a plurality of executable programs of the software component in a runtime environment, and the plurality of shared variables may be inputs of the software component.
The method of the second aspect and its implementation may achieve the same advantages and effects as the device described in the first aspect and its implementation above.
A third aspect of the invention provides a computer program comprising instructions which, when executed by a computer, cause the computer to perform a method according to the second aspect or any implementation thereof.
A fourth aspect of the invention provides a non-transitory storage medium storing executable program code which, when executed by a processor, performs a method according to the second aspect or any one of its implementation forms.
It should be noted that all devices, elements, units and modules described in this application may be implemented by software or hardware elements or any type of combination thereof. All steps performed by the various entities described in this application and the functions described to be performed by the various entities are intended to indicate that the respective entities are used to perform the respective steps and functions. Although in the following description of specific embodiments, specific functions or steps performed by external entities are not reflected in the description of specific detailed elements of the entity performing the specific steps or functions, it should be clear to a skilled person that these methods and functions may be implemented by corresponding hardware or software elements or any combination thereof.
Drawings
The following description of specific embodiments sets forth aspects and implementations described above in connection with the accompanying drawings.
Fig. 1 shows an example of a device according to the invention.
Fig. 2 shows an example of variable partitioning according to the present invention.
Fig. 3 shows an example of hierarchical clustering according to the present invention.
Fig. 4 shows a diagram of a method according to the invention.
Fig. 5 shows a diagram of another method according to the invention.
Fig. 6 shows an example of a NUMA architecture.
Detailed Description
The present invention relates generally to software deployment optimization for multi-core computing systems (or simply multi-core systems). The present invention may be applied to a variety of multi-core systems, such as the NUMA architecture shown by way of example in FIG. 6.
Fig. 1 shows an example of a device 100 according to the invention.
Device 100 may be a software deployment tool adapted to map multiple functions sharing multiple variables to a multi-core computing system. The multi-core computing system includes a plurality of cores and a plurality of memories. Each core is directly coupled to memory, which may be referred to as LRAM. Alternatively, there may be at least one shared memory that is connected to the global system bus and shared by some or all of the cores. The shared memory may be referred to as GRAM.
For example, as shown on the right side of Fig. 1, a multi-core computing system may include two cores (core 1, core 2) and three memories (LRAM 1, LRAM 2, GRAM). LRAM 1 and LRAM 2 are directly coupled to core 1 and core 2, respectively, while GRAM is shared globally between core 1 and core 2. The memories may have different characteristics, such as, but not limited to, access time and size. For example, a core can access the memory directly connected to it about an order of magnitude faster than other memories.
The left side of FIG. 1 illustrates an abstract example of functions to be performed on a multi-core computing system and variables shared between those functions. In the present invention, these functions may be referred to as executable programs. Variables may be understood as read data and write data associated with each function. In the present invention, the variable may be referred to as a tag.
Alternatively, the executable program may be part of one or more software components. The software components may be architectural elements that provide and/or require interfaces and are interconnected to fulfill architectural responsibilities. Alternatively, the software component may be for a runtime environment, such as AUTOSAR.
As can be seen from fig. 1, one variable may be dedicated to only one function (e.g., variable c and function 2), or may be used by two or more functions. Accordingly, a function may have only one variable (e.g., function 1 and variable a), or may have two or more variables.
The apparatus 100 is configured to assign each of the variables to one of the plurality of memories according to one or more characteristics of the plurality of memories to obtain an assignment of the plurality of variables to the plurality of memories. The apparatus 100 is also configured to map a plurality of functions to a plurality of cores based on the allocation of a plurality of variables to a plurality of memories.
It should be noted that the step of "mapping a plurality of functions to a plurality of cores according to the allocation of a plurality of variables to a plurality of memories" in the present invention may also be understood as "mapping a plurality of functions to a plurality of cores jointly with the allocation of a plurality of variables to a plurality of memories". That is, the device considers one or more characteristics of the plurality of memories when mapping the variables. Since the functions share variables, the mapping of variables and the mapping of functions may affect each other. That is, the device 100 may also be used to map variables to memories according to the function-to-core mapping. The entire process can be understood as an optimization that iterates until a cost function is satisfied, at which point the device 100 obtains the optimal mapping of the variables and the optimal mapping of the functions.
The advantage is that the resource utilization of the multi-core system may be optimized, or the slack may be maximized.
Alternatively, the apparatus 100 may be used to determine an optimal mapping of functions to cores and an optimal mapping of variables to memory using binary integer linear programming (binary integer linear programming, BILP). In this way, the resource utilization of the multi-core system may be further optimized through BILP.
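The excerpt does not give the BILP formulation itself. The toy search below enumerates the same binary decisions (which core runs each function, which memory holds each variable) and minimizes a simple access-cost objective; a real deployment tool would hand these binaries and constraints to a BILP solver instead of enumerating. Topology, costs, and names are assumptions:

```python
# Brute-force stand-in for the BILP: minimize total memory-access cost
# over all (function -> core, variable -> memory) assignments of a toy
# instance. Illustration only; not the patent's actual formulation.
from itertools import product

CORES = ["core1", "core2"]
MEMS = ["LRAM1", "LRAM2", "GRAM"]
LOCAL = {"core1": "LRAM1", "core2": "LRAM2"}  # assumed topology

def cost_of(core, mem):
    return 1 if LOCAL[core] == mem else 10  # local vs remote access

def solve(funcs, reads):
    """reads: {(function, variable): accesses per period}.
    Returns (cost, function -> core, variable -> memory)."""
    vars_ = sorted({v for (_, v) in reads})
    best = None
    for cores in product(CORES, repeat=len(funcs)):
        for mems in product(MEMS, repeat=len(vars_)):
            f2c = dict(zip(funcs, cores))
            v2m = dict(zip(vars_, mems))
            cost = sum(n * cost_of(f2c[f], v2m[v])
                       for (f, v), n in reads.items())
            if best is None or cost < best[0]:
                best = (cost, f2c, v2m)
    return best

cost, f2c, v2m = solve(["f1", "f2"], {("f1", "a"): 5, ("f2", "a"): 1})
# Both functions read "a"; the optimum co-locates "a" and its readers.
```

Enumeration is exponential in the number of functions and variables, which is exactly why the text resorts to BILP solvers and to the clustering step that reduces the number of binary decision variables.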
Alternatively, the device 100 may be used to obtain an application definition of a function (or software component) as one input for determining a software deployment. For example, the device 100 may be used to obtain information about a set of functions, as well as information about read data and write data (e.g., variables, tags) for each function, such as data size, latency requirements, dependencies, and the like.
Alternatively, the device 100 may be used to obtain the specification of a multi-core system as one input to determine the software deployment. For example, the specification of a multi-core system may include the number of cores and memory, the hierarchy of memory, and information about the cost of local memory (e.g., LRAM) and remote memory (e.g., GRAM) access.
Alternatively, the device 100 may be used to obtain real-time attributes of the functions (or software components) with respect to the multi-core system as one input for determining the software deployment. For example, the real-time attributes may include: the execution time of each function, assuming its variables are in GRAM (or in the memory with the maximum access time); and the reduction in execution cycles for one call of a function when its variables are in its LRAM. The real-time attributes may be estimated or calculated using existing runtime performance analysis tools known in the art.
Fig. 2 shows an example of variable partitioning according to the present invention. Variable partitioning may be referred to as tag binding.
Alternatively, the device 100 of FIG. 1 may be used to divide variables with respect to multiple functions to obtain binding relationships between variables and functions. The device 100 may also be used to map multiple functions to multiple cores further according to binding relationships between variables and functions.
A variable (or tag) may represent a portion (possibly as small as a single bit) of memory that the runnable programs use to communicate with each other. The tags are preferably allocated to the available memory areas (which may be either LRAM or GRAM) in an appropriate manner to improve communication efficiency. The impact of variable partitioning (or tag binding) lies in its effect on the execution time of the runnable programs. In fact, a core accessing the memory it is directly connected to is about an order of magnitude faster than accessing remote memory.
In fig. 2, 7 runnable programs 1 to 7 and 11 tags a to k are exemplarily shown. One runnable program may share one or more tags with one or more other runnable programs. In a typical embedded application scenario in the automotive field, there may be one thousand runnable programs and ten thousand tags. Thus, a single formulation of the joint mapping of tags and runnable programs is not easily tractable. As an initial step, the device 100 may therefore bind tags to runnable programs prior to mapping the plurality of functions to the plurality of cores. The general principle of tag binding is that when runnable i is mapped to core k, all tags L_i bound to runnable i may be mapped to the memory local to core k, optionally under the condition that the local memory is not overloaded.
The right side of fig. 2 graphically represents an example of the determination of disjoint subsets L_i of L. It can be seen that each tag is bound to only one runnable program, which is represented by a solid line. A bound tag may still be accessed by other runnable programs without being bound to them, as indicated by the dashed lines. That is, the binding between the variables and the functions may be disjoint, meaning that each tag may be bound to at most one runnable program.
The tag binding problem can be expressed as follows. The device 100 may be used to obtain information about a set of N runnable programs and a set of L tags as inputs. For each pair (i, l) ∈ N × L, the device 100 may be used to:
- if tag l is used by runnable i, determine g_{i,l}, where g_{i,l} indicates the reduction in execution time of one call of runnable i when tag l is allocated to the memory local to the core on which runnable i executes;
- if tag l is not used by runnable i, set g_{i,l} = 0.
N, L, i, j and l are positive integers. The gain g_{i,l} may depend on the size of tag l and on the number of accesses to tag l made by one call of runnable i.
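As a hedged illustration of the gain just described, g_{i,l} can be modeled as the access count multiplied by the per-access cycle difference between remote and local memory. The function name and the cycle costs below are assumptions for illustration, not values from the invention:

```python
# Illustrative cost model for the gain g_{i,l}: cycles one call of runnable i
# saves when tag l resides in local memory (LRAM) instead of remote memory
# (GRAM). The access costs are assumed figures, not from the source.

T_GRAM = 20  # assumed cycles per remote (GRAM) access
T_LRAM = 2   # assumed cycles per local (LRAM) access

def gain(accesses_per_call: int, uses_tag: bool) -> int:
    """g_{i,l}: cycles saved per call of runnable i when tag l is local."""
    if not uses_tag:
        return 0  # g_{i,l} = 0 if runnable i does not use tag l
    return accesses_per_call * (T_GRAM - T_LRAM)
```

Under this assumed model, a tag accessed ten times per call would save 10 × (20 − 2) = 180 cycles when placed locally.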
For each runnable i ∈ N, f_i represents the maximum execution frequency of the runnable program. The device 100 may be used to obtain the maximum execution frequency of each runnable program as an input from a software provider or the like.
For tag binding, the binary variables x_{i,l} are introduced, where x_{i,l} = 1 if and only if tag l is bound to runnable i:

x_{i,l} ∈ {0, 1}, ∀(i, l) ∈ N × L (1)

The purpose of binding tags to runnable programs is to minimize resource utilization. Assuming that each runnable program can run at a frequency f_i, the metric to be maximized may be expressed as shown in the following equation:

max Σ_{i∈N} Σ_{l∈L} f_i · g_{i,l} · x_{i,l} (2)

The principle of the cost function in equation 2 is to assign tags l to runnable programs i so as to obtain a binding relationship between tags and runnables from which the multi-core system benefits the most in saved processing time. Equations 1 and 2 maximize the utilization gain, represented by the execution time savings multiplied by the execution frequency f_i.
Let S_i denote the size (or amount) of local memory (e.g., LRAM) allocated to runnable i, and s_l the size of tag l. The device 100 may also be used to apply the following constraint:

Σ_{l∈L} s_l · x_{i,l} ≤ S_i, ∀i ∈ N (3)

This ensures that the memory local to runnable i is not overloaded. Furthermore, the device 100 may also be used to apply the following constraint:
The aim is to ensure that the total size of the local memory is not overloaded:

Σ_{i∈N} S_i ≤ Σ_{k∈M} S̄_k (4)

where S̄_k represents the total size of the local memory of core k.
The device may also be used to ensure that each tag is bound to at most one runnable program by applying the following constraint:

Σ_{i∈N} x_{i,l} ≤ 1, ∀l ∈ L (5)

After finding the optimal solution, the disjoint subset L_i is defined as the set of all tags assigned to runnable i.
Alternatively, the tags l may be bound to the runnable programs according to the following rules:
1. Sort the tags in descending order of max_i f_i · g_{i,l}.
2. Select tags to bind to the runnable programs in the order defined above until constraint (4) is reached.
3. Bind each selected tag l to the runnable j such that f_j · g_{j,l} = max_i f_i · g_{i,l}.
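The greedy binding rules above can be sketched as follows. This is an illustrative sketch only: the function name, data structures, per-runnable memory budgets, and the behavior when a tag does not fit are assumptions, not part of the invention:

```python
def greedy_bind(f, g, s, S):
    """Greedily bind tags to runnables (hypothetical sketch).

    f[i]    : maximum execution frequency of runnable i
    g[i][l] : gain g_{i,l} (0 if runnable i does not use tag l)
    s[l]    : size of tag l
    S[i]    : local-memory budget of runnable i (assumed per-runnable limit)
    Returns {l: i} binding each tag to at most one runnable.
    """
    runnables = range(len(f))
    tags = range(len(s))
    # Rule 1: sort tags by descending best utilization gain max_i f_i * g_{i,l}
    order = sorted(tags, key=lambda l: max(f[i] * g[i][l] for i in runnables),
                   reverse=True)
    used = [0] * len(f)  # memory consumed per runnable so far
    binding = {}
    for l in order:
        # Rule 3: bind to the runnable j maximizing f_j * g_{j,l}
        j = max(runnables, key=lambda i: f[i] * g[i][l])
        if f[j] * g[j][l] <= 0:
            continue  # no runnable benefits from this tag being local
        # Rule 2: bind only while the memory constraint still holds
        if used[j] + s[l] <= S[j]:
            used[j] += s[l]
            binding[l] = j
    return binding
```

For example, with two runnables and two tags where runnable 0 heavily uses tag 0 and runnable 1 uses tag 1, the sketch binds tag 0 to runnable 0 and tag 1 to runnable 1.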
The device 100 is also used to map the functions (or runnable programs) to the cores (or CPUs). Alternatively, the device 100 may be used to obtain an optimal solution for mapping runnable programs to CPUs by formulating a BILP problem. For mapping N runnable programs on M CPUs, the binary variables y_{i,k} may be used, where y_{i,k} = 1 if and only if runnable i is mapped to CPU k:

y_{i,k} ∈ {0, 1}, ∀i ∈ N, ∀k ∈ M (6)

Each runnable program should map to one and only one CPU. This may be ensured by the following constraint:

Σ_{k∈M} y_{i,k} = 1, ∀i ∈ N (7)
Alternatively, the device 100 may also be used to determine whether any pair of runnable programs maps to the same CPU. If two runnable programs share the same CPU, communication between them may be more efficient through the LRAM of that CPU, which is faster than through GRAM. To express this condition in the BILP problem, the binary variables w_{i,j,k} may be used, where w_{i,j,k} = 1 indicates that runnables i and j are both mapped to CPU k:

w_{i,j,k} ∈ {0, 1}, ∀i, j ∈ N, ∀k ∈ M (8)

Furthermore, if two runnable programs i and j are bound to the same CPU k, then w_{i,j,k} must be equal to 1, which is specified by:

w_{i,j,k} ≥ y_{i,k} + y_{j,k} − 1, ∀i, j ∈ N, ∀k ∈ M (9)
before describing further the constraints of the mapping of the executable program to the CPU, the following notations are introduced.
First of all,representing the execution cycle of the executable i, assuming:
as described above, L bound to executable i i All tags in (a) are stored in the LRAM, so access is faster;
all other tags are stored in GRAM.
Second, ΔC i,j Indicating that if the executable program j is executed on the same CPUWhen this condition is met), then one call of executable i saves execution cycles. If the executable i uses L i This will occur if any of the tags in (a). ΔC i,j Written as gain g introduced before i,l Can be as follows
Again, C i,k And ≡ 0 represents the number of execution cycles required for the executable i to run on the CPU k, defined as follows:
note that if the executable i is not mapped to CPU k, C i,k Equal to zero.
Alternatively, when the device 100 is used to partition the tags with respect to the runnable programs, the partitioned tag or tags bound to runnable i may be assigned to the LRAM of the core on which runnable i runs. A memory size S_i may be required in the LRAM to store the partitioned tag or tags. If the available size of the LRAM of CPU k is expressed as S̄_k, the constraint on the limited size of the LRAM can be as shown in the following equation:

Σ_{i∈N} y_{i,k} · S_i ≤ S̄_k, ∀k ∈ M (12)
Alternatively, the device 100 may be used to ensure that no core is overloaded; this corresponds to a maximum utilization constraint, as follows:

Σ_{i∈N} f_i · C_{i,k} ≤ U_k^max, ∀k ∈ M (13)

where U_k^max denotes the maximum allowed utilization of core k.
An advantage of mapping runnable programs to cores based on a BILP problem may be that core utilization can be minimized according to a generic metric or a per-core metric. That is, the mapping may be driven by slack maximization across all cores.
Alternatively, the device 100 may also be used to combine two or more functions of the same core into a task according to the release pattern of the two or more functions of the same core. For example, if two runnable programs have the same release pattern, do not self-suspend, and map to the same CPU, they can be aggregated into the same task.
The aggregation of runnable programs can be as shown in the following formulas.
The set of tasks is denoted by T. To form a task τ ∈ T, the set of runnable programs N is partitioned such that each runnable program belongs to one and only one subset N_τ of the partition of N.

An equivalence relation ~ over pairs of runnable programs encodes the aggregation of runnable programs. That is, i ~ j means that "the two runnable programs i and j have the same release pattern, and neither of them self-suspends". The runnable programs belonging to the same task τ can be defined accordingly, where y*_{i,k} is the variable representing the optimal mapping.
Tasks according to equation 11The execution cycle of (a) is as follows:
tasksThe minimum inter-arrival time (or period) of (a) is:
it should be noted that all runnable programs in the same task may have the same cycle.
TasksThe expiration date of (2) is:
the task partitioning on the M CPUs may be defined as follows:
It should be noted that if y_{i,k} = 1 for some runnable program i of a task, then y_{j,k} = 1 for all runnable programs j of that task.
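The aggregation rule described above (same core, same release pattern, no self-suspension) can be sketched as follows. The dictionary layout and the treatment of self-suspending runnables as singleton tasks are illustrative assumptions:

```python
from collections import defaultdict

def aggregate_tasks(runnables):
    """Group runnables into tasks (hypothetical sketch).

    runnables: list of dicts with keys
      'cpu'           : core the runnable is mapped to
      'period'        : release pattern (minimum inter-arrival time)
      'self_suspends' : whether the runnable suspends itself
      'cycles'        : execution cycles of the runnable
    Two runnables aggregate iff they share the cpu and the period and
    neither self-suspends. Returns tasks with summed cycles.
    """
    groups = defaultdict(list)
    singles = []
    for r in runnables:
        if r['self_suspends']:
            singles.append([r])  # self-suspending runnables stay alone
        else:
            groups[(r['cpu'], r['period'])].append(r)
    tasks = []
    for members in list(groups.values()) + singles:
        tasks.append({
            'cpu': members[0]['cpu'],
            'period': members[0]['period'],               # shared by members
            'cycles': sum(m['cycles'] for m in members),  # C_tau = sum of C_i
        })
    return tasks
```

For instance, three runnables on core 0 with period 10, one of which self-suspends, yield two tasks: one aggregating the first two, one containing the self-suspending runnable alone.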
Optionally, the device 100 may also be used to assign priorities to each task to minimize the resource utilization of each core.
For assigning priorities to tasks, the priorities are only comparable within the same CPU, since the scheduling decisions on different CPUs are independent. Therefore, priorities must be assigned to tasks per CPU; for brevity, the CPU index is omitted in all equations relating to priority assignment.

Priorities between tasks generate an ordering. More strictly speaking, the ordering of tasks by priority is a total order between disjoint subsets of tasks, where each subset contains the tasks having the same priority. If the cardinality of all subsets is 1, then all tasks have different priorities and are in a total order.
The priority assignment may be modeled by binary variables p_{i,j}, where p_{i,j} = 1 if and only if task i has a higher priority than task j:

p_{i,j} ∈ {0, 1}

The relationship between tasks modeled by the variables p_{i,j} is a partial order between the tasks. The order is partial because tasks may have the same priority. Constraints on the variables p_{i,j} are defined to specify the properties of the order. These constraints include:
- reflexivity, handled implicitly by omitting the variables p_{i,i};
- transitivity, specified by the following constraints;
- antisymmetry, specified explicitly by the following equation.
It should be noted that if all tasks are required to have different priorities, and thus the priority order is full, the antisymmetric constraint is replaced with
Conversely, if the support tasks have the same priority, it may be necessary to explicitly specify equivalence relations between the tasks having the same priority. In fact, not all partial orders correspond to valid priority assignments.
In order to specify equivalence relationships between tasks of incomparable priority, the variables e_{i,j} are defined, encoding that two tasks i and j have the same priority:

e_{i,j} ∈ {0, 1}

It should be noted that the variables e_{i,j} are not independent, as they are defined in terms of the previously introduced variables. Nevertheless, it is convenient to introduce them in order to have a more compact notation for defining the constraints below.

The equivalence between a task pair i and j (where i < j) is specified as follows:
- reflexivity is implicitly specified by omitting the variables e_{i,i};
- symmetry is implicitly specified by defining the variables e_{i,j} only for i < j;
- transitivity is specified explicitly by:
Alternatively, the device 100 may be used to assign a limited number of priorities to the tasks. For example, if the number of priority levels is limited to P, a constraint may be defined by bounding a sum of the priority variables. For example, if P = 2, then:

This means that if the priority of task i is higher than that of some task j, then the priority of task j cannot be higher than that of any other task; otherwise, three priority levels would be required.
Alternatively, each task may have a deadline D_i. If not explicitly set, the implicit deadline of task i is equal to its period T_i defined in equation 18. The deadline constraint for all tasks on a given CPU is written as:

C_i + I_i + B_i ≤ D_i (26)
where:
- C_i represents the worst-case execution time of the task itself,
- I_i represents the interference caused by higher- or same-priority tasks,
- B_i represents the blocking time, i.e., the time task i spends waiting for some resources (e.g., one or more tags used by other runnables).
The execution time C_i of task i is given by equation 17. The interference I_i experienced by task i is written as a linear combination of the decision variables:

One variable equals 1 if and only if task i has the same priority as task j; similarly, another variable equals 1 if and only if task j has a higher priority than task i. In equation 29, two contributions to the interference are introduced.
The first is the execution of tasks with the same priority. Tasks with the same priority are scheduled in first-in-first-out (FIFO) order, so this contribution is equal to the sum of the execution cycles of all tasks with the same priority, except task i itself. The second is the execution of tasks with a higher priority. Here an over-approximation of the interference is used, which fully accounts for the execution cycles of all higher-priority tasks released in the interval [0, D_i). The exact expression of the interference I_i would require evaluating the minimum of several linear expressions. However, this minimum breaks the convexity of the feasible region, which makes the optimization problematic. Thus, the I_i of equation 29 is used only to assign priorities.
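A hedged sketch of the two interference contributions just described. The release-count factor ceil(D_i / T_j) for higher-priority tasks is one possible over-approximation consistent with "all higher-priority tasks released in [0, D_i)", not necessarily the exact formula of equation 29, and the implicit deadline D_i = T_i is assumed:

```python
import math

def interference_bound(i, tasks, same_prio, higher_prio):
    """Linear over-approximation of the interference I_i on task i (sketch).

    tasks[j] = (C_j, T_j): execution cycles and period of task j.
    same_prio  : indices of tasks with the same priority as i (FIFO order)
    higher_prio: indices of tasks with a higher priority than i
    D_i is taken implicitly equal to the period T_i (an assumption).
    """
    D_i = tasks[i][1]  # implicit deadline equal to the period
    # FIFO contribution: every equal-priority task runs once before task i
    fifo = sum(tasks[j][0] for j in same_prio if j != i)
    # Over-approximation: all higher-priority jobs released in [0, D_i)
    hp = sum(math.ceil(D_i / tasks[j][1]) * tasks[j][0] for j in higher_prio)
    return fifo + hp
```

Unlike the exact response-time expression, this bound is linear in the per-task cycle terms, which is why a bound of this shape fits a priority-assignment formulation.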
The blocking time B_i is the time task i spends in a "waiting" state, which may result from an attempt to access a shared resource that is locked by a lower-priority task on the same CPU or by any task executing on another CPU. The waiting state may also be caused by a blocking system call (e.g., a remote procedure call). The blocking time B_i is a linear function of the decision variables and is therefore well suited to the priority assignment problem.
Alternatively, the device 100 may determine the optimization objective. The optimization objective is to maximize the extensibility of the application deployment, which can be understood as keeping as much "space" as possible to accommodate new functions in the future. Each of the constraints introduced previously can be brought into the following normalized form:

linear combination of binary variables ≤ 1 (28)
Then, an additional variable z representing the "space" reserved for future extensions is defined, and all constraints are modified as follows:

linear combination of binary variables + α · z ≤ 1 (29)

where α ∈ [0, 1] represents the amount of slack desired in the constraint. The larger the value of α, the more slack is targeted in that constraint.
The objective of the optimization problem is then:
"maximize z" (30)
The optimal value z* found represents the amount of slack in the constraints.
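For a fixed binary assignment, the optimal slack follows directly from the modified constraints: each constraint of form (29) caps z at (1 − lhs)/α, and the tightest cap wins. A minimal sketch, with illustrative values:

```python
def max_slack(constraints):
    """Largest z satisfying lhs + alpha * z <= 1 for every constraint.

    constraints: list of (lhs, alpha) pairs, where lhs is the linear
    combination of binary variables already evaluated for a fixed
    assignment and alpha is the slack weight in (0, 1].
    """
    # Each constraint caps z at (1 - lhs) / alpha; the tightest cap wins.
    return min((1.0 - lhs) / alpha for lhs, alpha in constraints if alpha > 0)

# Example: two constraints with different weights on the reserved space.
z_star = max_slack([(0.6, 1.0), (0.5, 0.5)])  # caps are 0.4 and 1.0
```

In the full BILP, the binary variables are of course free as well, so the solver jointly chooses the assignment and z; the sketch only illustrates how z* measures the remaining room once an assignment is fixed.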
The optimization problem described above may require a large number of parameters to be considered and the computational cost required to evaluate the optimization may be significant. Thus, the device 100 may be used to perform hierarchical clustering of runnability programs and tags.
The device 100 may be used to implement a greedy procedure that groups runnable programs likely to end up on the same core in an optimal allocation. The device 100 may be used to evaluate this likelihood based on the gain metric maximized in equation 2.
For example, the grouping of runnable programs can be readily implemented by hierarchical clustering, in particular by an agglomerative method that builds larger and larger groups of runnable programs from the bottom up. The usual distance concept is replaced by a similarity measure, whose value equals the gain the two runnable programs would obtain if they ran on the same core with their assigned tags located in the local memory of that core. The similarity concept extends automatically to the case where the elements to be joined are already groups of runnable programs, since all tags assigned to the group members are used to evaluate the joint similarity value.
An advantage of using hierarchical clustering is that the entire hierarchical tree is built in the process, and the tree can be cut at any chosen level, thus yielding an algorithm that groups and partitions the runnable programs into any number of clusters. This may also be augmented, for example, by evaluating in parallel the cycles and memory required to run each cluster, in order to provide a stop (or "branch") condition for the clustering process when the memory required by a cluster exceeds a predefined amount.
Since hierarchical clustering is a greedy, imprecise approach, it can help reduce the number of runnable programs that are then submitted to the optimization problem. Thus, the complexity of the optimization problem can be reduced.
Fig. 3 shows an example of hierarchical clustering according to the present invention. The device 100 may be used to group the runnable programs 1 to 7 into a plurality of clusters 310, 311, 320, 321, 322. Cluster 311 is a branch (or sub-cluster) of cluster 310, and clusters 321 and 322 are branches of cluster 320. The apparatus 100 may be used to adjust the number of clusters according to the complexity of the optimization problem. For example, if, after evaluation, a maximum of three clusters are supported, device 100 may use clusters 310, 321, and 320 as the basis for optimization.
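The agglomerative clustering described above can be sketched as repeatedly merging the most similar pair of groups until the desired number of clusters remains. The similarity function, the cut criterion, and the O(n³) pairwise search are illustrative assumptions:

```python
def agglomerate(items, sim, n_clusters):
    """Bottom-up clustering of runnables by pairwise similarity (sketch).

    items      : list of runnable ids
    sim        : sim(a, b) -> gain if runnables a and b share a core
    n_clusters : stop when this many clusters remain (the "cut" level)
    Group-to-group similarity sums the gains over all cross-pairs, so
    groups of runnables are handled the same way as single runnables.
    """
    clusters = [[x] for x in items]
    while len(clusters) > n_clusters:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = sum(sim(x, y) for x in clusters[a] for y in clusters[b])
                if best is None or s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)  # merge the most similar pair
    return clusters
```

With four runnables where pairs (1, 2) and (3, 4) have high mutual gain, cutting at two clusters recovers exactly those two groups, mirroring the clusters of fig. 3.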
Fig. 4 shows a diagram of a method 400 according to the invention. Method 400 is performed by a device for mapping a plurality of functions sharing a plurality of variables to a multi-core computing system. The multi-core computing system includes a plurality of cores and a plurality of memories, wherein each core is coupled to a memory of the plurality of memories. The method 400 includes the steps of:
-step 401: the device allocates each of the variables to one of the plurality of memories according to one or more characteristics of the plurality of memories to obtain allocation of the plurality of variables to the plurality of memories;
-step 402: multiple functions are mapped to multiple cores based on the allocation of multiple variables to multiple memories.
From the perspective of fig. 1-3 described above with respect to device 100, the steps of method 400 may have the same functionality and details. Accordingly, at this point, the corresponding method implementation will not be described again.
Fig. 5 shows a diagram of a method 500 according to the invention. The method 500 is based on the method 400 of fig. 4, comprising the steps of:
-step 501: the device of FIG. 4 binds a tag to an executable program;
-step 502: the device maps the executable program to the core;
-step 503: the device maps the tag to the memory;
-step 504: the device maps the runnable program to a task;
-step 505: the equipment allocates priority to the task;
-step 506: the device generates constraints of software deployment optimization problems;
-step 507: the device solves the optimization problem according to the target function.
Step 501 is optional and step 501 may have the same functionality and details from the perspective of the tag binding problem involved in fig. 1-3 and equations 1-5.
Step 502 corresponds to step 402 of fig. 4, and step 502 may have the same features and details from the perspective of fig. 1-3 described with respect to equations 6-13.
Step 503 corresponds to step 401 of fig. 4 and may likewise have the same features and functions. It should be noted that steps 502 and 503 may be performed in no strict order.
Step 504 is optional, and step 504 may have the same features and details from the perspective of fig. 1-3 described with respect to equations 14-18.
Step 505 is optional, and step 505 may have the same features and details from the perspective of fig. 1-3 described with respect to equations 19-25.
From the perspective of fig. 1-3 described with respect to equations 26 and 27, step 506 may have the same features and details.
From the perspective of fig. 1-3 described with respect to equations 28-30, step 507 may have the same features and details.
In addition to mapping a plurality of functions to a plurality of cores, a tag-to-memory mapping is provided in the present invention. The advantage is that the resource utilization of the computing system can be optimized. The invention can be applied to embedded application deployment on multi-core architecture systems. An embedded application may include software components comprising runnable programs that execute in a runtime environment. Runnable programs are resource-sensitive. By correctly mapping the shared tags into memory (LRAM or GRAM) according to the BILP problem, overall system performance can be improved.
Alternatively, by binding the tags to the runnable program and using hierarchical clustering, the number of parameters used to solve the BILP problem can be reduced by about an order of magnitude. This may make the BILP problem easy to handle.
Alternatively, by assigning priorities to tasks formed by sets of runnable programs mapped to the same core, the slack of all cores can be maximized.
Another advantage of the present invention is that the equations provided in the present invention are all expressed as linear constraints between parameters. Thus, the complexity of the BILP problem is reduced.
It should be noted that the method for executing software deployment disclosed in the present invention may be implemented in any programming language and may be used for execution on any hardware platform.
The invention can be applied to any multi-core or heterogeneous hardware platform where real-time guarantees should be adhered to. The present invention may have the following advantages.
By using optimization techniques, more assurance of the attributes obtained by the software application deployment may be ensured. In addition, available resources, such as CPU resources and memory resources, are utilized more efficiently. The relaxation of resources can be maximized. For example, in the event of a failure, real-time (or reliability) assurance may be ensured. Furthermore, resources may be reserved for new functions in the future and the same configuration may be reused.
The present invention may be applied to any software deployment tool, such as an AUTOSAR system configuration tool. The invention can improve the product quality and realize the automation of the software deployment process.
In this aspect, the device may include a processor or processing circuitry (not shown) for performing, conducting, or initiating various operations of the device described herein. The processing circuitry may comprise hardware and/or the processing circuitry may be controlled by software. The hardware may include analog circuits or digital circuits, or both analog and digital circuits. The digital circuitry may include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (digital signal processor, DSP), or a multi-purpose processor. The device may also include memory circuitry that stores one or more instructions that may be executed by the processor or processing circuitry (e.g., under control of software). For example, the memory circuit may include a non-transitory storage medium storing executable software code that, when executed by a processor or processing circuit, causes the device to perform various operations. In one embodiment, a processing circuit includes one or more processors and a non-transitory memory coupled to the one or more processors. The non-transitory memory may carry executable program code that, when executed by one or more processors, causes the apparatus to perform, carry out, or initiate the operations or methods described herein.
The invention has been described in connection with various embodiments and implementations as examples. Other variations can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the invention, and the independent claims. In the claims and in the description, the word "comprising" does not exclude other elements or steps, and the "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (17)

1. An apparatus (100) for mapping a plurality of functions sharing a plurality of variables to a multi-core computing system, the multi-core computing system comprising a plurality of cores and a plurality of memories, each core coupled to a memory of the plurality of memories, the apparatus being for:
assigning each of the variables to one of the plurality of memories according to one or more characteristics of the plurality of memories to obtain an assignment of the plurality of variables to the plurality of memories;
the plurality of functions is mapped to the plurality of cores according to the allocation of the plurality of variables to the plurality of memories.
2. The device (100) of claim 1, wherein the one or more characteristics of the plurality of memories include an access time for each memory.
3. The device (100) according to claim 1 or 2, further being adapted to:
dividing the plurality of variables relative to the plurality of functions to obtain binding relations between the plurality of variables and the plurality of functions;
the plurality of functions is further mapped to the plurality of cores according to the binding relationship between the plurality of variables and the plurality of functions.
4. A device (100) according to claim 3, characterized in that, in order to divide the variables, the device is adapted to, for each function:
obtaining a frequency at which the function is performed;
one or more of the plurality of variables are associated with the function according to the frequency of the function.
5. The device (100) of claim 4, wherein the one or more characteristics of the plurality of memories include a size of each memory, the device being configured to associate each variable further based on the size of each variable and the size of each memory.
6. The device (100) of any of claims 1 to 5, wherein the device is configured to map the plurality of functions to the plurality of cores further according to a number of cycles of the function required to perform each function at each core.
7. The device (100) according to any one of claims 1 to 6, wherein the device is configured to map the functionality to the plurality of cores in such a way that each core is not overloaded.
8. The device (100) according to any one of claims 1 to 7, further for combining two or more functions of a same core into a task according to a release pattern of the two or more functions of the same core.
9. The device (100) of claim 8, further configured to assign a priority to each task to minimize resource utilization of each core.
10. The apparatus (100) of claim 9, wherein the priority for each task is determined based on a deadline of the task.
11. The apparatus (100) of claim 10, wherein the priority for each task is determined further based on interference caused by one or more other tasks.
12. The apparatus (100) of claim 11, wherein the priority of each task is determined further based on a blocking time the task is in a waiting state.
13. The device (100) according to any one of claims 3 to 12, wherein to map the plurality of functions to the plurality of cores, the device is to:
Grouping two or more functions into a single clustered function, wherein the two or more functions share one or more common partitioning variables;
mapping the single cluster function to one of the plurality of cores.
14. The device (100) of claim 13, wherein the plurality of memories includes a global memory and a plurality of local memories, the global memory being shared by the plurality of cores, each local memory being directly coupled to one of the plurality of cores, the device further configured to allocate the one or more common partition variables associated with the clustering function to the local memory directly coupled to the mapped core.
15. The device (100) of any of claims 1 to 14, wherein the plurality of functions are a plurality of executable programs of a software component in a runtime environment, the plurality of shared variables being inputs to the software component.
16. A method (400) for mapping a plurality of functions sharing a plurality of variables to a multi-core computing system, the multi-core computing system including a plurality of cores and a plurality of memories, each core coupled to a memory of the plurality of memories, the method comprising:
The device allocates (401) each of the variables to one of the plurality of memories according to one or more characteristics of the plurality of memories to obtain an allocation of the plurality of variables to the plurality of memories;
the device maps (402) the plurality of functions to the plurality of cores according to the allocation of the plurality of variables to the plurality of memories.
17. A computer program product comprising instructions which, when executed by a second computer, cause the second computer to perform the method of claim 16.
CN202280005934.7A 2022-05-23 2022-05-23 Software optimization method and equipment of NUMA architecture Pending CN117441161A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/063829 WO2023227187A1 (en) 2022-05-23 2022-05-23 Software optimization method and device for numa architecture

Publications (1)

Publication Number Publication Date
CN117441161A true CN117441161A (en) 2024-01-23

Family

ID=82117403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280005934.7A Pending CN117441161A (en) 2022-05-23 2022-05-23 Software optimization method and equipment of NUMA architecture

Country Status (2)

Country Link
CN (1) CN117441161A (en)
WO (1) WO2023227187A1 (en)

Also Published As

Publication number Publication date
WO2023227187A1 (en) 2023-11-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination