CN112015692A - Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes - Google Patents



Publication number
CN112015692A
Authority
CN
China
Prior art keywords
thread
communication
processor
buffer
cache region
Legal status: Pending
Application number
CN202010698129.XA
Other languages
Chinese (zh)
Inventor
汪楠
柳宜川
邱源
许博仁
Current Assignee
Shanghai aerospace computer technology research institute
East China University of Science and Technology
Shanghai Academy of Spaceflight Technology SAST
Original Assignee
Shanghai aerospace computer technology research institute
East China University of Science and Technology
Application filed by Shanghai aerospace computer technology research institute and East China University of Science and Technology
Priority to CN202010698129.XA
Publication of CN112015692A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/167 Interprocessor communication using a common memory, e.g. mailbox
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides an inter-core communication optimization method for Simulink-oriented automatic generation of multithreaded code, which adopts a technique combining static analysis and dynamic simulation to allocate the communication buffers effectively, further reducing synchronization cost and improving processor utilization. FPGA simulation is introduced into the inter-processor buffer allocation flow. In addition, under a fixed memory overhead, the invention allocates an appropriate number of entries to different communication buffers so as to minimize synchronization waiting time and thread-switching time. The optimization method is introduced into the Simulink-based code generation flow and, combined with communication pipelining, reduces communication overhead and improves system performance.

Description

Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes
Technical Field
The invention relates to the technical field of communication, and in particular to an inter-core communication optimization method for Simulink-oriented automatic generation of multithreaded code.
Background
Communication frequency is increasing with the growing complexity of emerging embedded applications and the growing number of processors in multiprocessor SoC architectures. Software development for multiprocessor systems-on-chip (MPSoC) involves a great deal of work, such as adjusting communication between concurrent threads and avoiding deadlocks, manually adapting code to different types of processors and communication protocols, and distributing code and data among the processors. Automated techniques can help designers address these difficulties and find a satisfactory solution.
The Simulink system model is a widely used model for automatic code generation and captures two architectures: the software thread architecture and the hardware architecture. A Simulink system model is specified as a three-layer hierarchy: a system layer, a subsystem layer and a thread layer. The system layer describes the system architecture consisting of CPU subsystems and inter-subsystem communication channels. The subsystem layer describes a CPU subsystem architecture that comprises a set of threads and the intra-subsystem communication channels between them. The thread layer describes a software thread consisting of Simulink blocks and the links between them.
A Simulink link is a one-to-many link that connects an output port of one block to one or more input ports of other blocks. The three-layer model of the Simulink system model helps the code generator to clearly distinguish different types of Simulink links, thereby supporting inter-core communication optimization techniques.
In the automatic generation of Simulink-based code, the difference in task mapping and complexity between different processors results in a difference in computation duration. All of this can block some threads that are sending or waiting for messages and cause frequent thread switches.
In general, buffers are used to bridge the rate gap between message-producing ports and message-consuming ports. A communication buffer is a buffer with multiple entries between the send blocks of one thread and the corresponding receive blocks of another; it allows messages to be buffered across multiple cycles, as in the sketch below.
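As a reading aid, the following is a minimal sketch of such a multi-entry communication buffer as it might appear in generated C thread code. The names (chan_t, chan_send, chan_recv, thread_yield) and the cooperative blocking-by-yield behavior are assumptions for this illustration, not the patent's generated API.

/* A communication buffer with N_ENTRIES entries between the send block
 * of one thread and the receive block of another. Hypothetical sketch;
 * a cooperative scheduler on one CPU is assumed, so no extra locking. */
#include <stddef.h>

#define N_ENTRIES 4                 /* buffer depth (number of entries) */

typedef struct {
    int    data[N_ENTRIES];         /* one message per entry */
    size_t head, tail, count;       /* ring-buffer state */
} chan_t;

extern void thread_yield(void);     /* provided by the target scheduler */

/* The sender blocks while the buffer is full. */
static void chan_send(chan_t *c, int msg) {
    while (c->count == N_ENTRIES)
        thread_yield();             /* full: wait for the receiver */
    c->data[c->tail] = msg;
    c->tail = (c->tail + 1) % N_ENTRIES;
    c->count++;
}

/* The receiver blocks while the buffer is empty. */
static int chan_recv(chan_t *c) {
    while (c->count == 0)
        thread_yield();             /* empty: wait for the sender */
    int msg = c->data[c->head];
    c->head = (c->head + 1) % N_ENTRIES;
    c->count--;
    return msg;
}

With N_ENTRIES = 1 every send tends to force a switch to the receiver; with a deeper buffer a producer can run several iterations before yielding.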
The main problem for current automatic code generation techniques is how to avoid synchronization waiting and parallelize the computational operations of different processors as much as possible. To achieve good scalability as the number of processors grows, an appropriate number of threads that can run in parallel must be specified so that all processors can be fully utilized. Fine-grained threads, however, may cause frequent synchronization and longer waiting times on the processors. To increase processor utilization, a way must be found to keep more threads working simultaneously.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide a method for optimizing inter-core communication for Simulink-oriented automatic generation of multithreaded code, which is used to solve the problems in the prior art.
In order to achieve the above objects and other related objects, the present invention provides a Simulink-oriented method for optimizing inter-core communication for automatically generating multi-threaded codes, comprising the steps of:
setting the buffer depth of all communication vectors in the communication vector set C to be 1;
for a certain thread t in the thread set T, when the minimum buffer depth related to t is increased by a unit amount, calculating the memory usage M(t) of the thread, and setting a temporary thread set T' = ∅ before the calculation;
judging whether the memory usage M(t) of the thread exceeds the available memory amount M_avl; if M(t) exceeds M_avl, switching directly to the next thread, and terminating when T' is empty; if M(t) does not exceed M_avl, calculating the total thread switching time S(t) and letting T' = T' ∪ {t}; when T' is not empty, selecting the thread with the smallest M(t)·S(t);
allocating a communication buffer entry to the selected thread's communication vectors so that the minimum buffer depth increases by 1, and then judging whether all threads have been processed; if not, switching to the next thread; and if so, finishing the allocation optimization.
Optionally, before the intra-processor iterative optimization is performed, determining the buffer depth of each inter-processor communication vector according to its maximum buffer usage;
allocating the intra-processor buffers;
performing inter-processor buffer analysis;
introducing FPGA simulation into the inter-processor buffer allocation flow, and judging whether the refinement of the processors is finished; if the refinement is finished, updating the inter-processor buffers; if the refinement is not finished, judging whether the iteration is finished;
if the iteration is not finished, refining the processors again; if the iteration is finished, the inter-processor buffer allocation is finished.
Optionally, when performing the FPGA simulation, the method further comprises inserting software traces into the hardware-dependent software to monitor the states of the processors, threads and buffers.
Optionally, in the buffer optimization process, if the target throughput is reached, or the maximum number of iterations is reached, or the number of consecutive rejections exceeds a certain threshold, the iteration is terminated.
Optionally, the method further comprises:
giving a graph G = (T, C), where T is a set of threads and C is a set of communication vectors between different threads; if each thread in the graph can reach every other thread through a directed path, G is called a Thread Strongly Connected Component (TSCC);
giving a graph G = (B, R), where B is the set of blocks and R is the set of dependencies between different blocks in the thread; if each block in the graph can reach every other block through a directed path, G is called a Block Strongly Connected Component (BSCC);
setting, according to the TSCC and BSCC, a communication buffer suited to task topologies with circular dependencies.
As described above, the invention provides an inter-core communication optimization method for Simulink-oriented automatic generation of multithreaded code, with the following beneficial effects:

The invention initializes the buffer depth of all communication vectors in the communication vector set C to 1. For a thread t in the thread set T, when the minimum buffer depth related to t is increased by a unit amount, the memory usage M(t) of the thread is calculated, a temporary thread set T' = ∅ having been set before the calculation. Whether M(t) exceeds the corresponding available memory amount M_avl is then judged: if M(t) exceeds M_avl, the next thread is switched to directly, and the procedure terminates when T' is empty; if M(t) does not exceed M_avl, the total thread switching time S(t) is calculated, T' = T' ∪ {t} is set, and when T' is not empty the thread with the smallest M(t)·S(t) is selected. A buffer entry is allocated to that thread's communication vectors so that the minimum buffer depth increases by 1, and whether all threads have been processed is judged; if not, the next thread is switched to; if so, the allocation optimization is finished.

The invention adopts a technique combining static analysis and dynamic simulation to allocate the communication buffers effectively, further reducing synchronization cost and improving processor utilization. In the inter-processor buffer allocation flow, the solution space is explored iteratively, and a simulated annealing algorithm is used to select a locally optimal solution. To obtain simulation results quickly and accurately, FPGA simulation is introduced into the inter-processor buffer allocation flow. In addition, under a fixed memory overhead, the invention allocates an appropriate number of entries to different communication buffers so as to minimize synchronization waiting time and thread switching time. The invention introduces the optimization method into the Simulink-based code generation flow and, combined with communication pipelining, reduces communication overhead and improves system performance; moreover, in allocating the communication buffers, the combination of static analysis and dynamic simulation further reduces synchronization cost and improves processor utilization.
Drawings
FIG. 1(a) is a diagram of a DMA model according to an embodiment;
FIG. 1(b) is a diagram of a shared-memory model according to an embodiment, the counterpart of the DMA model of FIG. 1(a);
FIG. 1(c) is a thread code diagram of FIGS. 1(a) and 1(b);
FIG. 2 is a schematic diagram illustrating an allocation procedure of a communication buffer in a processor according to an embodiment;
FIG. 3 is a pseudo code diagram of the communication buffer algorithm of FIG. 2;
FIG. 4 is a flow diagram illustrating an allocation process of a communication buffer between processors according to an embodiment;
FIG. 5 is a schematic diagram illustrating an embodiment of an inter-core communication buffer allocation based on FPGA simulation;
FIG. 6(a) is a schematic view of a thread loop model according to an embodiment;
FIG. 6(b) is a thread code diagram provided in accordance with an embodiment;
FIG. 6(c) is a schematic diagram illustrating an execution sequence of a CPU with two entries allocated to a cache according to an embodiment;
FIG. 7(a) is a schematic view of a thread loop model according to another embodiment;
FIG. 7(b) is a thread code diagram for an implementation with k = 2;
FIG. 7(c) is a schematic view of the processing result of FIG. 7(b).
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Referring to FIGS. 1 to 7, the present invention provides a Simulink-oriented inter-core communication optimization method for automatically generating multithreaded code, comprising the following steps:
setting the buffer depth of all communication vectors in the communication vector set C to be 1;
for a certain thread t in the thread set T, when the minimum buffer depth related to t is increased by a unit amount, calculating the memory usage M(t) of the thread, and setting a temporary thread set T' = ∅ before the calculation;
judging whether the memory usage M(t) of the thread exceeds the corresponding available memory amount M_avl; if M(t) exceeds M_avl, switching directly to the next thread, and terminating when T' is empty; if M(t) does not exceed M_avl, calculating the total thread switching time S(t) and letting T' = T' ∪ {t}; when T' is not empty, selecting the thread with the smallest M(t)·S(t);
allocating a communication buffer entry to the selected thread's communication vectors so that the minimum buffer depth increases by 1, and judging whether all threads have been processed; if not, switching to the next thread; and if so, finishing the allocation optimization.
The invention introduces the optimization method into the Simulink-based code generation flow and, combined with communication pipelining, reduces communication overhead and improves system performance. In allocating the communication buffers, a technique combining static analysis and dynamic simulation is adopted to further reduce synchronization cost and improve processor utilization.
The models in FIG. 1 are used to illustrate the ideas of inter-processor and intra-processor communication buffers, respectively. FIG. 1(c) shows the thread code, which is the same for both models.
For the model in FIG. 1(a), F0 and F1 may execute in parallel. If F0 and F1 have fixed execution times and F0's execution time is shorter than F1's, then S0's sending rate is always higher than R0's receiving rate. The buffer will therefore be full after a certain period of time, from which point on the system works as if the buffer did not exist. Likewise, if F0's execution time is longer than F1's, the buffer will always be empty. Therefore, if the communication rates were constant, the inter-processor communication buffer would not be very useful. Unfortunately, the actual rates are not always constant: different stimuli trigger different branches and hence different thread-scheduling results, so the rates depend strongly on the stimuli. The inter-processor communication buffer can adapt to these dynamic communication rates and reduce synchronization waiting time.
For the model in FIG. 1(b), when the communication buffer is allocated one entry, the average number of thread switches performed in one iteration is 2; when it is allocated two entries, the average number falls to 1. It can be concluded that with a communication buffer of N entries, the thread-switch time can be reduced to 1/N: a thread can run N iterations back to back before it must switch, so with N = 4, for example, one switch is amortized over four iterations.
Based on the above discussion, it can be concluded that: deeper buffers may reduce synchronization latency and thread switch time. Ideally, the communication buffer depth is greater than its total number of iterations. However, this ideal case is not achievable since the available memory size for a given hardware platform is not infinite. The present invention therefore provides a technique for allocating an appropriate number of entries to different communication buffers at a fixed memory overhead to minimize synchronization latency and thread switching time.
Since global memory is always much larger than local memory, the depth of the inter-processor communication buffers is assumed to be infinite during this first, intra-processor step. The inter-processor communication buffers are then allocated based on the result of the first step.
For intra-processor buffer allocation, the optimization goal is to reduce thread-switching time. To explain the working principle of intra-processor buffer allocation, the following basic symbols are defined:
- T is the thread set;
- C is the set of communication vectors;
- C_t is the set of intra-processor communication vectors associated with a thread t, where t ∈ T;
- depth_c is the buffer depth of communication vector c, where c ∈ C;
- size_c is the per-entry buffer size of communication vector c, where c ∈ C; for a merged communication vector, its size is the sum of the sizes of the original vectors;
- S is the sum of the average thread-switch times of all threads t in the thread set T during one iteration of execution.
A flow chart of the intra-processor buffer entry allocation algorithm (labeled Algorithm 1) is shown in FIG. 2, and its pseudo code in FIG. 3. The algorithm initially allocates one entry to each buffer. Since a thread may be associated with many communication vectors, let N be the minimum depth of the communication buffers associated with it; the thread can then run up to N iterations in succession before it must switch, so its average thread-switch time during one iteration is 1/N. Initially, therefore, the average switch time of every thread is 1, and their sum is |T|, the number of threads. The algorithm then starts allocating entries to communication vectors until the memory is exhausted. First, for each thread t, it calculates the memory usage M(t) that would result if the minimum buffer depth associated with t were increased by 1. If that usage does not exceed the available memory M_avl, the total thread-switch time S(t) is also calculated based on the previous allocation. Among the threads whose M(t) does not exceed M_avl, the algorithm selects the one with the smallest M(t)·S(t), allocates a communication buffer entry to its communication vectors so that the minimum buffer depth increases by 1, and records the current result as the basis for the next allocation.
Given the amount of available memory M_avl, the number of entries that can be allocated to the communication vectors is finite. Therefore, after a number of iterations, increasing the minimum depth of any thread's communication vectors by one would cause the total allocated memory to exceed M_avl. At that point no communication vector can be allocated more entries, so T' is empty and the algorithm terminates. A compact sketch of this loop follows.
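What follows is a minimal, hypothetical C sketch of this greedy loop. The data layout (owner[], entry_sz[]), the choice to bump every vector currently at the selected thread's minimum depth, and the exact cost model for M(t) and S(t) are illustrative assumptions; the authoritative pseudo code is that of FIG. 3.

#include <stdio.h>

#define NT 3                        /* number of threads (example) */
#define NC 4                        /* number of communication vectors */

static int    owner[NC]    = {0, 0, 1, 2};       /* owning thread per vector */
static size_t entry_sz[NC] = {64, 32, 128, 64};  /* per-entry size */
static int    depth[NC]    = {1, 1, 1, 1};       /* every buffer starts at 1 */
static size_t M_avl        = 1024;               /* available memory */

static size_t mem_total(void) {
    size_t m = 0;
    for (int c = 0; c < NC; c++) m += depth[c] * entry_sz[c];
    return m;
}

static int min_depth_of(int t) {                 /* N for thread t */
    int md = -1;
    for (int c = 0; c < NC; c++)
        if (owner[c] == t && (md < 0 || depth[c] < md)) md = depth[c];
    return md;
}

/* M(t): memory usage if every minimum-depth vector of t gains one entry. */
static size_t mem_if_bumped(int t) {
    int md = min_depth_of(t);
    size_t m = mem_total();
    for (int c = 0; c < NC; c++)
        if (owner[c] == t && depth[c] == md) m += entry_sz[c];
    return m;
}

/* S(t): sum of average switch times 1/N over all threads, with thread t
 * evaluated at its increased minimum depth. */
static double switch_time_if_bumped(int t) {
    double s = 0.0;
    for (int u = 0; u < NT; u++)
        s += 1.0 / (min_depth_of(u) + (u == t ? 1 : 0));
    return s;
}

int main(void) {
    for (;;) {
        int best = -1; double best_cost = 0.0;
        for (int t = 0; t < NT; t++) {           /* build T' and pick min */
            size_t m = mem_if_bumped(t);
            if (m > M_avl) continue;             /* over budget: not in T' */
            double cost = (double)m * switch_time_if_bumped(t);
            if (best < 0 || cost < best_cost) { best = t; best_cost = cost; }
        }
        if (best < 0) break;                     /* T' empty: terminate */
        int md = min_depth_of(best);             /* raise min depth by 1 */
        for (int c = 0; c < NC; c++)
            if (owner[c] == best && depth[c] == md) depth[c]++;
    }
    for (int c = 0; c < NC; c++)
        printf("vector %d: %d entries\n", c, depth[c]);
    return 0;
}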
The proposed inter-processor buffer allocation procedure is presented in FIG. 4; it uses a simulated annealing algorithm in its iterations to select a locally optimal solution. To obtain simulation results quickly and accurately, FPGA simulation is introduced into the inter-processor buffer allocation flow.
Before the iteration starts, a static method is employed to calculate the depths of the inter-processor buffers (step 1 in FIG. 4). In this step, the maximum buffer usage of each inter-processor communication vector is obtained. To analyze buffer usage without executing the application, each block is first assumed to have a fixed execution time. Combined with the given scheduling policy, the execution order on each processor can then be determined without running the application, and the buffer depth of each inter-processor communication vector is set from its maximum buffer usage, as in the sketch below.
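A sketch of this static analysis under assumed inputs follows: a fixed send/receive event order, derived from the fixed block execution times plus the scheduling policy, is replayed and each vector's peak occupancy is recorded. The event list and names are hypothetical.

#include <stdio.h>

enum { SEND, RECV };
typedef struct { int kind; int vec; } event_t;

#define NV 2                        /* inter-processor vectors (example) */

int main(void) {
    /* Hypothetical statically derived event order. */
    event_t order[] = { {SEND,0},{SEND,0},{RECV,0},{SEND,1},{RECV,1},{RECV,0} };
    int occ[NV] = {0}, peak[NV] = {0};
    for (size_t i = 0; i < sizeof order / sizeof order[0]; i++) {
        int v = order[i].vec;
        occ[v] += (order[i].kind == SEND) ? 1 : -1;   /* track occupancy */
        if (occ[v] > peak[v]) peak[v] = occ[v];       /* record the maximum */
    }
    for (int v = 0; v < NV; v++)                      /* step 1 of FIG. 4 */
        printf("vector %d: initial depth = %d\n", v, peak[v]);
    return 0;
}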
The FPGA simulation process then begins. If the throughput improves, the buffer optimization is accepted; otherwise it is accepted with probability e^((T_f - T_c)/(T_f - T_t)), where T_f and T_c are the throughputs before and after the buffer optimization and T_t is the target throughput defined by the designer. If the buffer optimization is accepted, the buffer allocation declaration is updated.
The FPGA-based simulation flow is shown in FIG. 5. Software traces are inserted into the HdS (hardware-dependent software) to monitor the states of the processors, threads and buffers. A thread is blocked if it attempts to send a message to a full buffer or to receive a message from an empty buffer (referred to as an artificial block and a true block, respectively). True blocks occur regardless of buffer depth, whereas artificial blocks can be eliminated by allocating more buffer entries. The total duration of the artificial blocks of each communication vector, and the total throughput, can be derived from the software traces; a hypothetical trace record is sketched below.
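What such a trace record could look like is sketched next; the struct layout and the logging helper are assumptions, not the patent's actual HdS instrumentation.

#include <stdint.h>

typedef enum { EV_RUN, EV_ART_BLOCK, EV_TRUE_BLOCK, EV_SWITCH } ev_t;

typedef struct {
    uint32_t cycle;                 /* FPGA cycle counter at the event */
    uint8_t  cpu;                   /* processor id */
    uint8_t  thread;                /* thread id */
    uint8_t  vec;                   /* communication vector (block events) */
    uint8_t  kind;                  /* one of ev_t */
} trace_rec_t;

static trace_rec_t trace_buf[4096]; /* drained by the host after simulation */
static uint32_t    trace_n;

static inline void trace_emit(uint32_t cyc, uint8_t cpu, uint8_t th,
                              uint8_t vec, ev_t kind) {
    if (trace_n < 4096)
        trace_buf[trace_n++] = (trace_rec_t){cyc, cpu, th, vec, (uint8_t)kind};
}

Summing the cycles between an EV_ART_BLOCK and the next EV_RUN of the same thread gives the artificial-block duration per communication vector.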
In the buffer optimization process, more buffer entries are then allocated to the communication vectors whose artificial blocks last longest. The iteration terminates if the target throughput is reached, the maximum number of iterations is reached, or the number of consecutive rejections exceeds a certain threshold. A sketch of this accept/reallocate step follows.
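A minimal sketch of the accept/reallocate step, driven by the simulation results, is given below; sim_result_t, its fields and the reallocation heuristic are assumptions for illustration, with the acceptance probability taken from the formula above.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double throughput;              /* measured by the FPGA simulation */
    double art_block[8];            /* artificial-block duration per vector */
} sim_result_t;

/* Accept an improvement outright; accept a degradation with probability
 * e^((Tf - Tc)/(Tf - Tt)), the simulated-annealing rule of FIG. 4. */
static int accept(double Tf, double Tc, double Tt) {
    if (Tc >= Tf) return 1;
    return (double)rand() / RAND_MAX < exp((Tf - Tc) / (Tf - Tt));
}

/* Next candidate: one more entry for the vector whose artificial blocks
 * lasted longest. */
static int worst_vector(const sim_result_t *r, int nv) {
    int w = 0;
    for (int v = 1; v < nv; v++)
        if (r->art_block[v] > r->art_block[w]) w = v;
    return w;
}

int main(void) {
    sim_result_t r = { 0.9, {1.5, 0.2, 3.7} };
    double Tf = 1.0, Tt = 1.2;      /* previous and target throughput */
    if (accept(Tf, r.throughput, Tt))
        printf("accepted: bump vector %d\n", worst_vector(&r, 3));
    else
        printf("rejected: keep the previous allocation\n");
    return 0;
}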
The communication buffer allocation method described above can reduce synchronization waiting time and thread-switching time, but it is not applicable to task topologies with circular dependencies. This situation is illustrated by the models in FIGS. 6(a) and 6(b), where T0 and T1 form a dependency loop: the number of available communication buffer entries never exceeds one, even if multiple entries are allocated to the communication buffer. FIG. 6(c) shows the execution sequence on CPU0, including the thread switches. During execution only one entry is ever used, although each buffer can be allocated multiple entries. The situation is similar if the threads in the dependency loop are mapped to different processors.
In a task topology with a circular dependency, the communication buffer blocks because at most one entry is available even if multiple entries are allocated; a deeper communication buffer only benefits such a topology when more than one entry can actually be used.
Given a graph G = (T, C), where T is a set of threads and C is the set of communication vectors between different threads: if each thread in the graph can reach every other thread through a directed path, G is called a Thread Strongly Connected Component (TSCC).
Given a graph G = (B, R), where B is the set of blocks and R is the set of dependencies between different blocks in the thread: if each block in the graph can reach every other block through a directed path, G is called a Block Strongly Connected Component (BSCC). Both notions are strongly connected components and can be found with a standard algorithm, as sketched below.
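Identifying TSCCs and BSCCs is the classical strongly-connected-components problem; the following minimal Tarjan sketch in C is a reading aid, with a made-up graph. Under the definitions above, only components containing more than one vertex (or a self-dependency) act as a TSCC or BSCC.

#include <stdio.h>

#define NV 4                        /* vertices: threads or blocks */
#define NE 4                        /* edges: communication vectors or deps */

static const int esrc[NE] = {0, 1, 2, 0};   /* hypothetical edge list */
static const int edst[NE] = {1, 2, 0, 3};

static int idx[NV], low[NV], onstk[NV], stk[NV], sp, counter, comp[NV], ncomp;

static void strongconnect(int v) {
    idx[v] = low[v] = ++counter;
    stk[sp++] = v; onstk[v] = 1;
    for (int e = 0; e < NE; e++) {
        if (esrc[e] != v) continue;
        int w = edst[e];
        if (!idx[w]) {                       /* tree edge: recurse */
            strongconnect(w);
            if (low[w] < low[v]) low[v] = low[w];
        } else if (onstk[w] && idx[w] < low[v]) {
            low[v] = idx[w];                 /* back edge into the stack */
        }
    }
    if (low[v] == idx[v]) {                  /* v is the root of an SCC */
        int w;
        do { w = stk[--sp]; onstk[w] = 0; comp[w] = ncomp; } while (w != v);
        ncomp++;
    }
}

int main(void) {
    for (int v = 0; v < NV; v++)
        if (!idx[v]) strongconnect(v);
    for (int v = 0; v < NV; v++)             /* {0,1,2} form one SCC here */
        printf("vertex %d -> SCC %d\n", v, comp[v]);
    return 0;
}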
taking the model of FIG. 7(a) and the corresponding code of FIG. 7(b) as an example, assume T0And T1Mapped on the same processor, shown in FIG. 7(c) is the processing result of the code in FIG. 7 (b). The number of delays in this example is k, so F0And S0Can be preprocessed (k-1) times at most, so that S is performed while (1) is being performed0After which k entries are available. If the communication vectorS0->R0And S1->R1Being assigned k entries, the number of thread switches will be reduced to 1/k. The communication buffering technique discussed above is for one block cycle, which may cause the allocation of buffer areas in the BSCC to be unreasonable. Taking the model in FIG. 7(a) as an example, assume T0、T1And T2Mapping to the same processor and applying BC first2(F0->S0->R0->F1->S1->R1->F2->F3->S2->R2->F4->S3->R3->F5). Due to BC2In which there are three delay periods, V0(S0->R0)、V1(S1->R1)、V2(S2->R2) And V3(S3->R3) Three buffer entries are allocated. Then apply BC1,BC1V in0And V3Is allocated three entries, however due to BC1Only two delay periods, and therefore, an improper entry assignment occurs.
In view of the above, the present invention modifies the previously described intra-processor buffer allocation algorithm (Algorithm 1) so that a reasonable number of entries is allocated for task topologies with circular dependencies. 1) A maximum depth is set for each communication vector: the number of allocated entries must not exceed the maximum depth, and for all communication vectors in a BSCC the maximum depth equals the minimum number of delay periods over all block cycles of the BSCC. 2) Thread-switch time in a TSCC: the average number of thread switches depends on the minimum depth of all buffers associated with the TSCC, not on the minimum depth of the buffers associated with an individual thread; therefore, if the minimum depth of the buffers associated with a thread in the TSCC is greater than the minimum depth over all of the TSCC's buffers, that thread cannot be selected.
For the above reasons, a constraint on the communication vectors in a BSCC must be added before the do-while loop of the previously proposed intra-processor buffer entry allocation algorithm (Algorithm 1): for every communication vector c in a BSCC, depth_c must not exceed MaxDepth_c, the minimum number of delay periods over all block cycles of the BSCC.
For threads in a TSCC, rows 6 and 7 of Algorithm 1 are modified so that the minimum buffer depth used to compute M(t) and S(t) is taken over all communication buffers associated with the thread's TSCC rather than over the thread's own buffers, and a thread whose own minimum depth exceeds the TSCC-wide minimum is not selected.
through the modification, the cache region entry allocation algorithm with the circularly dependent task topological graph can be obtained.
Firstly, the invention provides a Simulink-oriented inter-core communication optimization method for automatically generating multithreaded code, which introduces a simulated-annealing-based optimization method into the Simulink-based code generation flow and, combined with communication pipelining, reduces communication overhead and improves system performance. Secondly, the invention adopts a technique combining static analysis and dynamic simulation to allocate the communication buffers effectively, further reducing synchronization cost and improving processor utilization. An inter-processor buffer allocation flow is proposed to search the algorithm's solution space quickly, using a simulated annealing algorithm to select a locally optimal solution; FPGA simulation is introduced into this flow, and whether the refinement of the processors is finished is judged: if the refinement is finished, the inter-processor buffers are updated; if not, whether the iteration is finished is judged. Finally, the invention also allocates an appropriate number of entries to different communication buffers under a fixed memory overhead, so as to minimize synchronization waiting time and thread-switching time.
The invention also provides a communication buffer allocation optimization system for automatic generation of multithreaded code, which comprises:
setting the buffer depth of all communication vectors in the communication vector set C to be 1;
for a thread t in the thread set T, when the minimum buffer depth related to t is increased by a unit amount, calculating the memory usage M(t) of the thread, and setting a temporary thread set T' = ∅ before the calculation;
judging whether the memory usage M(t) of the thread exceeds the corresponding available memory amount M_avl; if M(t) exceeds M_avl, switching directly to the next thread, and terminating when T' is empty; if M(t) does not exceed M_avl, calculating the total thread switching time S(t) and letting T' = T' ∪ {t}; when T' is not empty, selecting the thread with the smallest M(t)·S(t);
allocating an entry to the selected thread's communication vectors so that the minimum buffer depth increases by 1, and judging whether all threads have been processed; if not, switching to the next thread; and if so, finishing the allocation optimization.
The system executes the above method; for the implementation steps and functions, refer to the description of the method, which are not repeated here.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present invention.

Claims (4)

1. A Simulink-oriented inter-core communication optimization method for automatically generating multithreaded code, characterized by comprising the following steps:
setting the buffer depth of all communication vectors in the communication vector set C to be 1;
for a certain thread t in the thread set T, when the minimum buffer depth related to t is increased by a unit amount, calculating the memory usage M(t) of the thread, and setting a temporary thread set T' = ∅ before the calculation;
determining whether the memory usage M(t) of the thread exceeds the available memory amount M_avl; if M(t) is greater than M_avl, switching directly to the next thread, and terminating when T' is empty; if M(t) does not exceed M_avl, calculating the total thread switching time S(t) and letting T' = T' ∪ {t}; when T' is not empty, selecting the thread with the smallest M(t)·S(t);
allocating a communication buffer entry to the selected thread's communication vector so that the minimum buffer depth increases by 1, and judging whether all threads have been processed; if not, switching to the next thread; and if the processing is finished, finishing the allocation optimization.
2. The method of claim 1, further comprising determining a buffer depth of inter-processor communication vectors based on maximum buffer usage prior to performing intra-processor iterative optimization;
allocating the intra-processor buffers;
performing inter-processor buffer analysis;
introducing FPGA simulation into the inter-processor buffer allocation flow, and judging whether the refinement of the processors is finished; if the refinement is finished, updating the inter-processor buffers; if the refinement is not finished, judging whether the iteration is finished;
if the iteration is not finished, refining the processors again; if the iteration is finished, the inter-processor buffer allocation is finished.
3. The method of claim 2, further comprising, during the FPGA simulation, inserting software traces into the hardware-dependent software to monitor the states of the processors, threads and buffers;
in the buffer optimization process, the iteration terminates if the target throughput is reached, the maximum number of iterations is reached, or the number of consecutive rejections exceeds a certain threshold.
4. The method of optimizing allocation of communication buffers for automated multi-threaded code generation according to any of claims 1 to 3, further comprising:
giving a graph G = (T, C), where T is a set of threads and C is a set of communication vectors between different threads; if each thread in the graph can reach every other thread through a directed path, G is called a thread strongly connected component, denoted TSCC;
giving a graph G = (B, R), where B is the set of blocks and R is the set of dependencies between different blocks in the thread; if each block in the graph can reach every other block through a directed path, G is called a block strongly connected component, denoted BSCC;
setting, according to the TSCC and BSCC, a communication buffer suited to task topologies with circular dependencies.
CN202010698129.XA 2020-07-21 2020-07-21 Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes Pending CN112015692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010698129.XA CN112015692A (en) 2020-07-21 2020-07-21 Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010698129.XA CN112015692A (en) 2020-07-21 2020-07-21 Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes

Publications (1)

Publication Number Publication Date
CN112015692A (published 2020-12-01)

Family

ID=73498844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010698129.XA Pending CN112015692A (en) 2020-07-21 2020-07-21 Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes

Country Status (1)

Country Link
CN (1) CN112015692A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528583A (en) * 2020-12-18 2021-03-19 广东高云半导体科技股份有限公司 Multithreading comprehensive method and comprehensive system for FPGA development

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003052586A2 (en) * 2001-12-14 2003-06-26 Koninklijke Philips Electronics N.V. Data processing system having multiple processors
CN106980595A (en) * 2014-12-05 2017-07-25 三星半导体(中国)研究开发有限公司 The multiprocessor communication system and its communication means of shared physical memory

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003052586A2 (en) * 2001-12-14 2003-06-26 Koninklijke Philips Electronics N.V. Data processing system having multiple processors
CN106980595A (en) * 2014-12-05 2017-07-25 三星半导体(中国)研究开发有限公司 The multiprocessor communication system and its communication means of shared physical memory

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BILEL BELHADJ MOHAMED: "Intra- and Inter-Processors Memory Size Estimation for Multithreaded MPSoC", 2006 13th IEEE International Conference on Electronics, Circuits and Systems *
NAN WANG: "Interconnection Allocation Between Functional Units and Registers in High-Level", IEEE Transactions on Very Large Scale Integration (VLSI) Systems *
YU Min (余慜): "Research on fine-grained multithreading techniques based on the Simulink model", China Doctoral Dissertations Full-text Database, Information Science and Technology *
MA Mingli (马明理) et al.: "Efficient memory allocation techniques for multithreaded environments", Computer Measurement & Control *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528583A (en) * 2020-12-18 2021-03-19 广东高云半导体科技股份有限公司 Multithreading comprehensive method and comprehensive system for FPGA development
CN112528583B (en) * 2020-12-18 2022-04-01 广东高云半导体科技股份有限公司 Multithreading comprehensive method and comprehensive system for FPGA development

Similar Documents

Publication Publication Date Title
Gupta et al. Program implementation schemes for hardware-software systems
US6493863B1 (en) Method of designing semiconductor integrated circuit
Schliecker et al. Bounding the shared resource load for the performance analysis of multiprocessor systems
CN105205205A (en) Method for FPGA coarse-grained parallel wiring based on optimal division of netlist position information
Chandy et al. An evaluation of parallel simulated annealing strategies with application to standard cell placement
CN111274016B (en) Application partitioning and scheduling method of dynamic partial reconfigurable system based on module fusion
Bambha et al. Intermediate representations for design automation of multiprocessor DSP systems
Lee et al. Mapping and scheduling of tasks and communications on many-core SoC under local memory constraint
CN113822004A (en) Verification method and system for integrated circuit simulation acceleration and simulation
CN109558226B (en) DSP multi-core parallel computing scheduling method based on inter-core interruption
CN112015692A (en) Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes
Wolf et al. An analysis of fault partitioned parallel test generation
Saleem et al. A Survey on Dynamic Application Mapping Approaches for Real-Time Network-on-Chip-Based Platforms
Stitt et al. Thread warping: a framework for dynamic synthesis of thread accelerators
Kwok Parallel program execution on a heterogeneous PC cluster using task duplication
Uddin et al. Cache-based high-level simulation of microthreaded many-core architectures
Li et al. DBEFT: a dependency-ratio bundling earliest finish time algorithm for heterogeneous computing
Akash Kumar Heuristic for accelerating run-time task mapping in NoC-based heterogeneous MPSoCs
Pessoa et al. Parallel TLM simulation of MPSoC on SMP workstations: Influence of communication locality
Ma et al. Reducing code size in scheduling synchronous dataflow graphs on multicore systems
Kim et al. Optimization of multi-core accelerator performance based on accurate performance estimation
De Munck et al. Design and performance evaluation of a conservative parallel discrete event core for GES
Khandelia et al. Contention-conscious transaction ordering in multiprocessor dsp systems
Arndt et al. Portable implementation of advanced driver-assistance algorithms on heterogeneous architectures
Wen et al. Design Exploration of An Energy-Efficient Acceleration System for CNNs on Low-Cost Resource-Constraint SoC-FPGAs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201201