CN112015692A - Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes - Google Patents



Publication number
CN112015692A
Authority
CN
China
Prior art keywords
thread
communication
processor
buffer
cache region
Legal status: Pending
Application number
CN202010698129.XA
Other languages
Chinese (zh)
Inventor
汪楠
柳宜川
邱源
许博仁
Current Assignee
Shanghai aerospace computer technology research institute
East China University of Science and Technology
Shanghai Academy of Spaceflight Technology SAST
Original Assignee
Shanghai aerospace computer technology research institute
East China University of Science and Technology
Application filed by Shanghai aerospace computer technology research institute and East China University of Science and Technology
Priority to CN202010698129.XA
Publication of CN112015692A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/167 Interprocessor communication using a common memory, e.g. mailbox
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides an inter-core communication optimization method for Simulink-oriented automatic generation of multithreaded code, which adopts a technique combining static analysis and dynamic simulation to allocate the communication buffers effectively, further reducing synchronization cost and improving processor utilization. FPGA simulation is introduced into the inter-processor buffer allocation flow. In addition, under a fixed memory overhead, the invention allocates an appropriate number of entries to different communication buffers so as to minimize synchronization waiting time and thread-switching time. The optimization method is introduced into the Simulink-based code generation flow and, combined with communication pipelining, reduces communication overhead and improves system performance.

Description

Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes
Technical Field
The invention relates to the technical field of communication, and in particular to an inter-core communication optimization method for Simulink-oriented automatic generation of multithreaded code.
Background
Communication frequency is increasing with the growing complexity of emerging embedded applications and the growing number of processors in multiprocessor SoC architectures. Software development for multiprocessor systems-on-chip (MPSoC) involves a great deal of work, such as adjusting communication between concurrent threads and avoiding deadlocks, manually adapting code to different types of processors and communication protocols, and distributing code and data among the processors. Automated techniques can help designers address these difficulties and find a satisfactory solution.
The Simulink system model is a widely used model for automatic code generation and captures two architectures: the software thread architecture and the hardware architecture. A Simulink system model is specified as a three-layer hierarchy: a system layer, a subsystem layer and a thread layer. The system layer describes the system architecture consisting of CPU subsystems and inter-subsystem communication channels. The subsystem layer describes a CPU subsystem architecture that comprises a set of threads and the intra-subsystem communication channels between them. The thread layer describes a software thread consisting of Simulink blocks and the links between them.
A Simulink link is a one-to-many link that connects an output port of one block to one or more input ports of other blocks. The three-layer model of the Simulink system model helps the code generator to clearly distinguish different types of Simulink links, thereby supporting inter-core communication optimization techniques.
In the automatic generation of Simulink-based code, the difference in task mapping and complexity between different processors results in a difference in computation duration. All of this can block some threads that are sending or waiting for messages and cause frequent thread switches.
In general, buffers are used to bridge the rate gap between message-producing ports and message-consuming ports. A communication buffer is a buffer with multiple entries between the send blocks of one thread and the corresponding receive blocks of another; it allows messages to be buffered across multiple cycles, as in the sketch below.
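As a reading aid, the following is a minimal sketch of such a multi-entry communication buffer as it might appear in generated C thread code. The names (chan_t, chan_send, chan_recv, thread_yield) and the cooperative blocking-by-yield behavior are assumptions for this illustration, not the patent's generated API.

/* A communication buffer with N_ENTRIES entries between the send block
 * of one thread and the receive block of another. Hypothetical sketch;
 * a cooperative scheduler on one CPU is assumed, so no extra locking. */
#include <stddef.h>

#define N_ENTRIES 4                 /* buffer depth (number of entries) */

typedef struct {
    int    data[N_ENTRIES];         /* one message per entry */
    size_t head, tail, count;       /* ring-buffer state */
} chan_t;

extern void thread_yield(void);     /* provided by the target scheduler */

/* The sender blocks while the buffer is full. */
static void chan_send(chan_t *c, int msg) {
    while (c->count == N_ENTRIES)
        thread_yield();             /* full: wait for the receiver */
    c->data[c->tail] = msg;
    c->tail = (c->tail + 1) % N_ENTRIES;
    c->count++;
}

/* The receiver blocks while the buffer is empty. */
static int chan_recv(chan_t *c) {
    while (c->count == 0)
        thread_yield();             /* empty: wait for the sender */
    int msg = c->data[c->head];
    c->head = (c->head + 1) % N_ENTRIES;
    c->count--;
    return msg;
}

With N_ENTRIES = 1 every send tends to force a switch to the receiver; with a deeper buffer a producer can run several iterations before yielding.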
The main problem for current automatic code generation techniques is how to avoid synchronization waiting and parallelize the computational operations of different processors as much as possible. To achieve good scalability as the number of processors grows, an appropriate number of threads that can run in parallel must be specified so that all processors can be fully utilized. Fine-grained threads, however, may cause frequent synchronization and longer waiting times on the processors. To increase processor utilization, a way must be found to keep more threads working simultaneously.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide a method for optimizing inter-core communication for Simulink-oriented automatic generation of multithreaded code, which is used to solve the problems in the prior art.
In order to achieve the above objects and other related objects, the present invention provides a Simulink-oriented method for optimizing inter-core communication for automatically generating multi-threaded codes, comprising the steps of:
setting the buffer depth of all communication vectors in the communication vector set C to be 1;
for a certain thread t in the thread set T, when the minimum buffer depth related to t is increased by a unit amount, calculating the memory usage M(t) of the thread, and setting a temporary thread set T' = ∅ before the calculation;
judging whether the memory usage M(t) of the thread exceeds the available memory amount M_avl; if M(t) exceeds M_avl, switching directly to the next thread, and terminating when T' is empty; if M(t) does not exceed M_avl, calculating the total thread switching time S(t) and letting T' = T' ∪ {t}; when T' is not empty, selecting the thread with the smallest M(t)·S(t);
allocating a communication buffer entry to the selected thread's communication vectors so that the minimum buffer depth increases by 1, and then judging whether all threads have been processed; if not, switching to the next thread; and if so, finishing the allocation optimization.
Optionally, before the intra-processor iterative optimization is performed, determining the buffer depth of each inter-processor communication vector according to its maximum buffer usage;
allocating the intra-processor buffers;
performing inter-processor buffer analysis;
introducing FPGA simulation into the inter-processor buffer allocation flow, and judging whether the refinement of the processors is finished; if the refinement is finished, updating the inter-processor buffers; if the refinement is not finished, judging whether the iteration is finished;
if the iteration is not finished, refining the processors again; if the iteration is finished, the inter-processor buffer allocation is finished.
Optionally, when performing the FPGA simulation, the method further comprises inserting software traces into the hardware-dependent software to monitor the states of the processors, threads and buffers.
Optionally, in the buffer optimization process, if the target throughput is reached, or the maximum number of iterations is reached, or the number of consecutive rejections exceeds a certain threshold, the iteration is terminated.
Optionally, the method further comprises:
giving a graph G = (T, C), where T is a set of threads and C is a set of communication vectors between different threads; if each thread in the graph can reach every other thread through a directed path, G is called a Thread Strongly Connected Component (TSCC);
giving a graph G = (B, R), where B is the set of blocks and R is the set of dependencies between different blocks in the thread; if each block in the graph can reach every other block through a directed path, G is called a Block Strongly Connected Component (BSCC);
setting, according to the TSCC and BSCC, a communication buffer suited to task topologies with circular dependencies.
As described above, the invention provides an inter-core communication optimization method for Simulink-oriented automatic generation of multithreaded code, with the following beneficial effects:

The invention initializes the buffer depth of all communication vectors in the communication vector set C to 1. For a thread t in the thread set T, when the minimum buffer depth related to t is increased by a unit amount, the memory usage M(t) of the thread is calculated, a temporary thread set T' = ∅ having been set before the calculation. Whether M(t) exceeds the corresponding available memory amount M_avl is then judged: if M(t) exceeds M_avl, the next thread is switched to directly, and the procedure terminates when T' is empty; if M(t) does not exceed M_avl, the total thread switching time S(t) is calculated, T' = T' ∪ {t} is set, and when T' is not empty the thread with the smallest M(t)·S(t) is selected. A buffer entry is allocated to that thread's communication vectors so that the minimum buffer depth increases by 1, and whether all threads have been processed is judged; if not, the next thread is switched to; if so, the allocation optimization is finished.

The invention adopts a technique combining static analysis and dynamic simulation to allocate the communication buffers effectively, further reducing synchronization cost and improving processor utilization. In the inter-processor buffer allocation flow, the solution space is explored iteratively, and a simulated annealing algorithm is used to select a locally optimal solution. To obtain simulation results quickly and accurately, FPGA simulation is introduced into the inter-processor buffer allocation flow. In addition, under a fixed memory overhead, the invention allocates an appropriate number of entries to different communication buffers so as to minimize synchronization waiting time and thread switching time. The invention introduces the optimization method into the Simulink-based code generation flow and, combined with communication pipelining, reduces communication overhead and improves system performance; moreover, in allocating the communication buffers, the combination of static analysis and dynamic simulation further reduces synchronization cost and improves processor utilization.
Drawings
FIG. 1(a) is a diagram of a DMA model according to an embodiment;
FIG. 1(b) is a diagram of a shared-memory model according to an embodiment, the counterpart of the DMA model of FIG. 1(a);
FIG. 1(c) is a thread code diagram of FIGS. 1(a) and 1(b);
FIG. 2 is a schematic diagram illustrating an allocation procedure of a communication buffer in a processor according to an embodiment;
FIG. 3 is a pseudo code diagram of the communication buffer algorithm of FIG. 2;
FIG. 4 is a flow diagram illustrating an allocation process of a communication buffer between processors according to an embodiment;
FIG. 5 is a schematic diagram illustrating an embodiment of an inter-core communication buffer allocation based on FPGA simulation;
FIG. 6(a) is a schematic view of a thread loop model according to an embodiment;
FIG. 6(b) is a thread code diagram provided in accordance with an embodiment;
FIG. 6(c) is a schematic diagram illustrating an execution sequence of a CPU with two entries allocated to a cache according to an embodiment;
FIG. 7(a) is a schematic view of a thread loop model according to another embodiment;
FIG. 7(b) is a thread code diagram for an implementation with k = 2;
FIG. 7(c) is a schematic view of the processing result of FIG. 7(b).
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Referring to FIGS. 1 to 7, the present invention provides a Simulink-oriented inter-core communication optimization method for automatically generating multithreaded code, comprising the following steps:
setting the buffer depth of all communication vectors in the communication vector set C to be 1;
for a certain thread t in the thread set T, when the minimum buffer depth related to t is increased by a unit amount, calculating the memory usage M(t) of the thread, and setting a temporary thread set T' = ∅ before the calculation;
judging whether the memory usage M(t) of the thread exceeds the corresponding available memory amount M_avl; if M(t) exceeds M_avl, switching directly to the next thread, and terminating when T' is empty; if M(t) does not exceed M_avl, calculating the total thread switching time S(t) and letting T' = T' ∪ {t}; when T' is not empty, selecting the thread with the smallest M(t)·S(t);
allocating a communication buffer entry to the selected thread's communication vectors so that the minimum buffer depth increases by 1, and judging whether all threads have been processed; if not, switching to the next thread; and if so, finishing the allocation optimization.
The invention introduces the optimization method into the Simulink-based code generation flow and, combined with communication pipelining, reduces communication overhead and improves system performance. In allocating the communication buffers, a technique combining static analysis and dynamic simulation is adopted to further reduce synchronization cost and improve processor utilization.
The models in FIG. 1 are used to illustrate the ideas of inter-processor and intra-processor communication buffers, respectively. FIG. 1(c) shows the thread code, which is the same for both models.
For the model in FIG. 1(a), F0 and F1 may execute in parallel. If F0 and F1 have fixed execution times and F0's execution time is shorter than F1's, then S0's sending rate is always higher than R0's receiving rate. The buffer will therefore be full after a certain period of time, from which point on the system works as if the buffer did not exist. Likewise, if F0's execution time is longer than F1's, the buffer will always be empty. Therefore, if the communication rates were constant, the inter-processor communication buffer would not be very useful. Unfortunately, the actual rates are not always constant: different stimuli trigger different branches and hence different thread-scheduling results, so the rates depend strongly on the stimuli. The inter-processor communication buffer can adapt to these dynamic communication rates and reduce synchronization waiting time.
For the model in FIG. 1(b), when the communication buffer is allocated one entry, the average number of thread switches performed in one iteration is 2; when it is allocated two entries, the average number falls to 1. It can be concluded that with a communication buffer of N entries, the thread-switch time can be reduced to 1/N: a thread can run N iterations back to back before it must switch, so with N = 4, for example, one switch is amortized over four iterations.
Based on the above discussion, it can be concluded that: deeper buffers may reduce synchronization latency and thread switch time. Ideally, the communication buffer depth is greater than its total number of iterations. However, this ideal case is not achievable since the available memory size for a given hardware platform is not infinite. The present invention therefore provides a technique for allocating an appropriate number of entries to different communication buffers at a fixed memory overhead to minimize synchronization latency and thread switching time.
Since global memory is always much larger than local memory, the depth of the inter-processor communication buffers is assumed to be infinite during this first, intra-processor step. The inter-processor communication buffers are then allocated based on the result of the first step.
For intra-processor buffer allocation, the optimization goal is to reduce thread-switching time. To explain the working principle of intra-processor buffer allocation, the following basic symbols are defined:
- T is the thread set;
- C is the set of communication vectors;
- C_t is the set of intra-processor communication vectors associated with a thread t, where t ∈ T;
- depth_c is the buffer depth of communication vector c, where c ∈ C;
- size_c is the per-entry buffer size of communication vector c, where c ∈ C; for a merged communication vector, its size is the sum of the sizes of the original vectors;
- S is the sum of the average thread-switch times of all threads t in the thread set T during one iteration of execution.
A flow chart of the intra-processor buffer entry allocation algorithm (labeled Algorithm 1) is shown in FIG. 2, and its pseudo code in FIG. 3. The algorithm initially allocates one entry to each buffer. Since a thread may be associated with many communication vectors, let N be the minimum depth of the communication buffers associated with it; the thread can then run up to N iterations in succession before it must switch, so its average thread-switch time during one iteration is 1/N. Initially, therefore, the average switch time of every thread is 1, and their sum is |T|, the number of threads. The algorithm then starts allocating entries to communication vectors until the memory is exhausted. First, for each thread t, it calculates the memory usage M(t) that would result if the minimum buffer depth associated with t were increased by 1. If that usage does not exceed the available memory M_avl, the total thread-switch time S(t) is also calculated based on the previous allocation. Among the threads whose M(t) does not exceed M_avl, the algorithm selects the one with the smallest M(t)·S(t), allocates a communication buffer entry to its communication vectors so that the minimum buffer depth increases by 1, and records the current result as the basis for the next allocation.
Given the amount of available memory M_avl, the number of entries that can be allocated to the communication vectors is finite. Therefore, after a number of iterations, increasing the minimum depth of any thread's communication vectors by one would cause the total allocated memory to exceed M_avl. At that point no communication vector can be allocated more entries, so T' is empty and the algorithm terminates. A compact sketch of this loop follows.
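What follows is a minimal, hypothetical C sketch of this greedy loop. The data layout (owner[], entry_sz[]), the choice to bump every vector currently at the selected thread's minimum depth, and the exact cost model for M(t) and S(t) are illustrative assumptions; the authoritative pseudo code is that of FIG. 3.

#include <stdio.h>

#define NT 3                        /* number of threads (example) */
#define NC 4                        /* number of communication vectors */

static int    owner[NC]    = {0, 0, 1, 2};       /* owning thread per vector */
static size_t entry_sz[NC] = {64, 32, 128, 64};  /* per-entry size */
static int    depth[NC]    = {1, 1, 1, 1};       /* every buffer starts at 1 */
static size_t M_avl        = 1024;               /* available memory */

static size_t mem_total(void) {
    size_t m = 0;
    for (int c = 0; c < NC; c++) m += depth[c] * entry_sz[c];
    return m;
}

static int min_depth_of(int t) {                 /* N for thread t */
    int md = -1;
    for (int c = 0; c < NC; c++)
        if (owner[c] == t && (md < 0 || depth[c] < md)) md = depth[c];
    return md;
}

/* M(t): memory usage if every minimum-depth vector of t gains one entry. */
static size_t mem_if_bumped(int t) {
    int md = min_depth_of(t);
    size_t m = mem_total();
    for (int c = 0; c < NC; c++)
        if (owner[c] == t && depth[c] == md) m += entry_sz[c];
    return m;
}

/* S(t): sum of average switch times 1/N over all threads, with thread t
 * evaluated at its increased minimum depth. */
static double switch_time_if_bumped(int t) {
    double s = 0.0;
    for (int u = 0; u < NT; u++)
        s += 1.0 / (min_depth_of(u) + (u == t ? 1 : 0));
    return s;
}

int main(void) {
    for (;;) {
        int best = -1; double best_cost = 0.0;
        for (int t = 0; t < NT; t++) {           /* build T' and pick min */
            size_t m = mem_if_bumped(t);
            if (m > M_avl) continue;             /* over budget: not in T' */
            double cost = (double)m * switch_time_if_bumped(t);
            if (best < 0 || cost < best_cost) { best = t; best_cost = cost; }
        }
        if (best < 0) break;                     /* T' empty: terminate */
        int md = min_depth_of(best);             /* raise min depth by 1 */
        for (int c = 0; c < NC; c++)
            if (owner[c] == best && depth[c] == md) depth[c]++;
    }
    for (int c = 0; c < NC; c++)
        printf("vector %d: %d entries\n", c, depth[c]);
    return 0;
}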
The proposed inter-processor buffer allocation procedure is presented in FIG. 4; it uses a simulated annealing algorithm in its iterations to select a locally optimal solution. To obtain simulation results quickly and accurately, FPGA simulation is introduced into the inter-processor buffer allocation flow.
Before the iteration starts, a static method is employed to calculate the depths of the inter-processor buffers (step 1 in FIG. 4). In this step, the maximum buffer usage of each inter-processor communication vector is obtained. To analyze buffer usage without executing the application, each block is first assumed to have a fixed execution time. Combined with the given scheduling policy, the execution order on each processor can then be determined without running the application, and the buffer depth of each inter-processor communication vector is set from its maximum buffer usage, as in the sketch below.
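A sketch of this static analysis under assumed inputs follows: a fixed send/receive event order, derived from the fixed block execution times plus the scheduling policy, is replayed and each vector's peak occupancy is recorded. The event list and names are hypothetical.

#include <stdio.h>

enum { SEND, RECV };
typedef struct { int kind; int vec; } event_t;

#define NV 2                        /* inter-processor vectors (example) */

int main(void) {
    /* Hypothetical statically derived event order. */
    event_t order[] = { {SEND,0},{SEND,0},{RECV,0},{SEND,1},{RECV,1},{RECV,0} };
    int occ[NV] = {0}, peak[NV] = {0};
    for (size_t i = 0; i < sizeof order / sizeof order[0]; i++) {
        int v = order[i].vec;
        occ[v] += (order[i].kind == SEND) ? 1 : -1;   /* track occupancy */
        if (occ[v] > peak[v]) peak[v] = occ[v];       /* record the maximum */
    }
    for (int v = 0; v < NV; v++)                      /* step 1 of FIG. 4 */
        printf("vector %d: initial depth = %d\n", v, peak[v]);
    return 0;
}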
The FPGA simulation process then begins. If the throughput improves, the buffer optimization is accepted; otherwise it is accepted with probability e^((T_f - T_c)/(T_f - T_t)), where T_f and T_c are the throughputs before and after the buffer optimization and T_t is the target throughput defined by the designer. If the buffer optimization is accepted, the buffer allocation declaration is updated.
The FPGA-based simulation flow is shown in FIG. 5. Software traces are inserted into the HdS (hardware-dependent software) to monitor the states of the processors, threads and buffers. A thread is blocked if it attempts to send a message to a full buffer or to receive a message from an empty buffer (referred to as an artificial block and a true block, respectively). True blocks occur regardless of buffer depth, whereas artificial blocks can be eliminated by allocating more buffer entries. The total duration of the artificial blocks of each communication vector, and the total throughput, can be derived from the software traces; a hypothetical trace record is sketched below.
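What such a trace record could look like is sketched next; the struct layout and the logging helper are assumptions, not the patent's actual HdS instrumentation.

#include <stdint.h>

typedef enum { EV_RUN, EV_ART_BLOCK, EV_TRUE_BLOCK, EV_SWITCH } ev_t;

typedef struct {
    uint32_t cycle;                 /* FPGA cycle counter at the event */
    uint8_t  cpu;                   /* processor id */
    uint8_t  thread;                /* thread id */
    uint8_t  vec;                   /* communication vector (block events) */
    uint8_t  kind;                  /* one of ev_t */
} trace_rec_t;

static trace_rec_t trace_buf[4096]; /* drained by the host after simulation */
static uint32_t    trace_n;

static inline void trace_emit(uint32_t cyc, uint8_t cpu, uint8_t th,
                              uint8_t vec, ev_t kind) {
    if (trace_n < 4096)
        trace_buf[trace_n++] = (trace_rec_t){cyc, cpu, th, vec, (uint8_t)kind};
}

Summing the cycles between an EV_ART_BLOCK and the next EV_RUN of the same thread gives the artificial-block duration per communication vector.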
In the buffer optimization process, more buffer entries are then allocated to the communication vectors whose artificial blocks last longest. The iteration terminates if the target throughput is reached, the maximum number of iterations is reached, or the number of consecutive rejections exceeds a certain threshold. A sketch of this accept/reallocate step follows.
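A minimal sketch of the accept/reallocate step, driven by the simulation results, is given below; sim_result_t, its fields and the reallocation heuristic are assumptions for illustration, with the acceptance probability taken from the formula above.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double throughput;              /* measured by the FPGA simulation */
    double art_block[8];            /* artificial-block duration per vector */
} sim_result_t;

/* Accept an improvement outright; accept a degradation with probability
 * e^((Tf - Tc)/(Tf - Tt)), the simulated-annealing rule of FIG. 4. */
static int accept(double Tf, double Tc, double Tt) {
    if (Tc >= Tf) return 1;
    return (double)rand() / RAND_MAX < exp((Tf - Tc) / (Tf - Tt));
}

/* Next candidate: one more entry for the vector whose artificial blocks
 * lasted longest. */
static int worst_vector(const sim_result_t *r, int nv) {
    int w = 0;
    for (int v = 1; v < nv; v++)
        if (r->art_block[v] > r->art_block[w]) w = v;
    return w;
}

int main(void) {
    sim_result_t r = { 0.9, {1.5, 0.2, 3.7} };
    double Tf = 1.0, Tt = 1.2;      /* previous and target throughput */
    if (accept(Tf, r.throughput, Tt))
        printf("accepted: bump vector %d\n", worst_vector(&r, 3));
    else
        printf("rejected: keep the previous allocation\n");
    return 0;
}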
The communication buffer allocation method described above can reduce synchronization waiting time and thread-switching time, but it is not applicable to task topologies with circular dependencies. This situation is illustrated by the models in FIGS. 6(a) and 6(b), where T0 and T1 form a dependency loop: the number of available communication buffer entries never exceeds one, even if multiple entries are allocated to the communication buffer. FIG. 6(c) shows the execution sequence on CPU0, including the thread switches. During execution only one entry is ever used, although each buffer can be allocated multiple entries. The situation is similar if the threads in the dependency loop are mapped to different processors.
In a task topology with a circular dependency, the communication buffer blocks because at most one entry is available even if multiple entries are allocated; a deeper communication buffer only benefits such a topology when more than one entry can actually be used.
Given a graph G = (T, C), where T is a set of threads and C is the set of communication vectors between different threads: if each thread in the graph can reach every other thread through a directed path, G is called a Thread Strongly Connected Component (TSCC).
Given a graph G = (B, R), where B is the set of blocks and R is the set of dependencies between different blocks in the thread: if each block in the graph can reach every other block through a directed path, G is called a Block Strongly Connected Component (BSCC). Both notions are strongly connected components and can be found with a standard algorithm, as sketched below.
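Identifying TSCCs and BSCCs is the classical strongly-connected-components problem; the following minimal Tarjan sketch in C is a reading aid, with a made-up graph. Under the definitions above, only components containing more than one vertex (or a self-dependency) act as a TSCC or BSCC.

#include <stdio.h>

#define NV 4                        /* vertices: threads or blocks */
#define NE 4                        /* edges: communication vectors or deps */

static const int esrc[NE] = {0, 1, 2, 0};   /* hypothetical edge list */
static const int edst[NE] = {1, 2, 0, 3};

static int idx[NV], low[NV], onstk[NV], stk[NV], sp, counter, comp[NV], ncomp;

static void strongconnect(int v) {
    idx[v] = low[v] = ++counter;
    stk[sp++] = v; onstk[v] = 1;
    for (int e = 0; e < NE; e++) {
        if (esrc[e] != v) continue;
        int w = edst[e];
        if (!idx[w]) {                       /* tree edge: recurse */
            strongconnect(w);
            if (low[w] < low[v]) low[v] = low[w];
        } else if (onstk[w] && idx[w] < low[v]) {
            low[v] = idx[w];                 /* back edge into the stack */
        }
    }
    if (low[v] == idx[v]) {                  /* v is the root of an SCC */
        int w;
        do { w = stk[--sp]; onstk[w] = 0; comp[w] = ncomp; } while (w != v);
        ncomp++;
    }
}

int main(void) {
    for (int v = 0; v < NV; v++)
        if (!idx[v]) strongconnect(v);
    for (int v = 0; v < NV; v++)             /* {0,1,2} form one SCC here */
        printf("vertex %d -> SCC %d\n", v, comp[v]);
    return 0;
}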
taking the model of FIG. 7(a) and the corresponding code of FIG. 7(b) as an example, assume T0And T1Mapped on the same processor, shown in FIG. 7(c) is the processing result of the code in FIG. 7 (b). The number of delays in this example is k, so F0And S0Can be preprocessed (k-1) times at most, so that S is performed while (1) is being performed0After which k entries are available. If the communication vectorS0->R0And S1->R1Being assigned k entries, the number of thread switches will be reduced to 1/k. The communication buffering technique discussed above is for one block cycle, which may cause the allocation of buffer areas in the BSCC to be unreasonable. Taking the model in FIG. 7(a) as an example, assume T0、T1And T2Mapping to the same processor and applying BC first2(F0->S0->R0->F1->S1->R1->F2->F3->S2->R2->F4->S3->R3->F5). Due to BC2In which there are three delay periods, V0(S0->R0)、V1(S1->R1)、V2(S2->R2) And V3(S3->R3) Three buffer entries are allocated. Then apply BC1,BC1V in0And V3Is allocated three entries, however due to BC1Only two delay periods, and therefore, an improper entry assignment occurs.
In view of the above, the present invention modifies the previously described intra-processor buffer allocation algorithm (Algorithm 1) so that a reasonable number of entries is allocated for task topologies with circular dependencies. 1) A maximum depth is set for each communication vector: the number of allocated entries must not exceed the maximum depth, and for all communication vectors in a BSCC the maximum depth equals the minimum number of delay periods over all block cycles of the BSCC. 2) Thread-switch time in a TSCC: the average number of thread switches depends on the minimum depth of all buffers associated with the TSCC, not on the minimum depth of the buffers associated with an individual thread; therefore, if the minimum depth of the buffers associated with a thread in the TSCC is greater than the minimum depth over all of the TSCC's buffers, that thread cannot be selected.
For the above reasons, a constraint on the communication vectors in a BSCC must be added before the do-while loop of the previously proposed intra-processor buffer entry allocation algorithm (Algorithm 1): for every communication vector c in a BSCC, depth_c must not exceed MaxDepth_c, the minimum number of delay periods over all block cycles of the BSCC.
For threads in a TSCC, rows 6 and 7 of Algorithm 1 are modified so that the minimum buffer depth used to compute M(t) and S(t) is taken over all communication buffers associated with the thread's TSCC rather than over the thread's own buffers, and a thread whose own minimum depth exceeds the TSCC-wide minimum is not selected.
through the modification, the cache region entry allocation algorithm with the circularly dependent task topological graph can be obtained.
Firstly, the invention provides a Simulink-oriented inter-core communication optimization method for automatically generating multithreaded code, which introduces a simulated-annealing-based optimization method into the Simulink-based code generation flow and, combined with communication pipelining, reduces communication overhead and improves system performance. Secondly, the invention adopts a technique combining static analysis and dynamic simulation to allocate the communication buffers effectively, further reducing synchronization cost and improving processor utilization. An inter-processor buffer allocation flow is proposed to search the algorithm's solution space quickly, using a simulated annealing algorithm to select a locally optimal solution; FPGA simulation is introduced into this flow, and whether the refinement of the processors is finished is judged: if the refinement is finished, the inter-processor buffers are updated; if not, whether the iteration is finished is judged. Finally, the invention also allocates an appropriate number of entries to different communication buffers under a fixed memory overhead, so as to minimize synchronization waiting time and thread-switching time.
The invention also provides a communication buffer allocation optimization system for automatic generation of multithreaded code, which comprises:
setting the buffer depth of all communication vectors in the communication vector set C to be 1;
for a thread t in the thread set T, when the minimum buffer depth related to t is increased by a unit amount, calculating the memory usage M(t) of the thread, and setting a temporary thread set T' = ∅ before the calculation;
judging whether the memory usage M(t) of the thread exceeds the corresponding available memory amount M_avl; if M(t) exceeds M_avl, switching directly to the next thread, and terminating when T' is empty; if M(t) does not exceed M_avl, calculating the total thread switching time S(t) and letting T' = T' ∪ {t}; when T' is not empty, selecting the thread with the smallest M(t)·S(t);
allocating an entry to the selected thread's communication vectors so that the minimum buffer depth increases by 1, and judging whether all threads have been processed; if not, switching to the next thread; and if so, finishing the allocation optimization.
The system executes the above method; for the implementation steps and functions, refer to the description of the method, which are not repeated here.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present invention.

Claims (4)

1. A Simulink-oriented inter-core communication optimization method for automatically generating multithreaded code, characterized by comprising the following steps:
setting the buffer depth of all communication vectors in the communication vector set C to be 1;
for a certain thread t in the thread set T, when the minimum buffer depth related to t is increased by a unit amount, calculating the memory usage M(t) of the thread, and setting a temporary thread set T' = ∅ before the calculation;
determining whether the memory usage M(t) of the thread exceeds the available memory amount M_avl; if M(t) is greater than M_avl, switching directly to the next thread, and terminating when T' is empty; if M(t) does not exceed M_avl, calculating the total thread switching time S(t) and letting T' = T' ∪ {t}; when T' is not empty, selecting the thread with the smallest M(t)·S(t);
allocating a communication buffer entry to the selected thread's communication vector so that the minimum buffer depth increases by 1, and judging whether all threads have been processed; if not, switching to the next thread; and if the processing is finished, finishing the allocation optimization.
2. The method of claim 1, further comprising determining a buffer depth of inter-processor communication vectors based on maximum buffer usage prior to performing intra-processor iterative optimization;
allocating the intra-processor buffers;
performing inter-processor buffer analysis;
introducing FPGA simulation into the inter-processor buffer allocation flow, and judging whether the refinement of the processors is finished; if the refinement is finished, updating the inter-processor buffers; if the refinement is not finished, judging whether the iteration is finished;
if the iteration is not finished, refining the processors again; if the iteration is finished, the inter-processor buffer allocation is finished.
3. The method of claim 2, further comprising, during the FPGA simulation, inserting software traces into the hardware-dependent software to monitor the states of the processors, threads and buffers;
in the buffer optimization process, the iteration terminates if the target throughput is reached, the maximum number of iterations is reached, or the number of consecutive rejections exceeds a certain threshold.
4. The method of optimizing allocation of communication buffers for automated multi-threaded code generation according to any of claims 1 to 3, further comprising:
giving a graph G = (T, C), where T is a set of threads and C is a set of communication vectors between different threads; if each thread in the graph can reach every other thread through a directed path, G is called a thread strongly connected component, denoted TSCC;
giving a graph G = (B, R), where B is the set of blocks and R is the set of dependencies between different blocks in the thread; if each block in the graph can reach every other block through a directed path, G is called a block strongly connected component, denoted BSCC;
setting, according to the TSCC and BSCC, a communication buffer suited to task topologies with circular dependencies.
CN202010698129.XA 2020-07-21 2020-07-21 Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes Pending CN112015692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010698129.XA CN112015692A (en) 2020-07-21 2020-07-21 Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010698129.XA CN112015692A (en) 2020-07-21 2020-07-21 Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes

Publications (1)

Publication Number Publication Date
CN112015692A (published 2020-12-01)

Family

ID=73498844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010698129.XA Pending CN112015692A (en) 2020-07-21 2020-07-21 Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes

Country Status (1)

Country Link
CN (1) CN112015692A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528583A (en) * 2020-12-18 2021-03-19 广东高云半导体科技股份有限公司 Multithreading comprehensive method and comprehensive system for FPGA development

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003052586A2 (en) * 2001-12-14 2003-06-26 Koninklijke Philips Electronics N.V. Data processing system having multiple processors
CN106980595A (en) * 2014-12-05 2017-07-25 三星半导体(中国)研究开发有限公司 The multiprocessor communication system and its communication means of shared physical memory

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003052586A2 (en) * 2001-12-14 2003-06-26 Koninklijke Philips Electronics N.V. Data processing system having multiple processors
CN106980595A (en) * 2014-12-05 2017-07-25 三星半导体(中国)研究开发有限公司 The multiprocessor communication system and its communication means of shared physical memory

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BILEL BELHADJ MOHAMED: "Intra- and Inter-Processors Memory Size Estimation for Multithreaded MPSoC", 2006 13th IEEE International Conference on Electronics, Circuits and Systems *
NAN WANG: "Interconnection Allocation Between Functional Units and Registers in High-Level", IEEE Transactions on Very Large Scale Integration (VLSI) Systems *
YU Min (余慜): "Research on fine-grained multithreading techniques based on the Simulink model", China Doctoral Dissertations Full-text Database, Information Science and Technology *
MA Mingli (马明理) et al.: "Efficient memory allocation techniques for multithreaded environments", Computer Measurement & Control *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528583A (en) * 2020-12-18 2021-03-19 广东高云半导体科技股份有限公司 Multithreading comprehensive method and comprehensive system for FPGA development
CN112528583B (en) * 2020-12-18 2022-04-01 广东高云半导体科技股份有限公司 Multithreading comprehensive method and comprehensive system for FPGA development

Similar Documents

Publication Publication Date Title
Gupta et al. Program implementation schemes for hardware-software systems
US6493863B1 (en) Method of designing semiconductor integrated circuit
Schliecker et al. Bounding the shared resource load for the performance analysis of multiprocessor systems
CN105205205A (en) Method for FPGA coarse-grained parallel wiring based on optimal division of netlist position information
Chandy et al. An evaluation of parallel simulated annealing strategies with application to standard cell placement
CN111274016B (en) Application partitioning and scheduling method of dynamic partial reconfigurable system based on module fusion
Bambha et al. Intermediate representations for design automation of multiprocessor DSP systems
Lee et al. Mapping and scheduling of tasks and communications on many-core SoC under local memory constraint
CN113822004A (en) Verification method and system for integrated circuit simulation acceleration and simulation
CN109558226B (en) DSP multi-core parallel computing scheduling method based on inter-core interruption
CN112015692A (en) Simulink-oriented inter-core communication optimization method for automatically generating multi-thread codes
Wolf et al. An analysis of fault partitioned parallel test generation
Saleem et al. A Survey on Dynamic Application Mapping Approaches for Real-Time Network-on-Chip-Based Platforms
Stitt et al. Thread warping: a framework for dynamic synthesis of thread accelerators
Kwok Parallel program execution on a heterogeneous PC cluster using task duplication
Uddin et al. Cache-based high-level simulation of microthreaded many-core architectures
Li et al. DBEFT: a dependency-ratio bundling earliest finish time algorithm for heterogeneous computing
Akash Kumar Heuristic for accelerating run-time task mapping in NoC-based heterogeneous MPSoCs
Pessoa et al. Parallel TLM simulation of MPSoC on SMP workstations: Influence of communication locality
Ma et al. Reducing code size in scheduling synchronous dataflow graphs on multicore systems
Kim et al. Optimization of multi-core accelerator performance based on accurate performance estimation
De Munck et al. Design and performance evaluation of a conservative parallel discrete event core for GES
Khandelia et al. Contention-conscious transaction ordering in multiprocessor dsp systems
Arndt et al. Portable implementation of advanced driver-assistance algorithms on heterogeneous architectures
Wen et al. Design Exploration of An Energy-Efficient Acceleration System for CNNs on Low-Cost Resource-Constraint SoC-FPGAs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201201