CN101923492A - Method for executing dynamic allocation command on embedded heterogeneous multi-core - Google Patents

Method for executing dynamic allocation command on embedded heterogeneous multi-core

Info

Publication number
CN101923492A
Authority
CN
China
Prior art keywords
instruction
basic block
core
binary code
processing core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010251261
Other languages
Chinese (zh)
Other versions
CN101923492B (en)
Inventor
过敏意
娄林
伍倩
朱寅
沈耀
马曦
唐飞龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN 201010251261
Publication of CN101923492A
Application granted
Publication of CN101923492B
Status: Expired - Fee Related

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a method, in the technical field of computers, for executing dynamically allocated instructions on an embedded heterogeneous multi-core. The method comprises the following steps: partitioning a binary code program to obtain a number of basic blocks; performing selection for each basic block to obtain the target processing core that will execute it; translating each basic block for its target processing core to obtain the translated binary code on that core; and counting the execution frequency of each basic block, marking blocks whose execution frequency exceeds a threshold T as hot-spot basic blocks, and caching the translated binary code of the hot-spot blocks. The method dynamically distributes instructions to the heterogeneous cores for execution according to factors such as the processing power and load of the system's cores, thereby overcoming static scheduling's inability to allocate resources dynamically while also reducing the complexity of dynamic thread partitioning, and so further improves program execution efficiency on heterogeneous multi-cores.

Description

Method for executing dynamically allocated instructions on an embedded heterogeneous multi-core
Technical field
The present invention relates to a method in the field of computer technology, specifically a method for executing dynamically allocated instructions on an embedded heterogeneous multi-core.
Background art
In recent years, as demand for computing power has grown, CPU processing power has had to improve further. Initially, CPU performance was raised mainly by increasing the core and front-side-bus frequencies and enlarging caches. However, raising the clock frequency poses serious power-consumption and heat-dissipation challenges, merely enlarging caches increases cost, and single-core processor development has reached a limit. To further improve CPU processing power, multi-core processor technology entered the market. Since 2006, companies such as Intel and AMD have released dual-core, quad-core, and other multi-core processors. With multi-core processors fully on the market, the multi-core era of computing has arrived.
Integrating multiple processor cores on a single chip and executing program code fully concurrently improves processing performance without raising the operating frequency, and also addresses the power-consumption problem; it has become the direction of future processor development. However, simply increasing the number of cores does not ideally improve CPU processing power: for parallel processing, what matters most is the parallelism of the program itself, and the serially executed portions of most programs still limit performance gains. It is generally believed that, in such homogeneous multi-core architectures, 4-8 cores may be the limit beyond which a multi-core processor can obtain a worthwhile performance boost.
To further improve CPU processing power, a heterogeneous processor architecture appeared, which integrates multiple heterogeneous processor cores on one chip. In such an architecture, different processor cores achieve higher processing efficiency for specific applications (considering factors such as performance, power consumption, technology, and cost): some cores excel at floating-point computation, for example, while others excel at signal processing. Fully exploiting the functional differences among heterogeneous cores can better satisfy the demands of different applications in different fields. The Cell chip, developed jointly by IBM, Sony, and Toshiba, is a model of such a heterogeneous processor: a multi-core processor with nine hardware cores. In the Cell chip, only one core is a standard IBM Power processor; the remaining eight are coprocessors custom-designed for image processing and used for floating-point computation. The main processor is chiefly responsible for distributing tasks, while the actual floating-point work is done by the coprocessors. Because the Cell coprocessors handle only floating-point tasks, their operation rules, and likewise the corresponding circuit logic, are very simple; as long as the clock frequency is high enough, Cell achieves remarkable floating-point performance.
At present, heterogeneous multi-core processors fall into three classes according to their instruction sets: cores with identical instruction sets but different hardware structures, cores with completely different instruction sets, and cores sharing part of an instruction set. For heterogeneous multi-core processors whose cores share a common base instruction set, the parts that differ are extended instructions designed to suit applications with different demands. To keep pace with the development of heterogeneous multi-core processors, more advanced compilation systems and runtime handling methods must be designed for them, including programming languages, compilation models, and runtime libraries. Work in this area has focused mainly on task-scheduling strategies and the optimization of thread-level parallelism.
Existing static task scheduling mainly takes two forms: compiling different code at compile time into binaries suited to different processing cores, or linking the runtime libraries of different processing cores at run time. Both methods assign tasks statically to specific processing cores, so the isomerism of the different cores must be considered when the program is written, and the programmer must fully understand the heterogeneous cores. Moreover, such methods do not consider issues such as load balancing and power consumption across the heterogeneous cores; improper task assignment can lengthen program execution time, and information such as load balance is only known at run time.
A search of the prior art shows that, to overcome this programming and runtime complexity, thread-level dynamic scheduling methods have appeared. These methods partition a task into threads of various granularities so that different threads can execute concurrently on different cores, thereby improving execution efficiency. However, such methods must consider data dependences, and in practice it is difficult to divide a serially executing program into parallel multi-threaded tasks. In addition, the data dependences among different threads require handling issues such as data synchronization and cache coherence, which can make thread scheduling more complex and cause program performance to decline further.
Summary of the invention
The object of the present invention is to overcome the above deficiencies of the prior art and provide a method for executing dynamically allocated instructions on an embedded heterogeneous multi-core. The present invention dynamically distributes instructions to the heterogeneous cores for execution according to factors such as the processing power and load of the system's cores, thereby remedying static scheduling's inability to allocate resources dynamically, reducing the complexity of dynamic thread partitioning, and further improving program execution efficiency on heterogeneous multi-cores.
The present invention is realized by the following technical solution, comprising the following steps:
Step 1: while the binary code program is being loaded, partition it into a number of basic blocks.
In this partitioning, the program between the i-th entry instruction and the (i+1)-th jump instruction forms the (i+1)-th basic block, where i ≥ 0 and an entry instruction is either the target of a jump instruction or the instruction immediately following one.
Step 2: perform selection for each basic block to obtain the target processing core that will execute it.
The selection comprises the following steps:
2.1) find, for the j-th instruction A in the i-th basic block, an instruction A* with the same function;
2.2) find all processing cores that can execute instruction A or instruction A*, and from these obtain the set I of processing cores that can execute every instruction in the i-th basic block;
2.3) using p = 1/N, obtain the performance with which each core in set I executes each instruction, where p is the core's execution performance for the instruction and N is the number of instruction cycles the core needs to execute it; summing these per-instruction performances gives each core's average performance for executing the i-th basic block;
2.4) using n = P/L, obtain each core's performance-to-load ratio, where n is the performance-to-load ratio, L is the core's current running load, and P is its average performance for the i-th basic block;
2.5) from the set I, select the core with the largest performance-to-load ratio as the target processing core for the i-th basic block.
Step 3: translate the basic block corresponding to the target processing core obtained in Step 2, producing the translated binary code on that target core.
The translation comprises the following steps:
3.1) if the cache already holds code translated for this target processing core, proceed directly to 3.3); otherwise perform 3.2) and then 3.3);
3.2) dynamically translate the basic block into the corresponding binary code for the target processing core;
3.3) if the binary code has a data dependence on an already-executed predecessor basic block, perform 3.4) and then 3.5); otherwise proceed directly to 3.5);
3.4) after the predecessor basic block finishes executing, obtain the execution result from its processing core; if the predecessor was assigned to a different processing core, switch the system context;
3.5) execute the basic block's corresponding binary code on the target processing core.
Step 4: count the execution frequency of each basic block, mark blocks whose execution frequency exceeds a threshold T as hot-spot basic blocks, and cache the translated binary code of the hot-spot blocks in the cache.
Step 5: return to Step 2 and perform dynamic scheduling of the next basic block.
Compared with the prior art, the beneficial effects of the invention are: there is no need to fully understand the instruction characteristics of the heterogeneous cores, so the compiler gains platform transparency and need not code for, or compile different binaries for, different architectures. At the same time, on a heterogeneous multi-core system with a common base instruction set, the method performs dynamic instruction scheduling, distributing instructions to the heterogeneous cores for execution according to factors such as the processing power and load of the system's cores; this remedies static scheduling's inability to allocate resources dynamically, reduces the complexity of dynamic thread partitioning, and further improves program execution efficiency on heterogeneous multi-cores. Because the method needs no thread partitioning, it can parallelize maximally when there are no data dependences, reducing thread-synchronization overhead by roughly 15%.
Description of drawings
Fig. 1 is a flow diagram of the method of the embodiment.
Detailed description of the embodiments
The method of the present invention is further described below with reference to the accompanying drawing. This embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation and concrete operating procedure are given, but the protection scope of the present invention is not limited to the following embodiment.
Embodiment
As shown in Fig. 1, this embodiment comprises the following steps:
Step 1: while the binary code program is being loaded, partition it into a number of basic blocks.
In this partitioning, the program between the i-th entry instruction and the (i+1)-th jump instruction forms the (i+1)-th basic block, where i ≥ 0 and an entry instruction is either the target of a jump instruction or the instruction immediately following one.
In this embodiment, the entry and exit instructions of the basic blocks are determined from the jump instructions of the binary program code. Each basic block is a sequence of sequentially executed instructions, and the instruction code from an entry instruction to the corresponding exit instruction forms one basic block. A block's entry instruction is its first instruction, i.e. the instruction after a jump or the instruction jumped to; its exit instruction is either the last instruction before another block's entry instruction or a jump instruction. The ordering relation among the different basic blocks is preserved.
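The partitioning rule above can be sketched in Python. This is an illustrative toy, not the patent's implementation: instructions are modeled as (mnemonic, jump-target-index) pairs rather than real binary code, and all names are hypothetical.

```python
# Toy sketch of basic-block partitioning. Each instruction is a
# (mnemonic, jump_target_or_None) tuple. Entry instructions are jump
# targets and the instructions immediately following a jump; a block
# runs from one entry instruction up to (not including) the next.
def partition_basic_blocks(instructions):
    n = len(instructions)
    entries = {0}                          # program start opens a block
    for idx, (op, target) in enumerate(instructions):
        if target is not None:             # this is a jump instruction
            entries.add(target)            # the jump target starts a block
            if idx + 1 < n:
                entries.add(idx + 1)       # the fall-through starts a block
    starts = sorted(entries)
    blocks = []
    for k, start in enumerate(starts):
        end = starts[k + 1] if k + 1 < len(starts) else n
        blocks.append(instructions[start:end])
    return blocks
```

Each returned block ends either at a jump (its exit instruction) or just before the next block's entry, matching the entry/exit definition given above.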
Step 2: perform selection for each basic block to obtain the target processing core that will execute it.
The selection comprises the following steps:
2.1) find, for the j-th instruction A in the i-th basic block, an instruction A* with the same function;
2.2) find all processing cores that can execute instruction A or instruction A*, and from these obtain the set I of processing cores that can execute every instruction in the i-th basic block;
2.3) using p = 1/N, obtain the performance with which each core in set I executes each instruction, where p is the core's execution performance for the instruction and N is the number of instruction cycles the core needs to execute it; summing these per-instruction performances gives each core's average performance for executing the i-th basic block;
2.4) using n = P/L, obtain each core's performance-to-load ratio, where n is the performance-to-load ratio, L is the core's current running load, and P is its average performance for the i-th basic block;
2.5) from the set I, select the core with the largest performance-to-load ratio as the target processing core for the i-th basic block.
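A minimal sketch of the selection in steps 2.1)-2.5), under assumed inputs: for each candidate core we are given the cycle count N of every instruction in the block (with None meaning the core cannot execute that instruction, even via an equivalent A*, which excludes it from the set I), together with the core's current load L. The function and data names are illustrative, not from the patent.

```python
# Sketch of target-core selection: p = 1/N per instruction, summed to an
# average performance P per core, then ranked by the ratio n = P / L.
def select_target_core(cycle_tables, loads):
    # cycle_tables: {core: [N_1, N_2, ...]} cycles per instruction, None
    #               if the core cannot execute that instruction at all
    # loads:        {core: L} current running load of each core
    best_core, best_ratio = None, float("-inf")
    for core, cycles in cycle_tables.items():
        if any(n is None for n in cycles):
            continue                        # core is not in the set I
        P = sum(1.0 / n for n in cycles)    # step 2.3): p = 1/N, summed
        ratio = P / loads[core]             # step 2.4): n = P / L
        if ratio > best_ratio:              # step 2.5): largest ratio wins
            best_core, best_ratio = core, ratio
    return best_core
```

A lightly loaded core with good per-instruction cycle counts thus beats a nominally faster but busier core, which is the load-aware behavior the method aims for.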
Step 3: translate the basic block corresponding to the target processing core obtained in Step 2, producing the translated binary code on that target core.
The translation comprises the following steps:
3.1) if the cache already holds code translated for this target processing core, proceed directly to 3.3); otherwise perform 3.2) and then 3.3);
3.2) dynamically translate the basic block into the corresponding binary code for the target processing core;
3.3) if the binary code has a data dependence on an already-executed predecessor basic block, perform 3.4) and then 3.5); otherwise proceed directly to 3.5);
3.4) after the predecessor basic block finishes executing, obtain the execution result from its processing core; if the predecessor was assigned to a different processing core, switch the system context;
3.5) execute the basic block's corresponding binary code on the target processing core.
In this embodiment, binary translation maps the original binary code to instructions of the target processing core. There are two cases: in one, an instruction maps to a similar instruction on the target core, with a one-to-one correspondence and the same function; in the other, the target core has one instruction, or a group of more advantageous instructions, functionally equivalent to a group of instructions in the original basic block, such as Intel's AVX (Advanced Vector Extensions). The translated instruction or instruction group offers higher execution efficiency or a better power ratio.
Step 4: count the execution frequency of each basic block, mark blocks whose execution frequency exceeds a threshold T as hot-spot basic blocks, and cache the translated binary code of the hot-spot blocks in the cache.
In this embodiment, frequently executed code can be fetched directly from the cache, which greatly reduces the overhead of translation mapping.
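Steps 3.1)-3.2) together with the hot-spot caching of Step 4 can be sketched as a small translation cache. The translate callable stands in for the dynamic binary translator, which the patent does not specify; the class, method, and parameter names are hypothetical.

```python
# Sketch of a translation cache keyed by (block_id, core). Blocks executed
# more than a threshold T are marked hot and their translated code is
# retained, so later executions skip retranslation (step 3.1).
class TranslationCache:
    def __init__(self, threshold_T, translate):
        self.T = threshold_T
        self.translate = translate
        self.counts = {}                   # execution frequency per block
        self.cache = {}                    # translations of hot-spot blocks
    def fetch(self, block_id, core, code):
        self.counts[block_id] = self.counts.get(block_id, 0) + 1
        key = (block_id, core)
        if key in self.cache:              # step 3.1): reuse cached code
            return self.cache[key]
        translated = self.translate(code, core)    # step 3.2): translate
        if self.counts[block_id] > self.T:         # step 4: hot-spot block
            self.cache[key] = translated
        return translated
```

Only blocks past the threshold occupy cache space, which matches the patent's aim of spending cache capacity on frequently executed code.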
Step 5: return to Step 2 and perform dynamic scheduling of the next basic block.
During instruction-level dynamic optimization, this embodiment organizes target-core selection, the translation mapping of binary instructions, and code execution on the processing cores into a pipeline, clearly reducing the switching overhead of dispatching instructions to the heterogeneous cores. Compared with dynamic thread-level scheduling, the method needs no thread partitioning and can parallelize maximally when there are no data dependences, reducing thread-synchronization overhead by roughly 15%. In addition, caching the translated target-core code further reduces overhead and improves system performance; running a video-compression routine on a chip with a DSP processing core achieved an average speedup of 1.8x.

Claims (4)

1. A method for executing dynamically allocated instructions on an embedded heterogeneous multi-core, characterized by comprising the following steps:
Step 1: while the binary code program is being loaded, partitioning it into a number of basic blocks;
Step 2: performing selection for each basic block to obtain the target processing core that will execute each basic block;
Step 3: translating the basic block corresponding to the target processing core obtained in Step 2 to obtain the translated binary code on that target core;
Step 4: counting the execution frequency of each basic block, marking blocks whose execution frequency exceeds a threshold T as hot-spot basic blocks, and caching the translated binary code of the hot-spot blocks in the cache;
Step 5: returning to Step 2 to perform dynamic scheduling of the next basic block.
2. the method for carrying out the dynamic assignment instruction on the embedded heterogeneous multi-core according to claim 1, it is characterized in that, division described in the first step is handled, be as i+1 fundamental block with the program between i bar entry instruction and the jump instruction of i+1 bar, wherein: 0≤i, entry instruction is the instruction that forwards to of jump instruction or immediately following the instruction after the jump instruction.
3. the method for carrying out the dynamic assignment instruction on the embedded heterogeneous multi-core according to claim 1 is characterized in that, the selection described in second step is handled, and may further comprise the steps:
2.1) obtain with i fundamental block in j bar instruction A have the instruction A of same function *
2.2) obtain all processing instruction A or instruct A *Process nuclear A, and then handled the process nuclear set I of every instruction in i the fundamental block simultaneously;
2.3) according to p=1/N, obtain the execution performance of every instruction of each process nuclear execution among the process nuclear set I, and each process nuclear is carried out the execution performance addition of every instruction, obtain each process nuclear in this set and carry out the average behavior of i fundamental block, wherein: p is the execution performance of processor execution command, and N is that this processor is carried out the required instruction cycles of this instruction;
2.4) according to n=P/L, obtain the performance load ratio of each process nuclear, wherein: n is the performance load ratio, and L is the running load of this process nuclear, and P is the average behavior that this process nuclear is carried out i fundamental block;
2.5) from process nuclear set I, select performance load than the target processing core of maximum process nuclear as i fundamental block.
4. the method for carrying out the dynamic assignment instruction on the embedded heterogeneous multi-core according to claim 1 is characterized in that, the Translation Processing described in the 3rd step may further comprise the steps:
3.1) when having the binary code translated to this target processing core in the high-speed cache, directly carry out 3.3); Otherwise, carry out 3.2) after carry out 3.3 again);
3.2) binary code of correspondence when this fundamental block is translated into this target processing core processing dynamically;
3.3) when there are data dependency in binary code and the preorder fundamental block carried out, carry out 3.4) and after carry out 3.5 again); Otherwise, directly carry out 3.5);
3.4) after the preorder fundamental block carry out to finish, obtain the execution result on the corresponding process nuclear, when the preorder fundamental block is assigned to when carrying out on other process nuclear, switched system context then;
3.5) binary code of this fundamental block correspondence of execution on this target processing core.
CN 201010251261 2010-08-11 2010-08-11 Method for executing dynamic allocation command on embedded heterogeneous multi-core Expired - Fee Related CN101923492B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010251261 CN101923492B (en) 2010-08-11 2010-08-11 Method for executing dynamic allocation command on embedded heterogeneous multi-core


Publications (2)

Publication Number Publication Date
CN101923492A 2010-12-22
CN101923492B CN101923492B (en) 2013-05-01

Family

ID=43338446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010251261 Expired - Fee Related CN101923492B (en) 2010-08-11 2010-08-11 Method for executing dynamic allocation command on embedded heterogeneous multi-core

Country Status (1)

Country Link
CN (1) CN101923492B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101299194A (en) * 2008-06-26 2008-11-05 上海交通大学 Heterogeneous multi-core system thread-level dynamic dispatching method based on configurable processor
CN101299199A (en) * 2008-06-26 2008-11-05 上海交通大学 Heterogeneous multi-core system based on configurable processor and instruction set extension
CN101329638A (en) * 2007-06-18 2008-12-24 国际商业机器公司 Method and system for analyzing parallelism of program code
CN101387969A (en) * 2008-10-16 2009-03-18 上海交通大学 Dynamic binary translation method for cooperation design of software and hardware
US20090154572A1 (en) * 2007-12-17 2009-06-18 Samsung Electronics Co., Ltd. Method and apparatus for video decoding based on a multi-core processor

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11755099B2 (en) 2011-03-11 2023-09-12 Intel Corporation Dynamic core selection for heterogeneous multi-core systems
CN109960398A (en) * 2011-03-11 2019-07-02 英特尔公司 The dynamic core selection of system is felt concerned about for heterogeneous polynuclear
CN109960398B (en) * 2011-03-11 2023-11-07 英特尔公司 Dynamic core selection for heterogeneous multi-core systems
GB2506048A (en) * 2011-06-28 2014-03-19 Ibm Unified, adaptive RAS for hybrid systems
CN103562873B (en) * 2011-06-28 2016-10-26 国际商业机器公司 For processing the method and system of data in computer systems
US8788871B2 (en) 2011-06-28 2014-07-22 International Business Machines Corporation Unified, workload-optimized, adaptive RAS for hybrid systems
CN103562873A (en) * 2011-06-28 2014-02-05 国际商业机器公司 Unified, adaptive RAS for hybrid systems
GB2506048B (en) * 2011-06-28 2020-10-14 Ibm Unified, adaptive RAS for hybrid systems
WO2013001401A1 (en) * 2011-06-28 2013-01-03 International Business Machines Corporation Unified, adaptive ras for hybrid systems
US8799693B2 (en) 2011-09-20 2014-08-05 Qualcomm Incorporated Dynamic power optimization for computing devices
US9098309B2 (en) * 2011-09-23 2015-08-04 Qualcomm Incorporated Power consumption optimized translation of object code partitioned for hardware component based on identified operations
US20130080805A1 (en) * 2011-09-23 2013-03-28 Qualcomm Incorporated Dynamic partitioning for heterogeneous cores
CN104471557A (en) * 2012-06-18 2015-03-25 谷歌公司 Optimized execution of dynamic languages
CN104471557B (en) * 2012-06-18 2016-11-02 谷歌公司 The optimization of dynamic language performs
WO2014047828A1 (en) * 2012-09-27 2014-04-03 Intel Corporation Method and apparatus to schedule store instructions across atomic regions in binary translation
US9141362B2 (en) 2012-09-27 2015-09-22 Intel Corporation Method and apparatus to schedule store instructions across atomic regions in binary translation
US9405551B2 (en) 2013-03-12 2016-08-02 Intel Corporation Creating an isolated execution environment in a co-designed processor
GB2514221B (en) * 2013-03-12 2017-02-01 Intel Corp Creating an isolated execution environment in a co-designed processor
GB2514221A (en) * 2013-03-12 2014-11-19 Intel Corp Creating an isolated execution environment in a co-designed processor
CN103207772B (en) * 2013-04-07 2016-01-13 Beihang University Instruction prefetch content selection method for optimizing real-time task WCET
CN103207772A (en) * 2013-04-07 2013-07-17 Beihang University Instruction prefetching content selecting method for optimizing WCET (worst-case execution time) of real-time task
CN107250985A (en) * 2015-02-27 2017-10-13 Huawei Technologies Co., Ltd. System and method for heterogeneous computing Application Programming Interface (API)
CN107250985B (en) * 2015-02-27 2020-10-16 Huawei Technologies Co., Ltd. System and method for heterogeneous computing Application Programming Interface (API)
CN105242909A (en) * 2015-11-24 2016-01-13 Wuxi Jiangnan Institute of Computing Technology Method for many-core loop partitioning based on multi-version code generation
CN105242909B (en) * 2015-11-24 2017-08-11 Wuxi Jiangnan Institute of Computing Technology Many-core loop partitioning method based on multi-version code generation
CN106020922B (en) * 2016-05-30 2019-01-08 Hunan University of Science and Technology Instruction scheduling method for filling idle slots in execution packets with jump-target basic blocks
CN106020922A (en) * 2016-05-30 2016-10-12 Hunan University of Science and Technology Instruction scheduling method for filling idle slots in execution packets with jump-target basic blocks
CN107870818B (en) * 2017-10-19 2021-03-02 Rockchip Electronics Co., Ltd. Multi-core processor interrupt dynamic response method and storage medium
CN107870818A (en) * 2017-10-19 2018-04-03 Fuzhou Rockchip Electronics Co., Ltd. Multi-core processor interrupt dynamic response method and storage medium
CN108874727B (en) * 2018-05-29 2019-09-10 National University of Defense Technology GPDSP-oriented multi-core parallel computing implementation method
CN108874727A (en) * 2018-05-29 2018-11-23 National University of Defense Technology GPDSP-oriented multi-core parallel computing implementation method

Also Published As

Publication number Publication date
CN101923492B (en) 2013-05-01

Similar Documents

Publication Publication Date Title
CN101923492B (en) Method for executing dynamic allocation command on embedded heterogeneous multi-core
CN101299194B (en) Heterogeneous multi-core system thread-level dynamic dispatching method based on configurable processor
CN105487838A (en) Task-level parallel scheduling method and system for dynamically reconfigurable processor
CN102981807A (en) Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN104106049A (en) Rasterization of compute shaders
CN108595258A Dynamic expansion method for GPGPU register files
Wang et al. Simultaneous multikernel: Fine-grained sharing of gpus
Chen et al. Guided region-based GPU scheduling: utilizing multi-thread parallelism to hide memory latency
CN102087609A Dynamic binary translation method on a multi-processor platform
CN105242909A Method for many-core loop partitioning based on multi-version code generation
Hu et al. The role of return value prediction in exploiting speculative method-level parallelism
Diaz et al. The supercodelet architecture
Huang et al. Minimizing energy consumption of embedded systems via optimal code layout
Li et al. Code layout optimization for defensiveness and politeness in shared cache
Zheng et al. Performance model for OpenMP parallelized loops
Sampaio et al. Divergence analysis with affine constraints
Mantripragada et al. A new framework for integrated global local scheduling
Tian et al. Optimizing gpu register usage: Extensions to openacc and compiler optimizations
Lima et al. HPSM: a programming framework for multi-cpu and multi-gpu systems
Georgakoudis et al. Fast dynamic binary rewriting to support thread migration in shared-isa asymmetric multicores
Li et al. Code motion for migration minimization in STT-RAM based hybrid cache
Ravi et al. Semi-automatic restructuring of offloadable tasks for many-core accelerators
Jakimovska et al. Modern processor architectures overview
Akram Managed language runtimes on heterogeneous hardware: Optimizations for performance, efficiency and lifetime improvement
Zhang et al. Binary translation to improve energy efficiency through post-pass register re-allocation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130501

Termination date: 20150811

EXPY Termination of patent right or utility model