CN101923492B - Method for executing dynamic allocation command on embedded heterogeneous multi-core - Google Patents

Method for executing dynamic allocation command on embedded heterogeneous multi-core Download PDF

Info

Publication number
CN101923492B
CN101923492B (application CN201010251261A)
Authority
CN
China
Prior art keywords
instruction
basic block
processing core
binary code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201010251261
Other languages
Chinese (zh)
Other versions
CN101923492A (en)
Inventor
过敏意
娄林
伍倩
朱寅
沈耀
马曦
唐飞龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN 201010251261 priority Critical patent/CN101923492B/en
Publication of CN101923492A publication Critical patent/CN101923492A/en
Application granted granted Critical
Publication of CN101923492B publication Critical patent/CN101923492B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a method, in the field of computer technology, for executing dynamically allocated instructions on an embedded heterogeneous multi-core processor. The method comprises the following steps: partitioning a binary code program into a number of basic blocks; performing selection for each basic block to obtain the target processing core that will execute it; translating each basic block for its target processing core to obtain translated binary code for that core; and counting the execution frequency of each basic block, marking blocks whose frequency exceeds a threshold T as hot-spot basic blocks, and caching the translated binary code of hot-spot blocks in the high-speed cache. Because the method dispatches instructions to the heterogeneous cores dynamically, according to factors such as the processing capacity and load of each core, it overcomes the inability of static scheduling to allocate resources dynamically and also reduces the complexity of dynamic thread partitioning, thereby further improving program execution efficiency on heterogeneous multi-cores.

Description

Method for executing dynamic allocation command on embedded heterogeneous multi-core
Technical field
The present invention relates to a method in the field of computer technology, and specifically to a method for executing dynamically allocated instructions on an embedded heterogeneous multi-core processor.
Background technology
In recent years, as the demand for computing power has grown, the processing power of CPUs has had to keep improving. At first, CPU performance was raised mainly by increasing the core clock and front-side-bus frequencies and by enlarging caches. However, raising the clock frequency poses serious power-consumption and heat-dissipation challenges, merely enlarging caches increases cost, and the development of single-core processors has reached a limit. To further improve CPU processing power, multi-core processor technology entered the market. Since 2006, companies such as Intel and AMD have released dual-core, quad-core, and other multi-core processors. With multi-core processors fully on the market, the multi-core era of computing has arrived.
Integrating multiple microprocessor cores on a single chip allows program code to execute fully in parallel; without raising the operating frequency, this not only improves processing performance but also addresses the power-consumption problem, making it the direction of future processor development. However, simply increasing the number of cores does not improve CPU processing power ideally: what matters most for strengthening a processor's parallel processing capability is the parallelism of the program itself, and the serial portions of most programs still limit performance gains. It is generally believed that, in such homogeneous multi-core architectures, 4 to 8 cores may be the limit beyond which adding cores yields little further performance benefit.
To further improve CPU processing power, a heterogeneous processor architecture appeared, integrating multiple dissimilar processor cores on one chip. In this architecture, different processor cores achieve higher processing efficiency for specific applications (taking into account factors such as performance, power consumption, process technology, and cost): some cores excel at floating-point computation, while others excel at tasks such as signal processing. Fully exploiting the functional differences among heterogeneous cores satisfies the demands of different applications in different fields more completely. The Cell chip, jointly developed by IBM, Sony, and Toshiba, is a model of such a heterogeneous processor: it is a multi-core processor with nine hardware cores. In the Cell chip only one core is a standard IBM Power processor; the remaining eight are coprocessors custom-designed for image processing and used for floating-point computation. The main function of the primary processor is to distribute tasks, while the actual floating-point work is done by the coprocessors. Because the Cell coprocessors handle only floating-point tasks, the required operations, and likewise the corresponding circuit logic, are very simple, so as long as the clock frequency is high enough, Cell achieves remarkable floating-point performance.
At present, heterogeneous multi-core processors fall mainly into three classes according to their instruction sets: cores with identical instruction sets but different hardware structures, cores with completely different instruction sets, and cores whose instruction sets partially overlap. For heterogeneous multi-core processors that share a common base instruction set, the distinct extension portions of each core's instruction set are designed to suit applications with different demands. To keep pace with the development of heterogeneous multi-core processors, more advanced compilation systems and run-time handling methods must be designed for them, including programming languages, compilation models, and runtime libraries. Work in this area currently focuses mainly on optimizing task-scheduling strategies and thread-level parallelism.
Existing static task scheduling has two main methods: one compiles different code sections at compile time into binary code suited to different processing cores; the other links the runtime libraries of the different processing cores at run time. Both methods statically assign tasks to specific processing cores for execution, so the heterogeneity of the cores must already be considered when the program is written, and the programmer must understand the heterogeneous cores thoroughly. In addition, these methods ignore issues such as load balancing and power consumption across the heterogeneous cores; improper task assignment may increase program execution time, and information such as the load balance is only known at run time.
A search of the prior art shows that, to overcome this complexity of programming and run time, thread-level dynamic scheduling methods have appeared. These methods partition a task into threads of varying granularity so that different threads can execute concurrently on different cores, improving execution efficiency. However, such methods must consider data dependencies, and it is in practice difficult to convert a serially executed program into parallel multithreaded tasks. Moreover, because of the data dependencies between different threads, data synchronization and cache-coherence issues must also be considered, which may make thread scheduling more complex and cause program performance to decline further.
Summary of the invention
The object of the present invention is to overcome the above deficiencies in the prior art by providing a method for executing dynamically allocated instructions on an embedded heterogeneous multi-core processor. The invention dynamically dispatches instructions to the heterogeneous cores for execution according to factors such as the processing capacity and load of each core in the system, thereby remedying the inability of static scheduling to allocate resources dynamically and also reducing the complexity of dynamic thread partitioning, which further improves program execution efficiency on heterogeneous multi-cores.
The present invention is achieved through the following technical solution, which comprises the following steps:
Step 1: while the binary code program is being loaded, partition it into a number of basic blocks.
The partitioning treats the code between the i-th entry instruction and the (i+1)-th jump instruction as the (i+1)-th basic block, where 0 ≤ i and an entry instruction is either the target of a jump instruction or the instruction immediately following a jump instruction.
Step 2: perform selection for each basic block to obtain the target processing core that will execute it.
The selection comprises the following steps:
2.1) for the j-th instruction A in the i-th basic block, obtain the instruction A* having the same function;
2.2) obtain every processing core that can execute instruction A or instruction A*, and thereby obtain the set I of processing cores that can execute every instruction in the i-th basic block;
2.3) according to p = 1/N, obtain the execution performance of each instruction on each processing core in set I, and sum each core's per-instruction execution performance to obtain that core's average performance for the i-th basic block, where p is the execution performance with which a processor executes an instruction and N is the number of instruction cycles that processor needs to execute it;
2.4) according to n = P/L, obtain each processing core's performance-to-load ratio, where n is the performance-to-load ratio, L is the core's running load, and P is the core's average performance for the i-th basic block;
2.5) from the set I, select the processing core with the largest performance-to-load ratio as the target processing core for the i-th basic block.
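The selection step above can be sketched in code. This is a minimal illustration, not the patent's implementation: the core names, instruction kinds, and cycle counts in `CYCLES` are hypothetical assumptions chosen only to exercise the formulas p = 1/N and n = P/L.

```python
# Hypothetical cycle counts N for each instruction kind on each core.
# A missing entry means the core cannot execute that instruction.
CYCLES = {
    "cpu": {"add": 1, "mul": 3, "fmul": 8},
    "dsp": {"add": 1, "mul": 2, "fmul": 2},
}

def average_performance(block, core):
    """Step 2.3: sum p = 1/N over the block's instructions.
    Returns None if the core cannot execute some instruction
    (i.e. the core is not in set I for this block)."""
    total = 0.0
    for instr in block:
        n = CYCLES[core].get(instr)
        if n is None:
            return None
        total += 1.0 / n
    return total

def select_target_core(block, loads):
    """Steps 2.4-2.5: pick the core in set I with the largest
    performance-to-load ratio n = P/L.
    `loads` maps core name -> current running load L (> 0)."""
    best_core, best_ratio = None, -1.0
    for core in CYCLES:
        p = average_performance(block, core)
        if p is None:
            continue
        ratio = p / loads[core]
        if ratio > best_ratio:
            best_core, best_ratio = core, ratio
    return best_core
```

With equal loads, a floating-point-heavy block goes to the hypothetical "dsp" core; once that core's load grows large enough, the same block is routed to the "cpu" core instead.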
Step 3: translate the basic block for the target processing core obtained in step 2, obtaining binary code translated for that core.
The translation comprises the following steps:
3.1) when the high-speed cache already holds binary code translated for this target processing core, go directly to 3.3); otherwise perform 3.2) and then 3.3);
3.2) dynamically translate the basic block into the corresponding binary code of the target processing core;
3.3) when the translated binary code has a data dependency on a predecessor basic block, perform 3.4) and then 3.5); otherwise go directly to 3.5);
3.4) after the predecessor basic block finishes executing, obtain its execution result from the corresponding processing core; when the predecessor block was assigned to execute on a different processing core, switch the system context;
3.5) execute the binary code corresponding to this basic block on the target processing core.
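The translate-then-execute flow of steps 3.1-3.5 can be sketched as follows. This is an illustrative skeleton only: `translate` is a hypothetical stand-in for the real binary-translation machinery, and the "execution result" is modeled as a plain dictionary entry rather than a real context switch.

```python
# (block_id, core) -> translated code, i.e. the translation cache of step 3.1
translation_cache = {}

def translate(block, core):
    # Placeholder for step 3.2: tag each instruction with the target core.
    return [f"{core}:{instr}" for instr in block]

def run_block(block_id, block, core, depends_on=None, results=None):
    """Steps 3.1-3.5 for one basic block. `depends_on` names a predecessor
    block whose result must be available first (steps 3.3/3.4)."""
    results = results if results is not None else {}
    key = (block_id, core)
    if key not in translation_cache:          # 3.1: cache miss
        translation_cache[key] = translate(block, core)  # 3.2
    if depends_on is not None:                # 3.3: data dependency
        _ = results[depends_on]               # 3.4: consume predecessor result
    # 3.5: "execute" the translated code and record the result
    results[block_id] = f"ran {len(translation_cache[key])} instrs on {core}"
    return results[block_id]
```

A second call with the same `(block_id, core)` pair skips retranslation and reuses the cached code, which is the point of step 3.1.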
Step 4: count the execution frequency of each basic block, mark any basic block whose execution frequency exceeds a threshold T as a hot-spot basic block, and cache the translated binary code of hot-spot blocks in the high-speed cache.
Step 5: return to step 2 for dynamic scheduling of the next basic block.
Compared with the prior art, the beneficial effects of the invention are as follows: the instruction characteristics of the heterogeneous cores need not be fully understood, so the compiler gains platform transparency, and there is no need to write or compile different binary codes for different architectures. At the same time, in a heterogeneous multi-core system with a common base instruction set, the method performs dynamic instruction scheduling, dispatching instructions to the heterogeneous cores for execution according to factors such as the system's per-core processing capacity and load; this remedies the inability of static scheduling to allocate resources dynamically and also reduces the complexity of dynamic thread partitioning, further improving program execution efficiency on heterogeneous multi-cores. The method requires no thread partitioning and, in the absence of data dependencies, can parallelize execution to the greatest extent, thereby reducing thread-synchronization overhead by about 15%.
Description of drawings
Fig. 1 is a flow diagram of the method of the embodiment.
Embodiment
The method of the present invention is further described below with reference to the drawing. The embodiment is implemented on the premise of the technical solution of the invention and gives a detailed implementation and a concrete operating procedure, but the protection scope of the invention is not limited to the following embodiment.
Embodiment
As shown in Fig. 1, the embodiment comprises the following steps:
Step 1: while the binary code program is being loaded, partition it into a number of basic blocks.
The partitioning treats the code between the i-th entry instruction and the (i+1)-th jump instruction as the (i+1)-th basic block, where 0 ≤ i and an entry instruction is either the target of a jump instruction or the instruction immediately following a jump instruction.
In this embodiment, the jump instructions of the binary program code define the entry and exit instructions of the basic blocks. Each basic block is a group of sequentially executed instructions, and the code from an entry instruction to the corresponding exit instruction is divided into one basic block. The entry instruction of a block is its first instruction, i.e. the instruction after a jump instruction or the target of a jump; the exit instruction is the instruction immediately preceding another block's entry instruction, or a jump instruction. The ordering relation among the different basic blocks is preserved.
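The leader-based partitioning described above can be sketched as follows. This is a simplified illustration assuming the binary has already been decoded into an instruction list where the positions and targets of jump instructions are known; the representation is an assumption, not the patent's data structure.

```python
def split_basic_blocks(instrs, jump_targets, jump_indices):
    """Partition an instruction list into basic blocks.
    Entry instructions (leaders) are jump targets and the instructions
    immediately after a jump; each block runs from one leader up to,
    but not including, the next leader."""
    leaders = {0}                                   # program entry
    leaders |= set(jump_targets)                    # targets of jumps
    leaders |= {i + 1 for i in jump_indices if i + 1 < len(instrs)}
    order = sorted(leaders)
    blocks = []
    for k, start in enumerate(order):
        end = order[k + 1] if k + 1 < len(order) else len(instrs)
        blocks.append(instrs[start:end])            # block ends at its exit
    return blocks
```

For example, a five-instruction stream with one jump at index 2 targeting index 0 splits into two blocks: the first ends at the jump (its exit instruction), and the second starts at the instruction after the jump.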
Step 2: perform selection for each basic block to obtain the target processing core that will execute it.
The selection comprises the following steps:
2.1) for the j-th instruction A in the i-th basic block, obtain the instruction A* having the same function;
2.2) obtain every processing core that can execute instruction A or instruction A*, and thereby obtain the set I of processing cores that can execute every instruction in the i-th basic block;
2.3) according to p = 1/N, obtain the execution performance of each instruction on each processing core in set I, and sum each core's per-instruction execution performance to obtain that core's average performance for the i-th basic block, where p is the execution performance with which a processor executes an instruction and N is the number of instruction cycles that processor needs to execute it;
2.4) according to n = P/L, obtain each processing core's performance-to-load ratio, where n is the performance-to-load ratio, L is the core's running load, and P is the core's average performance for the i-th basic block;
2.5) from the set I, select the processing core with the largest performance-to-load ratio as the target processing core for the i-th basic block.
Step 3: translate the basic block for the target processing core obtained in step 2, obtaining binary code translated for that core.
The translation comprises the following steps:
3.1) when the high-speed cache already holds binary code translated for this target processing core, go directly to 3.3); otherwise perform 3.2) and then 3.3);
3.2) dynamically translate the basic block into the corresponding binary code of the target processing core;
3.3) when the translated binary code has a data dependency on a predecessor basic block, perform 3.4) and then 3.5); otherwise go directly to 3.5);
3.4) after the predecessor basic block finishes executing, obtain its execution result from the corresponding processing core; when the predecessor block was assigned to execute on a different processing core, switch the system context;
3.5) execute the binary code corresponding to this basic block on the target processing core.
In this embodiment, binary translation technology maps the original binary code to instructions of the target processing core, in two cases. In one case, the original instruction and the instruction it maps to on the target core are structurally similar; they have a one-to-one correspondence and perform the same function. In the other case, the target core offers one instruction, or a group of more advantageous instructions, functionally equivalent to a group of instructions in the original basic block, such as Intel's AVX (Advanced Vector Extensions). The translated instruction or instruction group has higher execution efficiency or a better power ratio.
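The two mapping cases above can be sketched with a small rewrite pass. All opcode names here (`tadd`, `tfma`, etc.) are hypothetical placeholders standing in for real source and target instruction sets.

```python
# Case 1: one-to-one mapping of structurally similar instructions.
ONE_TO_ONE = {"add": "tadd", "mul": "tmul", "fmul": "tfmul", "fadd": "tfadd"}

# Case 2: a group of source instructions replaced by one more advantageous
# target instruction (in the spirit of fused/vector extensions such as AVX).
GROUP_RULES = [(("fmul", "fadd"), "tfma")]

def map_block(block):
    """Translate a basic block, preferring fused group mappings."""
    out, i = [], 0
    while i < len(block):
        for pattern, fused in GROUP_RULES:
            if tuple(block[i:i + len(pattern)]) == pattern:
                out.append(fused)        # case 2: many-to-one
                i += len(pattern)
                break
        else:
            out.append(ONE_TO_ONE[block[i]])  # case 1: one-to-one
            i += 1
    return out
```

A multiply followed by an add collapses into the single hypothetical fused instruction, while other instructions translate one for one.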
Step 4: count the execution frequency of each basic block, mark any basic block whose execution frequency exceeds a threshold T as a hot-spot basic block, and cache the translated binary code of hot-spot blocks in the high-speed cache.
The embodiment can then fetch frequently executed code directly from the high-speed cache, greatly reducing the overhead of translation and mapping.
Step 5: return to step 2 for dynamic scheduling of the next basic block.
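The hot-spot bookkeeping of step 4 can be sketched as follows. The threshold value and the data structures are illustrative assumptions; the patent only specifies that blocks executed more often than a threshold T are cached.

```python
from collections import Counter

T = 3                    # hypothetical hot-spot threshold
exec_count = Counter()   # per-block execution frequency
hot_cache = {}           # block_id -> translated code kept in the fast cache

def record_execution(block_id, translated_code):
    """Count one execution of a block; once its frequency exceeds T,
    mark it as a hot spot and cache its translated code.
    Returns True if the block is now served from the hot cache."""
    exec_count[block_id] += 1
    if exec_count[block_id] > T and block_id not in hot_cache:
        hot_cache[block_id] = translated_code
    return block_id in hot_cache
```

The first T executions pay the normal path; from the (T+1)-th execution onward the translated code is available from the cache, which is where the reduction in translation-mapping overhead comes from.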
During instruction-level dynamic optimization, the embodiment forms a pipeline out of target-core selection, binary instruction translation and mapping, and execution of the code on the processing cores, which markedly reduces the switching overhead of dispatching instructions to the heterogeneous cores. Compared with dynamic thread-level scheduling methods, this method requires no thread partitioning and, in the absence of data dependencies, can parallelize execution to the greatest extent, thereby reducing thread-synchronization overhead by about 15%. In addition, caching the translated target-core code further reduces overhead and improves system performance; running a video-compression routine on a chip with a DSP processing core achieved an average speed-up of 1.8x.

Claims (2)

1. A method for executing dynamically allocated instructions on an embedded heterogeneous multi-core processor, characterized in that it comprises the following steps:
Step 1: while the binary code program is being loaded, partition it into a number of basic blocks;
Step 2: perform selection for each basic block to obtain the target processing core that will execute it, specifically comprising the following steps:
2.1) for the j-th instruction A in the i-th basic block, obtain the instruction A* having the same function;
2.2) obtain every processing core B that can execute instruction A or instruction A*, and thereby obtain the set I of processing cores that can execute every instruction in the i-th basic block;
2.3) according to p = 1/N, obtain the execution performance of each instruction on each processing core in set I, and sum each core's per-instruction execution performance to obtain that core's average performance for the i-th basic block, where p is the execution performance with which a processor executes an instruction and N is the number of instruction cycles that processor needs to execute it;
2.4) according to n = P/L, obtain each processing core's performance-to-load ratio, where n is the performance-to-load ratio, L is the core's running load, and P is the core's average performance for the i-th basic block;
2.5) from the set I, select the processing core with the largest performance-to-load ratio as the target processing core for the i-th basic block;
Step 3: translate the basic block for the target processing core obtained in step 2, obtaining binary code translated for that core;
Step 4: count the execution frequency of each executed basic block, mark any basic block whose execution frequency exceeds a threshold T as a hot-spot basic block, and cache the translated binary code of hot-spot blocks in the high-speed cache;
Step 5: return to step 2 for dynamic scheduling of the next basic block; the partitioning in step 1 treats the code between the i-th entry instruction and the (i+1)-th jump instruction as the (i+1)-th basic block, where 0 ≤ i and an entry instruction is either the target of a jump instruction or the instruction immediately following a jump instruction.
2. the method for executing dynamic allocation command on embedded heterogeneous multi-core according to claim 1 is characterized in that, the translation described in the 3rd step is processed, and may further comprise the steps:
3.1) when having the binary code translated to this target processing core in the high-speed cache, directly carry out 3.3); Otherwise, carry out 3.2) after carry out again 3.3);
3.2) corresponding binary code when this fundamental block is translated into dynamically this target processing core and processed;
3.3) when there are data dependency in binary code and the preorder fundamental block carried out, carry out 3.4) and after carry out again 3.5); Otherwise, directly carry out 3.5);
3.4) after the preorder fundamental block carry out to finish, processed accordingly the execution result on the nuclear, when the binary code of preorder fundamental block is assigned to when carrying out on other target processing core, switched system context then;
3.5) carry out binary code corresponding to this fundamental block in this target processing core.
CN 201010251261 2010-08-11 2010-08-11 Method for executing dynamic allocation command on embedded heterogeneous multi-core Expired - Fee Related CN101923492B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010251261 CN101923492B (en) 2010-08-11 2010-08-11 Method for executing dynamic allocation command on embedded heterogeneous multi-core


Publications (2)

Publication Number Publication Date
CN101923492A CN101923492A (en) 2010-12-22
CN101923492B true CN101923492B (en) 2013-05-01

Family

ID=43338446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010251261 Expired - Fee Related CN101923492B (en) 2010-08-11 2010-08-11 Method for executing dynamic allocation command on embedded heterogeneous multi-core

Country Status (1)

Country Link
CN (1) CN101923492B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8683243B2 (en) 2011-03-11 2014-03-25 Intel Corporation Dynamic core selection for heterogeneous multi-core systems
US8788871B2 (en) * 2011-06-28 2014-07-22 International Business Machines Corporation Unified, workload-optimized, adaptive RAS for hybrid systems
US8799693B2 (en) 2011-09-20 2014-08-05 Qualcomm Incorporated Dynamic power optimization for computing devices
US9098309B2 (en) * 2011-09-23 2015-08-04 Qualcomm Incorporated Power consumption optimized translation of object code partitioned for hardware component based on identified operations
US9104449B2 (en) * 2012-06-18 2015-08-11 Google Inc. Optimized execution of dynamic languages
CN106170761B (en) 2012-09-27 2019-05-10 英特尔公司 Method and apparatus for dispatching store instruction across multiple atomic regions in Binary Conversion
US9405551B2 (en) * 2013-03-12 2016-08-02 Intel Corporation Creating an isolated execution environment in a co-designed processor
CN103207772B (en) * 2013-04-07 2016-01-13 北京航空航天大学 A kind of instruction prefetch content selection method optimizing real-time task WCET
WO2016134784A1 (en) * 2015-02-27 2016-09-01 Huawei Technologies Co., Ltd. Systems and methods for heterogeneous computing application programming interfaces (api)
CN105242909B (en) * 2015-11-24 2017-08-11 无锡江南计算技术研究所 A kind of many-core cyclic blocking method based on multi version code building
CN106020922B (en) * 2016-05-30 2019-01-08 湖南科技大学 The instruction dispatching method of idle beat is filled with the execution packet of jump target basic block
CN107870818B (en) * 2017-10-19 2021-03-02 瑞芯微电子股份有限公司 Multi-core processor interrupt dynamic response method and storage medium
CN108874727B (en) * 2018-05-29 2019-09-10 中国人民解放军国防科技大学 GPDSP-oriented multi-core parallel computing implementation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101299194A (en) * 2008-06-26 2008-11-05 上海交通大学 Heterogeneous multi-core system thread-level dynamic dispatching method based on configurable processor
CN101299199A (en) * 2008-06-26 2008-11-05 上海交通大学 Heterogeneous multi-core system based on configurable processor and instruction set extension
CN101329638A (en) * 2007-06-18 2008-12-24 国际商业机器公司 Method and system for analyzing parallelism of program code
CN101387969A (en) * 2008-10-16 2009-03-18 上海交通大学 Dynamic binary translation method for cooperation design of software and hardware
US20090154572A1 (en) * 2007-12-17 2009-06-18 Samsung Electronics Co., Ltd. Method and apparatus for video decoding based on a multi-core processor


Also Published As

Publication number Publication date
CN101923492A (en) 2010-12-22

Similar Documents

Publication Publication Date Title
CN101923492B (en) Method for executing dynamic allocation command on embedded heterogeneous multi-core
CN102981807B (en) Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
US20140304490A1 (en) Information processing device and information processing method
CN101299194B (en) Heterogeneous multi-core system thread-level dynamic dispatching method based on configurable processor
US20140304491A1 (en) Processor system and accelerator
CN104106049A (en) Rasterization of compute shaders
CN108595258A (en) A kind of GPGPU register files dynamic expansion method
CN102193779A (en) MPSoC (multi-processor system-on-chip)-oriented multithread scheduling method
CN103279445A (en) Computing method and super-computing system for computing task
Wang et al. Simultaneous multikernel: Fine-grained sharing of gpus
CN102087609A (en) Dynamic binary translation method under multi-processor platform
Salamy et al. An effective solution to task scheduling and memory partitioning for multiprocessor system-on-chip
KR20170024898A (en) Scheme for dynamic controlling of processing device based on application characteristics
CN105242909A (en) Method for many-core circulation partitioning based on multi-version code generation
Diaz et al. The supercodelet architecture
Huang et al. Minimizing energy consumption of embedded systems via optimal code layout
Tian et al. Optimizing gpu register usage: Extensions to openacc and compiler optimizations
Mantripragada et al. A new framework for integrated global local scheduling
Lima et al. HPSM: a programming framework for multi-cpu and multi-gpu systems
CN110262884B (en) Running method for multi-program multi-data-stream partition parallel in core group based on Shenwei many-core processor
Li et al. Code motion for migration minimization in STT-RAM based hybrid cache
Georgakoudis et al. Fast dynamic binary rewriting to support thread migration in shared-isa asymmetric multicores
CN113448586A (en) Integration of automated compiler data flow optimization
Kaouane et al. SysCellC: Systemc on cell
Nikov et al. High-performance simultaneous multiprocessing for heterogeneous System-on-Chip

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130501

Termination date: 20150811

EXPY Termination of patent right or utility model