CN101561766A - Low-overhead block synchronization method supporting multi-core helper threads - Google Patents

Low-overhead block synchronization method supporting multi-core helper threads

Info

Publication number
CN101561766A
Authority
CN
China
Prior art keywords
thread
prefetching
computation thread
helper thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2009100856020A
Other languages
Chinese (zh)
Other versions
CN101561766B (en)
Inventor
古志民
郑宁汉
张轶
黄艳
唐洁
刘昌定
陈嘉
周伟峰
张博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN2009100856020A priority Critical patent/CN101561766B/en
Publication of CN101561766A publication Critical patent/CN101561766A/en
Application granted granted Critical
Publication of CN101561766B publication Critical patent/CN101561766B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Multi Processors (AREA)

Abstract

The invention relates to a low-overhead block synchronization method supporting multi-core helper threads, which belongs to the technical field of multi-core computers. On a multi-core architecture with a shared cache, and aiming at the problem of irregular data misses in multi-core applications, the method introduces a run-ahead mechanism, low-overhead block synchronization, and loop control for the prefetching helper thread. It reduces data misses during the execution of the computation thread, reduces pollution of the shared cache, improves the execution performance of the computation thread, and achieves cross-core cooperative pushing of irregular data. The method can be widely applied in future multi-core compiler optimization and database performance optimization.

Description

A low-overhead block synchronization method supporting multi-core helper threads
Technical field
The present invention relates to a low-overhead block synchronization method supporting multi-core helper threads, and belongs to the technical field of multi-core computers.
Background art
Chip multi-processor (CMP) technology integrates multiple computing cores into a single processor chip and uses multithreading to improve the parallel execution performance of applications. According to Amdahl's law, the performance of a parallelized program is bounded by the performance of its serial part, and the overhead of long-latency memory accesses in the serial part seriously degrades application performance.
Typically, a chip multi-processor architecture has a shared L2 cache (Level 2 Cache) or last-level cache (Last Level Cache). Conventional hardware prefetching can prefetch the regular data of an application (such as regular arrays) and deliver it into the shared cache in advance; when the computation thread later accesses this regular data, it usually finds the data already in the shared cache and no further memory access is needed. However, for irregular data whose addresses are not contiguous (such as linked lists or irregularly accessed arrays), the discontinuity of the access addresses prevents conventional hardware prefetching from obtaining accurate prefetch addresses, so it has no prefetching effect. For this case, the prefetching helper thread method has been proposed: a prefetching helper thread is extracted from the computation thread and run on an idle core, where it dynamically stays ahead of the data accesses of the computation thread, so that the data is pushed into the shared cache in time before the computation thread accesses it, improving the performance of serial code.
A compiler or programmer can generate a computation thread and a prefetching helper thread for an application; the helper thread runs on one core, the computation thread runs on another core, and the irregular data needed by the computation thread is pushed into the shared cache by the helper thread. The computation thread and the helper thread must cooperate: the helper thread must not run too fast, or data prefetched too early may be evicted by the cache replacement algorithm before it is needed; it must also not run too slow, or the computation thread will already have accessed the data and the prefetch becomes useless. The computation thread needs to know the progress of its helper thread, and likewise the helper thread needs to know the progress of its computation thread. To determine each other's position, synchronization operations must be added to both threads. In conventional synchronization methods, even ignoring thread-scheduling cost, the helper thread synchronizes with the computation thread once for every data access. This fine-grained synchronization brings serious synchronization overhead, which can cancel out the benefit the helper thread gains by reducing irregular data misses and long-latency memory accesses, so the execution performance of the computation thread is not improved.
Summary of the invention
The objective of the invention is to overcome the problems above by proposing a low-overhead block synchronization method supporting multi-core helper threads for prefetching irregular data. The basic idea is: on a multi-core architecture with a shared cache, and aiming at the problem of irregular data misses in multi-core applications, a run-ahead distance and a low-overhead block synchronization mechanism are introduced for the prefetching helper thread, so as to reduce data misses during the execution of the computation thread, reduce the synchronization overhead between the helper thread and the computation thread, and improve the execution performance of the computation thread. The invention can be widely applied in multi-core compiler optimization, database performance optimization, and so on.
To explain the terms used in the steps of the method, the following definitions are given first:
Definition 1: Current prefetch position
In the code of a computation thread, the address of the irregular data that currently needs to be prefetched is called the current prefetch position;
Definition 2: Computation amount of the current prefetch position
In the code of a computation thread, the execution time of the code between the current prefetch position and the next prefetch position is called the computation amount of the current prefetch position. If this time is 0, it is the case of no computation; if this time is very small, e.g. less than a few tens of clock cycles, it is the case of little computation;
Definition 3: Computation slice of a computation thread
In a computation thread, a code region containing a large number of irregular data misses is called a computation slice of the computation thread;
Definition 4: Shared-cache miss data stream
For a computation slice of a computation thread that causes a large number of consecutive data misses in the shared cache, let miss1, miss2, ..., missN denote the addresses of the missed data; the data access sequence corresponding to the address stream from miss1 to missN is called the shared-cache miss data stream.
Definition 5: Historical address information
In the computation thread or the prefetching helper thread, the addresses of certain jump cursors must be remembered so that the associated pointers can jump ahead efficiently; these retained addresses are called historical address information. Specifically, for a linked list, ceil(list length / k) pointers need to be saved (any fractional part rounded up), namely the head pointer, the (k+1)-th pointer, the (2k+1)-th pointer, and so on; an array subscript can be regarded as a special case of this. Here k is a positive integer obtained from step 1 of the implementation steps below.
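As an illustration of Definition 5, the following minimal C sketch (not taken from the patent; the node type and all names are assumptions) saves every k-th pointer of a linked list as historical address information:

#include <stddef.h>

typedef struct node {
    int i_data;
    struct node *next;
} node_t;

/* Save the head pointer, the (k+1)-th pointer, the (2k+1)-th pointer, ...
   into hist[]; returns the number of pointers saved (at most cap). */
size_t save_history(node_t *head, int k, node_t **hist, size_t cap)
{
    size_t n = 0;
    long i = 0;
    for (node_t *p = head; p != NULL && n < cap; p = p->next, i++) {
        if (i % k == 0)              /* list positions 0, k, 2k, ... */
            hist[n++] = p;
    }
    return n;
}

For a list of length 10000 and k = 20, this saves 500 pointers, matching the worked example in the embodiment below (hist[j] points at list position j*k).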
The overall framework design flow of the low-overhead block synchronization method supporting multi-core helper threads of the present invention is shown in Fig. 1; the implementation steps are as follows:
Step 1: Construct the run-ahead distance of the prefetching helper thread
On the basis of the term definitions above, the run-ahead distance of the prefetching helper thread is constructed. The basic idea is: when prefetching irregular data, the existing historical address information is used to dynamically keep the prefetch work pointer of the helper thread k positions ahead of the current work pointer of the computation thread. In this way, whether the computation thread and the helper thread are at the start or at a synchronization point, the helper thread dynamically stays ahead of the computation thread's data accesses, and the data can be pushed into the shared cache in time before the computation thread accesses it. The main construction steps are as follows (a code sketch follows the steps):
Step (1): In the computation thread, estimate the computation amount of the current prefetch position; in the case of no or little computation, go to step (2), otherwise go to step (4);
Step (2): To guarantee that the helper thread leads the computation thread by a time Δt, find, ahead of the current prefetch position of the computation thread, the position Δt of code computation away, and in the helper thread code adjust the current prefetch work pointer to be k positions ahead of the current work pointer of the computation thread (for an array subscript, the subscript value is increased by k; for a linked list, the prefetch work pointer is set to the k-th pointer after the current list pointer). Here Δt and k are related by:
Δt = f(k) = k * MissPenalty + c0
where k is a positive integer, which can be determined from the estimated or measured value of Δt;
MissPenalty is the cost of one long-latency memory access;
c0 is a preset constant;
Step (3): When the computation slice of this data push ends, go to step (4); at a synchronization point, go to step (2);
Step (4): End.
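Continuing the sketch above (an illustration only; advance_prefetch_ptr and its contract are assumptions, not the patent's code), step (2) for a linked list could adjust the prefetch work pointer like this:

/* Set the helper thread's prefetch work pointer k nodes ahead of the
   computation thread's position comp_pos, using the history table
   built by save_history (hist[j] points at list position j*k). */
node_t *advance_prefetch_ptr(node_t **hist, size_t nhist,
                             size_t comp_pos, int k)
{
    size_t target = comp_pos + k;          /* run ahead by k positions */
    size_t j = target / k;                 /* nearest saved pointer at or before target */
    if (j >= nhist)
        return NULL;                       /* would run past the end of the list */
    node_t *p = hist[j];
    for (size_t i = j * (size_t)k; i < target && p != NULL; i++)
        p = p->next;                       /* walk the remaining (fewer than k) links */
    return p;
}

The value of k itself follows from Δt = k * MissPenalty + c0; for instance, with Δt = 6000 cycles, MissPenalty = 300 cycles, and c0 = 0, k = 20, as computed in the embodiment below.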
Step 2: Select a low-overhead block synchronization mechanism
On the basis of the run-ahead distance constructed in step 1, step 2 selects a low-overhead block synchronization mechanism.
The low-overhead block synchronization mechanism comes in two variants, one of which can be selected according to test cases:
A. Double-counter block synchronization mechanism
The basic idea of the low-overhead block synchronization mechanism is: the data access stream that causes shared-cache misses in the computation thread is divided sequentially into blocks, and the size of each block is set according to the test results of the application example; synchronization operations occur only at block boundaries, lowering synchronization cost by relaxing synchronization precision. The run-ahead distance of the helper thread and its dynamic maintenance are provided by step 1. With pushsize denoting the block size, both the computation thread and the helper thread have a counter of their own; each time a thread accesses one item of the shared-cache miss data stream, it increments its counter by 1. When a counter reaches pushsize, the two threads must synchronize; if the other thread has not yet reached the same progress, this thread blocks and waits until the other thread synchronizes with it. The operation of the block synchronization mechanism consists of the computation-slice steps of the computation thread and the steps of the prefetching helper thread (a code sketch follows the two step lists):
1. The computation-slice steps of the computation thread are as follows:
Step (1): Begin;
Step (2): Reset the counter to 0; the computation thread begins to cooperate with the helper thread;
Step (3): Read data, increment the counter by 1, and compute;
Step (4): If the computation slice has ended, go to step (6);
Step (5): If the counter value is greater than pushsize, go to step (2); otherwise go to step (3);
Step (6): End.
2. The steps of the prefetching helper thread are as follows:
Step (1): Begin;
Step (2): Reset the counter to 0; the helper thread begins to cooperate with the computation thread;
Step (3): Right after a synchronization, push prefetch data using the current work pointer adjusted by the run-ahead distance; otherwise push prefetch data using the helper thread's next work pointer; increment the counter by 1;
Step (4): If the computation has ended, go to step (6);
Step (5): If the counter value is greater than pushsize, go to step (2); otherwise go to step (3);
Step (6): End.
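The following minimal C sketch of the double-counter mechanism continues the sketches above and uses a POSIX-semaphore rendezvous at block boundaries, as the embodiment below does with sem_post/sem_wait; PUSHSIZE, consume(), and the end-of-slice handshake details are assumptions:

#include <semaphore.h>
#include <stddef.h>

#define PUSHSIZE 600                      /* block size; tuned per application */
sem_t main_sem, push_sem;                 /* both initialized to 0 before the threads start */
extern void consume(int v);               /* stands for the per-item computation */

void compute_slice(node_t *head)          /* computation thread */
{
    long counter = 0;
    sem_post(&main_sem); sem_wait(&push_sem);        /* begin cooperation */
    for (node_t *p = head; p != NULL; p = p->next) {
        consume(p->i_data);                          /* read data, count, compute */
        if (++counter > PUSHSIZE) {                  /* block boundary: resynchronize */
            counter = 0;
            sem_post(&main_sem); sem_wait(&push_sem);
        }
    }
}

void push_slice(node_t **hist, size_t nhist, int k)  /* prefetching helper thread */
{
    long push_counter = 0;
    size_t pos = 0;                                  /* mirrors the computation thread's progress */
    sem_post(&push_sem); sem_wait(&main_sem);        /* begin cooperation */
    node_t *p = advance_prefetch_ptr(hist, nhist, pos, k);
    while (p != NULL) {
        __builtin_prefetch(p);                       /* push the node toward the shared cache */
        p = p->next; pos++;
        if (++push_counter > PUSHSIZE) {             /* block boundary: resynchronize */
            push_counter = 0;
            sem_post(&push_sem); sem_wait(&main_sem);
            p = advance_prefetch_ptr(hist, nhist, pos, k);  /* re-apply the run-ahead distance */
        }
    }
    /* a final handshake for slice termination is omitted for brevity */
}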
B. Single-counter block synchronization mechanism
The basic idea of the single-counter block synchronization mechanism is: the computation thread is never blocked by synchronization and no extra synchronization operations are added to it; only the prefetching helper thread has a counter. When the counter reaches pushsize, the helper thread sets its push pointer equal to the computation thread's pointer, and the run-ahead distance of the helper thread and its dynamic maintenance are provided by step 1. The operation of the single-counter block synchronization mechanism consists of the computation-slice steps of the computation thread and the steps of the prefetching helper thread (a code sketch follows the two step lists):
1. The computation-slice steps of the computation thread are as follows:
Step (1): Begin;
Step (2): The computation thread begins to cooperate with the helper thread;
Step (3): Read data and compute;
Step (4): If the computation slice has ended, go to step (5); otherwise go to step (3);
Step (5): End.
2. The steps of the prefetching helper thread are as follows:
Step (1): Begin;
Step (2): The helper thread begins to cooperate with the computation thread;
Step (3): Reset the counter to 0 and obtain the current work pointer of the computation thread;
Step (4): Right after a synchronization, push prefetch data using the current work pointer adjusted by the run-ahead distance; otherwise push prefetch data using the helper thread's next work pointer; increment the counter by 1;
Step (5): If the computation has ended, go to step (7);
Step (6): If the counter value is greater than pushsize, go to step (3); otherwise go to step (4);
Step (7): End.
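A corresponding minimal C sketch of the single-counter variant (again an assumption-laden illustration reusing the earlier names): the computation thread only publishes its current pointer and never blocks; the helper thread holds the sole counter and periodically resets itself from the published pointer:

node_t * volatile comp_cursor;            /* set to the list head before both threads start */

void compute_slice_single(node_t *head)   /* computation thread: no synchronization at all */
{
    for (node_t *p = head; p != NULL; p = p->next) {
        comp_cursor = p;                  /* publish progress; never block */
        consume(p->i_data);
    }
    comp_cursor = NULL;                   /* signal the end of the slice */
}

void push_slice_single(int k)             /* prefetching helper thread */
{
    for (;;) {
        node_t *p = comp_cursor;          /* obtain the computation thread's work pointer */
        if (p == NULL)
            break;                        /* slice finished */
        for (int i = 0; i < k && p != NULL; i++)
            p = p->next;                  /* re-apply the run-ahead distance of step 1 */
        long push_counter = 0;            /* the only counter in this variant */
        while (p != NULL && ++push_counter <= PUSHSIZE) {
            __builtin_prefetch(p);        /* push the node toward the shared cache */
            p = p->next;
        }
    }
}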
The single-counter block synchronization mechanism avoids the special case in which the double-counter mechanism blocks at a synchronization point, and it guarantees that the computation thread can keep running continuously. After testing in the concrete application environment and application program, the more suitable of the double-counter and single-counter mechanisms can be chosen.
Beneficial effect:
1. the present invention adopts selectable synchronization mechanism of low expense, and it comprises that the piece piece synchronous and single counter of double counters is synchronous, has effectively reduced the synchronization overhead that traditional precise synchronization mechanism is caused.All need synchronous traditional precise synchronization mode with looking ahead at every turn, incomparable low expense characteristics are arranged: select suitable Pushsize value, but the buffer memory that the excessive pushsize of active balance causes pollutes, but the also synchronization overhead brought of the too small pushsize of active balance has improved the execution performance of computational threads effectively.The piece synchronization mechanism of single counter can overcome the special circumstances of the piece synchronization mechanism of double counters at the synchronous points obstruction, and it has ensured the sustainable service ability of computational threads.These two kinds of mechanism can in concrete applied environment and application program after the test, be used according to qualifications.
2. introduced lead for the assisting thread of looking ahead, made in the work at present position of computational threads, no matter had or not enough evaluation works, can both allow the piece synchronization mechanism of low expense be carried out effectively.The lead building method can dynamically keep looking ahead the work pointer of assisting thread always in advance in the current pointer K of a computational threads amount of calculation, and it is the realization basis of the piece synchronization mechanism of the synchronous and single counter of the piece of double counters of low expense.Under the situation of less calculated amount, it has computational threads and the assisting thread of looking ahead still can carry out the characteristics that crossover calculates.
Description of drawings
Fig. 1 is the overall framework design flow chart of the present invention;
Embodiment
Following the technical scheme above, the present invention is described in detail below with an embodiment.
Take the following simple program as an example. An ADDSCALE variable is added in the header file ldsHeader.h; changing its value controls the computation amount for each node in the linked list. The computation over the linked-list nodes is:
while (iterator) {                    /* walk the linked list */
    temp = iterator->i_data;          /* irregular access: the current prefetch position */
    while (i++ < ADDSCALE) {          /* adjustable per-node computation */
        temp += 1;
    }
    res += temp;
    i = 0;
    iterator = iterator->next;        /* pointer chasing to the next node */
}
The computation amount is adjusted by repeatedly changing the value of ADDSCALE: starting from ADDSCALE = 0 and increasing by 5 each time, ADDSCALE takes the values 0, 5, 10, 15, 20, and so on.
In connection with the example above, the terms are defined as follows:
Definition 1: Current prefetch position
In the code of a computation thread, the address of the irregular data that currently needs to be prefetched is called the current prefetch position;
The current prefetch position is given by the following code:
temp = iterator->i_data;
Definition 2: Computation amount of the current prefetch position
In the code of a computation thread, the execution time of the code between the current prefetch position and the next prefetch position is called the computation amount of the current prefetch position. If this time is 0, it is the case of no computation; if this time is very small, e.g. less than a few tens of clock cycles, it is the case of little computation;
The computation amount of the current prefetch position is given by the following code:
while (i++ < ADDSCALE) {
    temp += 1;
}
res += temp;
i = 0;
Definition 3: Computation slice of a computation thread
In a computation thread, a code region containing a large number of irregular data misses is called a computation slice of the computation thread;
The computation slice of the computation thread at the current prefetch position is given by the following code:
while (iterator) {
    temp = iterator->i_data;
    while (i++ < ADDSCALE) {
        temp += 1;
    }
    res += temp;
    i = 0;
    iterator = iterator->next;
}
Definition 4: Shared-cache miss data stream
For a computation slice of a computation thread that causes a large number of consecutive data misses in the shared cache, let miss1, miss2, ..., missN denote the addresses of the missed data; the data access sequence corresponding to the address stream from miss1 to missN is called the shared-cache miss data stream.
In the program above, the data pointed to by iterator, iterator->next, and so on, form the shared-cache miss data stream.
Definition 5: Historical address information
In the computation thread or the prefetching helper thread, the addresses of certain jump cursors must be remembered so that the associated pointers can jump ahead efficiently; these retained addresses are called historical address information. For the iterator linked list in this example, with a list length of 10000 and k = 20, 500 pointers need to be saved, namely the head pointer, the 21st pointer, the 41st pointer, and so on.
Step 1: Construct the run-ahead distance of the prefetching helper thread
Step (1): In the computation thread, estimate the computation amount of the current prefetch position. If ADDSCALE is 0, 5, or 10, this is the case of no or little computation, go to step (2); otherwise, if ADDSCALE is 15 or 20, go to step (4);
Step (2): To guarantee that the helper thread leads the computation thread by a time Δt, find, ahead of the current prefetch position of the computation thread, the position Δt of code computation away, and in the helper thread code adjust the current prefetch work pointer to be k positions ahead of the current work pointer of the computation thread;
For example, with Δt = 6000, MissPenalty = 300 clock cycles, and c0 = 0:
k = Δt / MissPenalty = 6000 / 300 = 20
Step (3): When the computation slice of this data push ends, go to step (4); when synchronization is needed, go to step (2);
Step (4): End.
Step 2: The selectable low-overhead synchronization mechanism
A. Take a computation slice P1 of the main computation thread as an example, with push as its prefetching helper thread. The double-counter block synchronization mechanism is constructed as follows:
1. The steps of computation slice P1 of the main computation thread are as follows:
Step (1): Begin main;
Step (2): Reset the counter counter to 0; sem_post(&main); sem_wait(&push); the main computation thread and the push helper thread begin to cooperate;
Step (3): Read data, increment counter by 1, and compute;
Step (4): If computation slice P1 of main has ended, go to step (6);
Step (5): If the value of counter is greater than pushsize, go to step (2); otherwise go to step (3);
Step (6): End.
2. The steps of the push prefetching helper thread are as follows:
Step (1): Begin push;
Step (2): Reset the counter push_counter to 0; sem_post(&push); sem_wait(&main); the push thread begins to cooperate with the main thread;
Step (3): Right after a synchronization, push pushes prefetch data using its current work pointer adjusted by the run-ahead distance; otherwise push pushes prefetch data using its next work pointer; increment push_counter by 1;
Step (4): If the push computation has ended, go to step (6);
Step (5): If the value of push_counter is greater than pushsize, go to step (2); otherwise go to step (3);
Step (6): End.
B. Take a computation slice P2 of the main computation thread as an example, with push as its prefetching helper thread. The single-counter block synchronization mechanism is constructed as follows (a sketch of the sem_post/sem_wait rendezvous follows the steps):
1. The steps of computation slice P2 of the main computation thread are as follows:
Step (1): Begin main;
Step (2): sem_post(&main); sem_wait(&push); the main computation thread and the push helper thread begin to cooperate;
Step (3): Read data and compute;
Step (4): If computation slice P2 of main has ended, go to step (5); otherwise go to step (3);
Step (5): End.
2. The steps of the push prefetching helper thread are as follows:
Step (1): Begin push;
Step (2): sem_post(&push); sem_wait(&main); the push thread begins to cooperate with the main thread;
Step (3): Reset the counter push_counter to 0 and obtain the current work pointer of the main thread;
Step (4): Right after a synchronization, push pushes prefetch data using its current work pointer adjusted by the run-ahead distance; otherwise push pushes prefetch data using its next work pointer; increment push_counter by 1;
Step (5): If the push computation has ended, go to step (7);
Step (6): If the value of push_counter is greater than pushsize, go to step (3); otherwise go to step (4);
Step (7): End.
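For reference, the sem_post/sem_wait pairs above form a two-thread rendezvous; a minimal sketch of it in C (the initial values, and the _sem suffix added to avoid clashing with C's main, are assumptions):

#include <semaphore.h>

sem_t main_sem, push_sem;            /* the embodiment calls these main and push */

void init_rendezvous(void)
{
    sem_init(&main_sem, 0, 0);       /* both start at 0, so each thread must */
    sem_init(&push_sem, 0, 0);       /* wait for the other before proceeding */
}

/* main thread side:  sem_post(&main_sem); sem_wait(&push_sem);
   push thread side:  sem_post(&push_sem); sem_wait(&main_sem);
   Neither thread passes its pair until both threads have arrived. */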
The integrated test results for the example above are as follows:

ADDSCALE (scale variable)                0         5         10        15        20
k / pushsize                          20 / 600  20 / 600  20 / 600   0 / 600   0 / 600
Execution time without the
prefetching helper thread              120.175   121.627   135.234   153.057   171.058
Execution time with the prefetching
helper thread (the inventive method)    80.117    82.474    89.839   118.076   117.697

The test results show that, with the method of the invention, the program execution time is markedly shortened.

Claims (4)

1. A low-overhead block synchronization method supporting multi-core helper threads, characterized in that its basic idea is: on a multi-core architecture with a shared cache, and aiming at the problem of irregular data misses in multi-core applications, a run-ahead distance and a low-overhead block synchronization mechanism are introduced for the prefetching helper thread, so as to reduce data misses during the execution of the computation thread, reduce the synchronization overhead between the prefetching helper thread and the computation thread, and improve the execution performance of the computation thread; the implementation steps are as follows:
Step 1: Construct the run-ahead distance of the prefetching helper thread
When prefetching irregular data, historical address information is used to dynamically keep the prefetch work pointer of the helper thread k positions ahead of the current work pointer of the computation thread; in this way, whether the computation thread and the helper thread are at the start or at a synchronization point, the helper thread dynamically stays ahead of the computation thread's data accesses, and the data can be pushed into the shared cache in time before the computation thread accesses it;
Step 2: Select a low-overhead block synchronization mechanism
On the basis of the run-ahead distance constructed in step 1, step 2 selects a low-overhead block synchronization mechanism;
The low-overhead block synchronization mechanism comes in two variants, the more suitable of which can be selected according to test cases:
A. Double-counter block synchronization mechanism
The block synchronization mechanism divides the data access stream that causes shared-cache misses in the computation thread sequentially into blocks, and the size of each block is set according to the test results of the application example; synchronization operations occur only at block boundaries, lowering synchronization cost by relaxing synchronization precision; the run-ahead distance of the helper thread and its dynamic maintenance are provided by step 1; with pushsize denoting the block size, both the computation thread and the helper thread have a counter of their own; each time a thread accesses one item of the shared-cache miss data stream, it increments its counter by 1; when a counter reaches pushsize, the two threads must synchronize, and if the other thread has not yet reached the same progress, this thread blocks and waits until the other thread synchronizes with it;
B. Single-counter block synchronization mechanism
The computation thread is never blocked by synchronization and no extra synchronization operations are added to it; only the prefetching helper thread has a counter; when the counter reaches pushsize, the helper thread sets its push pointer equal to the computation thread's pointer, and the run-ahead distance of the helper thread and its dynamic maintenance are provided by step 1.
2. The low-overhead block synchronization method supporting multi-core helper threads according to claim 1, characterized in that the steps of constructing the run-ahead distance of the prefetching helper thread in step 1 are:
Step (1): In the computation thread, estimate the computation amount of the current prefetch position; in the case of no or little computation, go to step (2), otherwise go to step (4);
Step (2): To guarantee that the helper thread leads the computation thread by a time Δt, find, ahead of the current prefetch position of the computation thread, the position Δt of code computation away, and in the helper thread code adjust the current prefetch work pointer to be k positions ahead of the current work pointer of the computation thread (for an array subscript, the subscript value is increased by k; for a linked list, the prefetch work pointer is set to the k-th pointer after the current list pointer); here Δt and k are related by:
Δt = f(k) = k * MissPenalty + c0
where k is a positive integer, which can be determined from the estimated or measured value of Δt;
MissPenalty is the cost of one long-latency memory access;
c0 is a preset constant;
Step (3): When the computation slice of this data push ends, go to step (4); at a synchronization point, go to step (2);
Step (4): End.
3. The low-overhead block synchronization method supporting multi-core helper threads according to claim 1, characterized in that, in the low-overhead block synchronization mechanism selected in step 2, the double-counter block synchronization mechanism consists of the computation-slice steps of the computation thread and the steps of the prefetching helper thread:
1. The computation-slice steps of the computation thread are as follows:
Step (1): Begin;
Step (2): Reset the counter to 0; the computation thread begins to cooperate with the helper thread;
Step (3): Read data, increment the counter by 1, and compute;
Step (4): If the computation slice has ended, go to step (6);
Step (5): If the counter value is greater than pushsize, go to step (2); otherwise go to step (3);
Step (6): End;
2. The steps of the prefetching helper thread are as follows:
Step (1): Begin;
Step (2): Reset the counter to 0; the helper thread begins to cooperate with the computation thread;
Step (3): Right after a synchronization, push prefetch data using the current work pointer adjusted by the run-ahead distance; otherwise push prefetch data using the helper thread's next work pointer; increment the counter by 1;
Step (4): If the computation has ended, go to step (6);
Step (5): If the counter value is greater than pushsize, go to step (2); otherwise go to step (3);
Step (6): End.
4. The low-overhead block synchronization method supporting multi-core helper threads according to claim 1, characterized in that, in the low-overhead block synchronization mechanism selected in step 2, the single-counter block synchronization mechanism consists of the computation-slice steps of the computation thread and the steps of the prefetching helper thread:
1. The computation-slice steps of the computation thread are as follows:
Step (1): Begin;
Step (2): The computation thread begins to cooperate with the helper thread;
Step (3): Read data and compute;
Step (4): If the computation slice has ended, go to step (5); otherwise go to step (3);
Step (5): End;
2. The steps of the prefetching helper thread are as follows:
Step (1): Begin;
Step (2): The helper thread begins to cooperate with the computation thread;
Step (3): Reset the counter to 0 and obtain the current work pointer of the computation thread;
Step (4): Right after a synchronization, push prefetch data using the current work pointer adjusted by the run-ahead distance; otherwise push prefetch data using the helper thread's next work pointer; increment the counter by 1;
Step (5): If the computation has ended, go to step (7);
Step (6): If the counter value is greater than pushsize, go to step (3); otherwise go to step (4);
Step (7): End.
CN2009100856020A 2009-05-26 2009-05-26 Low-overhead block synchronization method supporting multi-core helper threads Expired - Fee Related CN101561766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100856020A CN101561766B (en) 2009-05-26 2009-05-26 Low-overhead block synchronization method supporting multi-core helper threads

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100856020A CN101561766B (en) 2009-05-26 2009-05-26 Low-overhead block synchronization method supporting multi-core helper threads

Publications (2)

Publication Number Publication Date
CN101561766A true CN101561766A (en) 2009-10-21
CN101561766B CN101561766B (en) 2011-06-15

Family

ID=41220578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100856020A Expired - Fee Related CN101561766B (en) 2009-05-26 2009-05-26 Low-overhead block synchronization method supporting multi-core helper threads

Country Status (1)

Country Link
CN (1) CN101561766B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807144A (en) * 2010-03-17 2010-08-18 上海大学 Prospective multi-threaded parallel execution optimization method
CN102334104A (en) * 2011-08-15 2012-01-25 华为技术有限公司 Synchronous processing method and device based on multicore system
CN104981787A (en) * 2013-03-05 2015-10-14 国际商业机器公司 Data prefetch for chip having parent core and scout core
CN105893319A (en) * 2014-12-12 2016-08-24 上海芯豪微电子有限公司 Multi-lane/multi-core system and method
CN106776047A (en) 2017-05-31 Group-wise thread prefetching method for irregular data-intensive applications
CN108874690A (en) * 2017-05-16 2018-11-23 龙芯中科技术有限公司 The implementation method and processor of data pre-fetching
CN114817087A (en) * 2022-05-12 2022-07-29 郑州轻工业大学 Prefetch distance self-adaptive adjusting method and device based on cache invalidation behavior

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807144B (en) * 2010-03-17 2014-05-14 上海大学 Prospective multi-threaded parallel execution optimization method
CN101807144A (en) * 2010-03-17 2010-08-18 上海大学 Prospective multi-threaded parallel execution optimization method
CN102334104A (en) * 2011-08-15 2012-01-25 华为技术有限公司 Synchronous processing method and device based on multicore system
CN102334104B (en) * 2011-08-15 2013-09-11 华为技术有限公司 Synchronous processing method and device based on multicore system
US9424101B2 (en) 2011-08-15 2016-08-23 Huawei Technologies Co., Ltd. Method and apparatus for synchronous processing based on multi-core system
CN104981787B (en) 2017-11-17 Data prefetch for a chip having a parent core and a scout core
CN104981787A (en) * 2013-03-05 2015-10-14 国际商业机器公司 Data prefetch for chip having parent core and scout core
CN105893319A (en) * 2014-12-12 2016-08-24 上海芯豪微电子有限公司 Multi-lane/multi-core system and method
CN106776047A (en) 2017-05-31 Group-wise thread prefetching method for irregular data-intensive applications
CN106776047B (en) 2019-08-02 Group-wise thread prefetching method for irregular data-intensive applications
CN108874690A (en) * 2017-05-16 2018-11-23 龙芯中科技术有限公司 The implementation method and processor of data pre-fetching
CN114817087A (en) * 2022-05-12 2022-07-29 郑州轻工业大学 Prefetch distance self-adaptive adjusting method and device based on cache invalidation behavior
CN114817087B (en) * 2022-05-12 2022-11-11 郑州轻工业大学 Prefetch distance self-adaptive adjustment method and device based on cache invalidation behavior

Also Published As

Publication number Publication date
CN101561766B (en) 2011-06-15

Similar Documents

Publication Publication Date Title
CN101561766B (en) Low-overhead block synchronization method supporting multi-core helper threads
US9477533B2 (en) Progress meters in parallel computing
Ebrahimi et al. Parallel application memory scheduling
Lucas et al. How a single chip causes massive power bills GPUSimPow: A GPGPU power simulator
Johnson et al. Decoupling contention management from scheduling
US9619290B2 (en) Hardware and runtime coordinated load balancing for parallel applications
Wu et al. Using performance-power modeling to improve energy efficiency of hpc applications
CN105159654B (en) Integrity measurement hashing algorithm optimization method based on multi-threaded parallel
Chen et al. Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs
CN104657219A (en) Application program thread count dynamic regulating method used under isomerous many-core system
Lee et al. Prefetching with helper threads for loosely coupled multiprocessor systems
CN101593132A (en) Multi-core parallel simulated annealing method based on thread constructing module
CN102662638A (en) Threshold boundary selecting method for supporting helper thread pre-fetching distance parameters
Xie et al. Adaptive preshuffling in Hadoop clusters
Breß et al. Self-Tuning Distribution of DB-Operations on Hybrid CPU/GPU Platforms.
Wang et al. Three-level performance optimization for heterogeneous systems based on software prefetching under power constraints
Wang et al. Energy optimization by software prefetching for task granularity in GPU-based embedded systems
Halimi et al. Forest-mn: Runtime DVFS beyond communication slack
CN105930209B (en) Adaptive quality control method for helper-thread prefetching
CN106776047B (en) Group-wise thread prefetching method for irregular data-intensive applications
Uddin et al. Signature-based high-level simulation of microthreaded many-core architectures
Fu et al. A parallel CNC system architecture based on symmetric multi-processor
Chen et al. Weak execution ordering-exploiting iterative methods on many-core gpus
Moeng et al. Weighted-tuple synchronization for parallel architecture simulators
Wang et al. Design and Analysis of a Minimum Time Buckets Synchronization Algorithm for Parallel and Distributed Simulation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110615

Termination date: 20120526