CN112527394A - Depth dependence problem parallel method based on instruction sequence and message sequence guidance - Google Patents


Info

Publication number
CN112527394A
Authority
CN
China
Prior art keywords
vector
data
sequence
sending
core
Prior art date
Legal status
Withdrawn
Application number
CN201910879931.6A
Other languages
Chinese (zh)
Inventor
陈鑫
陈德训
刘鑫
李芳
徐金秀
孙唯哲
郭恒
王臻
Current Assignee
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201910879931.6A
Publication of CN112527394A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838 Dependency mechanisms, e.g. register scoreboarding
    • G06F9/3867 Concurrent instruction execution using instruction pipelines
    • G06F9/3869 Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a depth dependence problem parallel method guided by instruction sequences and message sequences, comprising the following steps: S1, evenly partition the solution vector into blocks, converting the dependencies among elements of the solution vector into dependencies among vector blocks; S2, completing the calculation of one vector block is called an update operation, which requires three steps: S21, partial update: receive the data sent by the preceding blocks; S22, self-update: calculate and update the elements within the block; S23, update completion: send the solved elements of the vector block to the successor blocks that depend on them; S3, each computing core of the many-core processor calculates its corresponding vector blocks in turn, repeating the update operation of S2 so that the whole computation runs as a pipeline; S4, design a stream of instructions, i.e. an instruction sequence, for each computing core. The invention improves the cooperation of intra-many-core communication, reduces memory access time overhead, and achieves effective acceleration.

Description

Depth dependence problem parallel method based on instruction sequence and message sequence guidance
Technical Field
The invention belongs to the technical field of numerical calculation, and in particular relates to a depth dependence problem parallel method guided by instruction sequences and message sequences.
Background
The depth dependence problem is common in the CFD field, for example in the solution of sparse lower triangular systems, whose solutions exhibit deep dependency relationships: a later unknown depends on earlier unknowns having been solved, so the solution process is inherently serial. Even so, parallel parts can be discovered within this serial computation. For the deep dependence among solution-vector elements in unstructured grids, the prevailing approach is hierarchical (level-scheduled) parallelism: the solution vector x is partitioned into levels according to the dependencies among its elements, unknowns belonging to the same level are solved in parallel, the levels execute serially, and to guarantee the correctness of the computation order a synchronization is required after each level.
The hierarchical parallel algorithm has several main drawbacks. First, the dependency levels among solutions must be discovered, the solution vector x reordered by level, and the sparse matrix transformed accordingly before it can participate in the computation; after the result is obtained, it must be converted back to the original ordering of x. This amounts to two reorderings whose time overhead is unavoidable. Second, the algorithm requires each level to contain enough parallelism for the synchronization overhead to stay small: if almost every element xi of the vector depends on the preceding element xi-1 (which we call strong dependence; otherwise weak dependence), each level contains only a few elements, the algorithm performs very poorly, and effective parallelization is difficult. Third, the algorithm is deficient in data reuse: because of the deep dependence, it cannot cache the required data in advance and can only perform fine-grained discrete memory accesses. Suppose element xi of x depends on xj and the distance between i and j is large (i.e. i-j >> 1, which we call a deep dependency relationship; otherwise a shallow one); in this case the data cannot be prefetched, and only fine-grained discrete accesses are possible. Moreover, intra-many-core communication uses register communication, which does not support communication between cores in different rows and different columns. Owing to the random, discrete structure of unstructured-grid sparse matrices, many-to-many communication may occur inside the many-core array, so communication rings arise with high probability, and because the number of register buffers is limited, communication deadlock results.
Because of the disorder of its data, the unstructured grid problem leads to irregular storage of the non-zero elements of the sparse matrix. Practical tests show that data under unstructured grids often exhibit deep dependency relationships, for which the existing Sunway (Shenwei) platform cannot achieve effective acceleration.
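The level-scheduling scheme criticized above can be sketched in a few lines of Python (an illustrative reconstruction, not part of the patent; the function name is invented): each unknown is assigned a level one greater than the maximum level of its dependencies, unknowns sharing a level are mutually independent, and a chain-like strong dependence degenerates to one unknown per level.

```python
def levelize(deps, n):
    """Assign each unknown x_i a level; unknowns sharing a level are
    independent and can be solved in parallel, while levels run serially.
    deps maps i to the list of j < i that x_i depends on."""
    level = [0] * n
    for i in range(n):
        level[i] = 1 + max((level[j] for j in deps.get(i, [])), default=-1)
    return level

# Weak dependence: two wide levels, good parallelism
print(levelize({2: [0], 3: [1], 4: [0, 1]}, 5))        # [0, 0, 1, 1, 1]

# Strong chain dependence x_i -> x_{i-1}: one unknown per level
print(levelize({i: [i - 1] for i in range(1, 5)}, 5))  # [0, 1, 2, 3, 4]
```

The second case is exactly the strong-dependence scenario where each level holds a single element and the level-by-level synchronization cost dominates.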
Disclosure of Invention
The invention aims to provide a depth dependence problem parallel method guided by instruction sequences and message sequences, which improves the cooperation of intra-many-core communication, reduces memory access time overhead, and achieves effective acceleration.
To achieve this purpose, the invention adopts the following technical scheme: a depth dependence problem parallel method guided by instruction sequences and message sequences, oriented to unstructured grids, comprising the following steps:
S1, evenly partition the solution vector into blocks according to the computational load balance among the slave cores, number each vector block, and convert the dependencies among elements of the solution vector into dependencies among the vector blocks;
S2, completing the calculation of one vector block is called an update operation, and an update operation requires the following three steps:
S21, partial update: the current vector block receives, via register communication, the data sent by the preceding vector blocks; the initial vector block can be solved directly;
S22, self-update: the current vector block uses the data transmitted from the preceding vector blocks to solve its own elements, i.e. the part of the solution vector corresponding to this block;
S23, update completion: the current vector block sends its solved elements to the successor vector blocks that depend on them;
S3, each computing core of the many-core processor calculates its corresponding vector blocks in turn, repeating the update operation of S2, so that the whole computation runs as a pipeline;
S4, create an instruction sequence for each computing core based on the time order, the instruction sequence being executed strictly in that order; the many-core communication model based on instruction sequences and message sequences is as follows:
S41, create the instruction sequence based on the time order, so that the operation instructions of the update operations execute in time order; the instruction sequence records the numbers of row sends and receives, column sends and receives, and forwards;
S42, create a message sequence based on the instruction sequence, so that the message processing corresponding to each operation instruction executes strictly in the order of the instruction sequence; the message sequence comprises the row send and receive messages, the column send and receive messages, and the forwarded messages;
S5, the vector blocks complete their calculation and update in turn; at any moment only one computing core is in the data-sending state, while the other computing cores are in the data-receiving or computing state;
S6, for the sending core, i.e. the computing core in the data-sending state, data are sent with the following priority:
S61, send the data destined for vector blocks in the same row, in increasing order of vector block number;
S62, send the data destined for vector blocks in different rows and different columns, in increasing order of vector block number;
S63, send the data destined for vector blocks in the same column, in increasing order of vector block number;
S64, process the vector block data the sending core itself must handle, i.e. compute the partial solution of the current vector block from the received data of the preceding vector blocks;
S7, for the receiving core, i.e. the computing core in the data-receiving state, data are received with the following priority:
S71, receive the data it must process itself, i.e. the data from the preceding vector blocks used to solve the current vector block;
S72, forward the data that does not belong to its own processing, i.e. the data not needed for solving the current vector block;
S8, for the computing core, i.e. the computing core in the computing state, data are processed in the following order:
S81, process the data needed immediately, i.e. the data whose row indices in the preceding vector blocks correspond to non-zero column indices of the current vector block;
S82, buffer the data not needed for the moment into a buffer area in the slave core's local store.
Further improvements of the above technical scheme are as follows:
1. In the above scheme, the partial updates in S2 are performed concurrently, while the self-updates are performed serially.
2. In the above scheme, the data in step S21 include the solved values of the preceding vector blocks and the corresponding position information.
3. In the above scheme, a successor vector block can begin its calculation only after all the vector blocks it depends on have finished their calculation and the elements required for its calculation have been received.
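Step S1's conversion of element-level dependencies into block-level dependencies can be illustrated with a short Python sketch (illustrative only; the names and the dependency encoding are assumptions, not from the patent):

```python
def block_dependencies(elem_deps, block_size):
    """Lift element dependencies (i depends on the j's in elem_deps[i])
    to dependencies between equally sized vector blocks (step S1).
    Dependencies that stay inside a single block disappear: they are
    handled by that block's serial self-update."""
    bdeps = {}
    for i, js in elem_deps.items():
        bi = i // block_size
        for j in js:
            bj = j // block_size
            if bj != bi:
                bdeps.setdefault(bi, set()).add(bj)
    return bdeps

# Elements 0..7 in blocks of 2: block 1 needs block 0, block 2 needs
# block 1, block 3 needs block 0; the 5 -> 4 edge is internal to block 2.
print(block_dependencies({3: [0], 5: [2, 4], 7: [1]}, 2))
```

Coarsening the dependency graph this way is what makes the pipelined block-by-block schedule of the following steps possible.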
Owing to the application of the above technical scheme, the invention has the following advantages over the prior art:
For the depth dependence problems common in scientific computing (such as sparse lower triangular solves), this method abandons the traditional hierarchical-parallel approach, reduces the data preprocessing of the hierarchical parallel algorithm, realizes locally parallel pipelined operation based on instruction sequences and message sequences, improves the cooperation of intra-many-core communication, avoids the communication deadlock caused by intra-many-core register communication, saves a large amount of extra communication and synchronization overhead, and greatly improves the efficiency of intra-many-core communication and of the optimization.
Drawings
FIG. 1 is a flow chart of the pipeline calculation of the present invention;
FIG. 2 is a diagram of the structure of the instruction sequence and message sequence of the present invention;
FIG. 3 is a flow chart of the depth dependence problem parallel method of the present invention.
Detailed Description
The invention is further described below with reference to an embodiment:
Embodiment: a depth dependence problem parallel method guided by instruction sequences and message sequences, oriented to unstructured grids, comprising the following steps:
S1, evenly partition the solution vector into blocks according to the computational load balance among the slave cores, number each vector block, and convert the dependencies among elements of the solution vector into dependencies among the vector blocks;
S2, completing the calculation of one vector block is called an update operation, and an update operation requires the following three steps:
S21, partial update: the current vector block receives, via register communication, the data sent by the preceding vector blocks; the initial vector block can be solved directly;
S22, self-update: the current vector block uses the data transmitted from the preceding vector blocks to solve its own elements, i.e. the part of the solution vector corresponding to this block;
S23, update completion: the current vector block sends its solved elements to the successor vector blocks that depend on them;
S3, each computing core of the many-core processor calculates its corresponding vector blocks in turn, repeating the update operation of S2, so that the whole computation runs as a pipeline;
S4, create an instruction sequence for each computing core based on the time order, the instruction sequence being executed strictly in that order; the many-core communication model based on instruction sequences and message sequences is as follows:
S41, create the instruction sequence based on the time order, so that the operation instructions of the update operations execute in time order; the instruction sequence records the numbers of row sends and receives, column sends and receives, and forwards;
S42, create a message sequence based on the instruction sequence, so that the message processing corresponding to each operation instruction executes strictly in the order of the instruction sequence; the message sequence comprises the row send and receive messages, the column send and receive messages, and the forwarded messages;
S5, the vector blocks complete their calculation and update in turn; at any moment only one computing core (slave core) is in the data-sending state, while the other computing cores are in the data-receiving or computing state;
S6, for the sending core, i.e. the computing core in the data-sending state, data are sent with the following priority:
S61, send the data destined for vector blocks in the same row, in increasing order of vector block number;
S62, send the data destined for vector blocks in different rows and different columns, in increasing order of vector block number;
S63, send the data destined for vector blocks in the same column, in increasing order of vector block number;
S64, process the vector block data the sending core itself must handle, i.e. compute the partial solution of the current vector block from the received data of the preceding vector blocks;
S7, for the receiving core, i.e. the computing core in the data-receiving state, data are received with the following priority:
S71, receive the data it must process itself, i.e. the data from the preceding vector blocks used to solve the current vector block;
S72, forward the data that does not belong to its own processing, i.e. the data not needed for solving the current vector block;
S8, for the computing core, i.e. the computing core in the computing state, data are processed in the following order:
S81, process the data needed immediately, i.e. the data whose row indices in the preceding vector blocks correspond to non-zero column indices of the current vector block;
S82, buffer the data not needed for the moment into a buffer area in the slave core's local store.
The partial updates in S2 above are performed concurrently, while the self-updates are performed serially.
The data in step S21 above include the solved values of the preceding vector blocks and the corresponding position information.
A successor vector block can begin its calculation only after all the vector blocks it depends on have finished their calculation and the elements required for its calculation have been received.
The above aspects of the invention are further explained as follows:
1. Locally parallel pipelined operation
To reduce the data preprocessing of the hierarchical parallel algorithm, the solution vector is evenly partitioned into blocks, and the dependencies among its elements are converted into dependencies among vector blocks: a successor vector block can begin its calculation only after the preceding vector blocks it depends on have finished their calculation and it has received the elements required. We call the completion of one vector block's calculation an update operation. As shown in fig. 1, one update operation is completed in three steps:
Step 1, partial update: receive the data sent by the preceding blocks;
Step 2, self-update: calculate and update the elements within the block;
Step 3, update completion: send the solved elements of the vector block to the successor blocks that depend on them.
Each computing core of the many-core processor calculates its corresponding vector blocks in turn, repeating steps 1-3, so that the whole computation runs as a pipeline. Note that the partial updates are performed concurrently while the self-updates are performed serially; the corresponding pseudo code is as follows:
Algorithm 1: description of the pipelined, locally parallel algorithm
……
for all pi, where 0 ≤ i < p do
    block_num = pi;                // initialization
    while (block_num < block_size)
        Download_block(block_num); // load the current block
        Recv_data();               // receive data from preceding blocks
        Self_update();             // self-update
        Send_data();               // send to successor blocks
        block_num += 64;           // process the next block
    endwhile
endfor
……
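For concreteness, the three-step update operation behind Algorithm 1 can be modeled as a serial Python sketch of a blocked lower triangular solve (an illustrative reconstruction under the assumption of dense storage; it models the data flow of partial update, self-update and update completion, not the many-core communication itself):

```python
def blocked_lower_solve(L, b, bs):
    """Solve L x = b (L lower triangular, given as lists of rows) block
    by block, mirroring the update operation: a partial update folds in
    data from the preceding blocks, then a serial self-update solves the
    block's own triangular part."""
    n = len(b)
    x = [0.0] * n
    for s in range(0, n, bs):
        e = min(s + bs, n)
        # Partial update: apply contributions of already solved blocks x[:s]
        rhs = [b[k] - sum(L[k][j] * x[j] for j in range(s)) for k in range(s, e)]
        # Self-update: solve within the block, serially (deep dependence)
        for k in range(s, e):
            x[k] = (rhs[k - s] - sum(L[k][j] * x[j] for j in range(s, k))) / L[k][k]
        # Update completion: x[s:e] is now available to successor blocks
    return x

L = [[2, 0, 0, 0],
     [1, 3, 0, 0],
     [0, 2, 4, 0],
     [5, 0, 1, 2]]
b = [2, 4, 10, 9]
print(blocked_lower_solve(L, b, 2))  # [1.0, 1.0, 2.0, 1.0]
```

In the many-core version each block's partial update runs concurrently on its own core as the needed data arrive, which is what turns the loop above into a pipeline.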
2. Implementation of the instruction sequence and message sequence
The key to the optimization algorithm is how to implement the communication among vector blocks, i.e. intra-many-core communication, efficiently. In the SW26010 many-core processor, intra-many-core communication uses registers; owing to the random, discrete structure of unstructured-grid sparse matrices, many-to-many communication may occur inside the many-core array, so communication rings arise with high probability, and because the number of register buffers is limited, communication deadlock results. During the calculation and update of each vector block, however, the structure of the sparse matrix does not change, only the values in it do, so the number and content of intra-many-core communications can be predicted, and communication deadlock is avoided by converting random communication into ordered communication.
To realize ordered communication, a stream of instructions, i.e. an instruction sequence, is designed for each computing core; the computing cores execute the related operations in strict order according to the given instruction stream, which avoids communication deadlock and saves extra communication and synchronization overhead.
The many-core communication model based on instruction sequences and message sequences is as follows:
1. Create an instruction sequence based on the time order, ensuring that the operation instructions execute strictly in that order.
2. Create a message sequence based on the instruction sequence, ensuring that the message processing corresponding to each instruction executes strictly in the instruction order.
Because of the pipelined operation, the vector blocks complete their calculation and update in turn, so at any moment only one computing core is in the data-sending state while the other computing cores are receiving or computing. For the sending core, data are sent with the following priority:
a) send the data destined for x-vector blocks in the same row, in increasing order of block number;
b) send the data destined for x-vector blocks in different rows and different columns, in increasing order of block number;
c) send the data destined for x-vector blocks in the same column, in increasing order of block number;
d) process the x-vector block data the sending core itself must handle.
For the receiving core, data are received with the following priority:
a) receive the data it must process itself;
b) forward the data that does not belong to its own processing.
For the computing core, data are processed in the following order:
a) process the data needed immediately;
b) buffer the data not needed for the moment.
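The sending-core priorities a) to d) amount to a two-level sort key; a minimal Python sketch (illustrative only, with an assumed (row, column, block number) encoding of the destination blocks):

```python
def send_order(targets, me_row, me_col):
    """Order outgoing messages per the sending-core priorities: same-row
    targets first, then different-row-and-different-column, then
    same-column; within each group, increasing block number."""
    def prio(t):
        row, col, num = t
        if row == me_row:
            group = 0          # a) same row
        elif col != me_col:
            group = 1          # b) different row and different column
        else:
            group = 2          # c) same column
        return (group, num)
    return sorted(targets, key=prio)

targets = [(1, 1, 7), (0, 2, 3), (2, 1, 5), (0, 1, 9), (1, 0, 2)]
print(send_order(targets, me_row=0, me_col=1))
```

Because every core derives the same total order from the same precomputed information, no two cores ever disagree about which message comes next.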
The structure of the instruction sequence and the message sequence is shown in fig. 2; the pseudo code is as follows:
Algorithm 2: description of the parallel optimization algorithm based on instruction sequence and message sequence
for all pi, where 0 ≤ i < p do
    Read_instruct();  // read the instruction sequence into the inst array
    Read_info();      // read, per the instructions, the message sequence into the info array
    Read_data();      // read the data to be processed, per the instructions
    for block ← 0 to block_size do
        flag = decide which block the pipeline has reached
        if flag == sending core:
            size = inst[i].send_row_size;
            for (i = 0; i < size; i++) {
                sendr(info[i].length, info[i].id); }   // send row messages and messages to be forwarded
            size += inst[i].send_col_size;
            for (; i < size; i++) {
                sendc(info[i].length, info[i].id); }   // send column messages
            size += inst[i].self_handle_size;
            for (; i < size; i++) {
                Self_handle(); }                       // process the data sent to itself
        if flag == same row as the sending core:
            size = inst[i].recv_row_size;
            for (i = 0; i < size; i++) {
                recvr(info[i].length); }               // receive row messages
            size += inst[i].send_col_size;
            for (; i < size; i++) {
                recvr(info[i].length);                 // receive a message to be forwarded
                sendc(info[i].length, info[i].id); }   // forward it as a column message
        else:
            size = inst[i].recv_col_size;
            for (i = 0; i < size; i++) {
                recvc(info[i].length); }               // receive column messages
        ALLSYN;                                        // hardware synchronization
    endfor
endfor
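The claim behind Algorithm 2, namely that replaying a precomputed instruction sequence turns random many-to-many communication into an ordered, deadlock-free exchange, can be checked with a tiny Python simulator (a conceptual illustration; real register communication on the SW26010 differs in detail, and all names are assumed):

```python
from collections import deque

def replay(instructions):
    """Replay a global sequence of ("send", channel) / ("recv", channel)
    operations and return the peak number of in-flight messages. Because
    every core follows the same precomputed order, a recv never waits on
    a message that a blocked sender still has to produce, and buffer
    occupancy stays bounded."""
    bufs = {}
    peak = 0
    for op, ch in instructions:
        q = bufs.setdefault(ch, deque())
        if op == "send":
            q.append(ch)
        else:
            assert q, f"recv on empty channel {ch}: inconsistent sequence"
            q.popleft()
        peak = max(peak, sum(len(b) for b in bufs.values()))
    return peak

seq = [("send", (0, 1)), ("recv", (0, 1)),   # row message, consumed at once
       ("send", (1, 2)), ("send", (1, 3)),   # two column messages in flight
       ("recv", (1, 2)), ("recv", (1, 3))]
print(replay(seq))
```

If the replay completes without an empty-channel assertion and the peak stays within the register-buffer capacity, the schedule is free of the communication rings described above.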
Through the above optimization, the communication deadlock problem is avoided and the efficiency of intra-many-core communication is greatly improved. Tests on actual applications show that, at grid scales of 100,000 to 1,000,000 cells, the method achieves an average speedup of more than 3x, and at most 4.68x, over the serial algorithm running on the master core.
With this depth dependence problem parallel method guided by instruction sequences and message sequences, for the depth dependence problems common in scientific computing (such as sparse lower triangular solves), the traditional hierarchical-parallel approach is abandoned, the data preprocessing of the hierarchical parallel algorithm is reduced, locally parallel pipelined operation is realized based on instruction sequences and message sequences, the cooperation of intra-many-core communication is improved, the communication deadlock caused by intra-many-core register communication is avoided, a large amount of extra communication and synchronization overhead is saved, and the efficiency of intra-many-core communication and of the optimization is greatly improved.
To facilitate a better understanding of the invention, the terms used herein are briefly explained as follows:
Depth dependence problem: in the solution of some unstructured grid problems (in English, depth dependency), a later solution depends on earlier solutions having been solved, for example the common sparse lower triangular system. Suppose element xi of the x vector depends on xj and the distance between i and j is large, i.e. i-j >> 1; we call this a deep dependency relationship, in which case the corresponding data can only be accessed with fine-grained discrete memory operations.
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (4)

1. A depth dependence problem parallel method guided by instruction sequences and message sequences, characterized in that it is oriented to unstructured grids and comprises the following steps:
s1, evenly dividing the solution vector according to blocks according to the load balance calculated among the secondary cores, determining the number of each vector block, and converting the dependency among elements in the solution vector into the dependency among the vector blocks;
s2, the completion of the calculation of a vector chunk is referred to as an update operation, and the update operation requires the following three steps:
s21, partial updating: the current vector block receives data sent by the previous vector block through register communication, wherein the initial vector block can be directly solved;
s22, self-updating: the current vector block uses the data transmitted by the previous vector block to solve the elements in the current vector block, namely the solution vector corresponding to the vector block;
s23, updating is completed: the current vector block sends the solved elements to the dependent subsequent vector block;
s3, sequentially calculating corresponding vector blocks by each calculation core in the many cores, and repeating the updating operation step of S2 to realize the calculation of the whole pipeline operation;
s4, creating an instruction sequence for each computing core based on the time sequence, wherein the instruction sequence is strictly executed according to the time sequence, and the many-core communication model based on the instruction sequence and the message sequence is as follows:
s41, creating an instruction sequence based on the time sequence to enable the operation instruction of the updating operation to be executed according to the time sequence, wherein the instruction sequence comprises row sending and receiving times, column sending and receiving times and forwarding times;
s42, creating a message sequence based on the instruction sequence, and strictly executing the message processing corresponding to the operation instruction according to the order of the instruction sequence, wherein the message sequence comprises row sending and receiving messages, column sending and receiving messages and forwarding messages;
S5, the vector blocks complete their computation and updates in turn; at any given moment only one computing core is in the data-sending state, while the other computing cores are in the data-receiving state or the computing state;
S6, for the sending core, i.e., the computing core in the data-sending state, data are sent with the following priority:
S61, first send the data destined for vector blocks in the same row as the current vector block, in order of increasing vector block number;
S62, then send the data destined for vector blocks in a different row and a different column, in order of increasing vector block number;
S63, then send the data destined for vector blocks in the same column as the current vector block, in order of increasing vector block number;
S64, finally process the vector block data that the sending core itself must handle, i.e., compute the partial solution of the current vector block from the received data of the preceding vector blocks;
S7, for the receiving core, i.e., the computing core in the data-receiving state, data are received with the following priority:
S71, first receive the data that the core itself must process, i.e., the data from the preceding vector blocks used to compute the solution of the current vector block;
S72, then forward the data that does not belong to the core's own processing, i.e., data not needed for solving the current vector block;
S8, for the computing core, i.e., the computing core in the computing state, data are processed in the following order:
S81, first process the immediately needed data, i.e., data whose row number in the preceding vector block corresponds to a non-zero column number of the current vector block;
S82, then buffer the temporarily unneeded data in a buffer area, i.e., a buffer on the slave (computing) core.
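As a rough illustration of the pipelined update in steps S2 and S3 (a serial simulation, not the patented many-core implementation), the depth-dependent problem can be taken to be a blocked forward substitution on a lower-triangular system L x = b: each vector block first applies the contributions of already-solved predecessor blocks (the partial update), then solves its own elements (the self-update), and finally makes them available to successor blocks. All names below are hypothetical.

```python
# Illustrative sketch only: serial simulation of the block-pipelined
# update of steps S2-S3 for a lower-triangular solve L x = b,
# partitioned into row blocks of size block_size.

def blocked_forward_solve(L, b, block_size):
    n = len(b)
    x = [0.0] * n
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        for i in range(start, end):
            # S21 (partial update): apply data "sent" by preceding blocks
            s = sum(L[i][j] * x[j] for j in range(start))
            # S22 (self-update): finish with elements solved in this block
            s += sum(L[i][j] * x[j] for j in range(start, i))
            x[i] = (b[i] - s) / L[i][i]
        # S23: the many-core version would now send x[start:end] to the
        # dependent successor blocks; here they simply remain in x.
    return x

L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 0.0],
     [0.5, 1.0, 2.0, 0.0],
     [1.0, 0.0, 1.0, 4.0]]
b = [2.0, 5.0, 5.0, 9.0]
x = blocked_forward_solve(L, b, 2)
```

In the pipelined version of step S3, each block's inner loop would run on a different computing core, starting as soon as the predecessor data arrives.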
2. The depth dependence problem parallel method based on instruction sequence and message sequence guidance according to claim 1, wherein the partial update in S2 is performed concurrently and the self-update is performed serially.
3. The depth dependence problem parallel method based on instruction sequence and message sequence guidance according to claim 1, wherein the data in step S21 comprise the solved values of the preceding vector blocks and the corresponding position information.
4. The depth dependence problem parallel method based on instruction sequence and message sequence guidance according to claim 1, wherein a vector block cannot begin its computation until all the vector blocks preceding it have completed their computation and the elements required for its own computation have been received.
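The receive-side priorities of steps S7 and S8 can be sketched as a message-dispatch loop: data addressed to another core is forwarded, immediately needed data is consumed, and temporarily unneeded data is parked in a buffer. This is an illustrative sketch only; the message fields ("dest", "row", "value") and all names are hypothetical, not the patented implementation.

```python
# Illustrative sketch of S71/S72 (receive vs. forward) and S81/S82
# (process now vs. buffer) on a single receiving/computing core.
from collections import deque

def handle_messages(my_block, needed_rows, inbox, forward):
    """Dispatch incoming messages by the priorities of S7-S8:
    forward data not addressed to this core (S72), process data whose
    row matches a non-zero column of the current block (S81), and
    buffer the rest for later use (S82)."""
    buffered = deque()
    consumed = []
    for msg in inbox:
        if msg["dest"] != my_block:
            forward(msg)            # S72: not ours -- forward it onward
        elif msg["row"] in needed_rows:
            consumed.append(msg)    # S81: immediately needed data
        else:
            buffered.append(msg)    # S82: park in the slave-core buffer
    return consumed, buffered

inbox = [
    {"dest": 3, "row": 0, "value": 1.0},   # needed by this core now
    {"dest": 2, "row": 5, "value": 2.0},   # belongs to another core
    {"dest": 3, "row": 7, "value": 3.0},   # ours, but not needed yet
]
forwarded = []
consumed, buffered = handle_messages(3, {0, 1}, inbox, forwarded.append)
```

On real many-core hardware the forwarding and buffering would use on-chip register communication and the slave core's local store rather than Python lists.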
CN201910879931.6A 2019-09-18 2019-09-18 Depth dependence problem parallel method based on instruction sequence and message sequence guidance Withdrawn CN112527394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910879931.6A CN112527394A (en) 2019-09-18 2019-09-18 Depth dependence problem parallel method based on instruction sequence and message sequence guidance

Publications (1)

Publication Number Publication Date
CN112527394A true CN112527394A (en) 2021-03-19

Family

ID=74974946


Similar Documents

Publication Publication Date Title
CN108509270B (en) High-performance parallel implementation method of K-means algorithm on domestic Shenwei 26010 many-core processor
TW202014939A (en) Modifying machine learning models to improve locality
US20190138922A1 (en) Apparatus and methods for forward propagation in neural networks supporting discrete data
CN112862088A (en) Distributed deep learning method based on pipeline annular parameter communication
CN103699442B (en) Under MapReduce Computational frames can iterative data processing method
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
CN112231630B (en) Sparse matrix solving method based on FPGA parallel acceleration
CN104052495A (en) Low density parity check code hierarchical decoding architecture for reducing hardware buffer
CN117435855B (en) Method for performing convolution operation, electronic device, and storage medium
CN105022631A (en) Scientific calculation-orientated floating-point data parallel lossless compression method
CN110086602A (en) The Fast implementation of SM3 cryptographic Hash algorithms based on GPU
CN110620587A (en) Polarization code BP decoding unit based on different data type transmission
CN114492753A (en) Sparse accelerator applied to on-chip training
CN117827463A (en) Method, apparatus and storage medium for performing attention calculations
CN110135067B (en) Helicopter flow field overlapping mixed grid parallel method under double time step method
CN112527394A (en) Depth dependence problem parallel method based on instruction sequence and message sequence guidance
CN110008436B (en) Fast Fourier transform method, system and storage medium based on data stream architecture
CN111832144B (en) Full-amplitude quantum computing simulation method
CN112446004B (en) Non-structural grid DILU preconditioned sub-many-core parallel optimization method
CA3187339A1 (en) Reducing resources in quantum circuits
CN111723246B (en) Data processing method, device and storage medium
CN107368287B (en) Acceleration system, acceleration device and acceleration method for cyclic dependence of data stream structure
JP6961950B2 (en) Storage method, storage device and storage program
KR20220100030A (en) Pattern-Based Cache Block Compression
US20190073584A1 (en) Apparatus and methods for forward propagation in neural networks supporting discrete data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210319