CN112527394A - Depth dependence problem parallel method based on instruction sequence and message sequence guidance - Google Patents


Info

Publication number
CN112527394A
Authority
CN
China
Prior art keywords
vector
data
sequence
sending
core
Prior art date
Legal status
Withdrawn
Application number
CN201910879931.6A
Other languages
Chinese (zh)
Inventor
陈鑫
陈德训
刘鑫
李芳
徐金秀
孙唯哲
郭恒
王臻
Current Assignee
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201910879931.6A
Publication of CN112527394A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838 Dependency mechanisms, e.g. register scoreboarding
    • G06F9/3867 Concurrent instruction execution using instruction pipelines
    • G06F9/3869 Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a depth dependence problem parallel method guided by instruction sequences and message sequences, comprising the following steps: S1, evenly partition the solution vector into blocks, converting the dependencies among elements of the solution vector into dependencies among vector blocks; S2, completing the calculation of one vector block is called an update operation, which requires three steps: S21, partial update: receive the data sent by the preceding blocks; S22, self-update: calculate and update the elements within the block; S23, update completion: send the solved elements of the vector block to the successor blocks that depend on them; S3, each computing core of the many-core processor calculates its corresponding vector blocks in turn, repeating the update operation of S2 so that the whole computation runs as a pipeline; S4, design a stream of instructions, i.e. an instruction sequence, for each computing core. The invention improves the cooperation of intra-many-core communication, reduces memory access time overhead, and achieves effective acceleration.

Description

Depth dependence problem parallel method based on instruction sequence and message sequence guidance
Technical Field
The invention belongs to the technical field of numerical calculation, and in particular relates to a depth dependence problem parallel method guided by instruction sequences and message sequences.
Background
The depth dependence problem is common in the CFD field, for example in the solution of sparse lower triangular systems, whose solutions exhibit deep dependency relationships: a later unknown depends on earlier unknowns having been solved, so the solution process is inherently serial. Even so, parallel parts can be discovered within this serial computation. For the deep dependence among solution-vector elements in unstructured grids, the prevailing approach is hierarchical (level-scheduled) parallelism: the solution vector x is partitioned into levels according to the dependencies among its elements, unknowns belonging to the same level are solved in parallel, the levels execute serially, and to guarantee the correctness of the computation order a synchronization is required after each level.
The hierarchical parallel algorithm has several main drawbacks. First, the dependency levels among solutions must be discovered, the solution vector x reordered by level, and the sparse matrix transformed accordingly before it can participate in the computation; after the result is obtained, it must be converted back to the original ordering of x. This amounts to two reorderings whose time overhead is unavoidable. Second, the algorithm requires each level to contain enough parallelism for the synchronization overhead to stay small: if almost every element xi of the vector depends on the preceding element xi-1 (which we call strong dependence; otherwise weak dependence), each level contains only a few elements, the algorithm performs very poorly, and effective parallelization is difficult. Third, the algorithm is deficient in data reuse: because of the deep dependence, it cannot cache the required data in advance and can only perform fine-grained discrete memory accesses. Suppose element xi of x depends on xj and the distance between i and j is large (i.e. i-j >> 1, which we call a deep dependency relationship; otherwise a shallow one); in this case the data cannot be prefetched, and only fine-grained discrete accesses are possible. Moreover, intra-many-core communication uses register communication, which does not support communication between cores in different rows and different columns. Owing to the random, discrete structure of unstructured-grid sparse matrices, many-to-many communication may occur inside the many-core array, so communication rings arise with high probability, and because the number of register buffers is limited, communication deadlock results.
Because of the disorder of its data, the unstructured grid problem leads to irregular storage of the non-zero elements of the sparse matrix. Practical tests show that data under unstructured grids often exhibit deep dependency relationships, for which the existing Sunway (Shenwei) platform cannot achieve effective acceleration.
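The level-scheduling scheme criticized above can be sketched in a few lines of Python (an illustrative reconstruction, not part of the patent; the function name is invented): each unknown is assigned a level one greater than the maximum level of its dependencies, unknowns sharing a level are mutually independent, and a chain-like strong dependence degenerates to one unknown per level.

```python
def levelize(deps, n):
    """Assign each unknown x_i a level; unknowns sharing a level are
    independent and can be solved in parallel, while levels run serially.
    deps maps i to the list of j < i that x_i depends on."""
    level = [0] * n
    for i in range(n):
        level[i] = 1 + max((level[j] for j in deps.get(i, [])), default=-1)
    return level

# Weak dependence: two wide levels, good parallelism
print(levelize({2: [0], 3: [1], 4: [0, 1]}, 5))        # [0, 0, 1, 1, 1]

# Strong chain dependence x_i -> x_{i-1}: one unknown per level
print(levelize({i: [i - 1] for i in range(1, 5)}, 5))  # [0, 1, 2, 3, 4]
```

The second case is exactly the strong-dependence scenario where each level holds a single element and the level-by-level synchronization cost dominates.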
Disclosure of Invention
The invention aims to provide a depth dependence problem parallel method guided by instruction sequences and message sequences, which improves the cooperation of intra-many-core communication, reduces memory access time overhead, and achieves effective acceleration.
To achieve this purpose, the invention adopts the following technical scheme: a depth dependence problem parallel method guided by instruction sequences and message sequences, oriented to unstructured grids, comprising the following steps:
S1, evenly partition the solution vector into blocks according to the computational load balance among the slave cores, number each vector block, and convert the dependencies among elements of the solution vector into dependencies among the vector blocks;
S2, completing the calculation of one vector block is called an update operation, and an update operation requires the following three steps:
S21, partial update: the current vector block receives, via register communication, the data sent by the preceding vector blocks; the initial vector block can be solved directly;
S22, self-update: the current vector block uses the data transmitted from the preceding vector blocks to solve its own elements, i.e. the part of the solution vector corresponding to this block;
S23, update completion: the current vector block sends its solved elements to the successor vector blocks that depend on them;
S3, each computing core of the many-core processor calculates its corresponding vector blocks in turn, repeating the update operation of S2, so that the whole computation runs as a pipeline;
S4, create an instruction sequence for each computing core based on the time order, the instruction sequence being executed strictly in that order; the many-core communication model based on instruction sequences and message sequences is as follows:
S41, create the instruction sequence based on the time order, so that the operation instructions of the update operations execute in time order; the instruction sequence records the numbers of row sends and receives, column sends and receives, and forwards;
S42, create a message sequence based on the instruction sequence, so that the message processing corresponding to each operation instruction executes strictly in the order of the instruction sequence; the message sequence comprises the row send and receive messages, the column send and receive messages, and the forwarded messages;
S5, the vector blocks complete their calculation and update in turn; at any moment only one computing core is in the data-sending state, while the other computing cores are in the data-receiving or computing state;
S6, for the sending core, i.e. the computing core in the data-sending state, data are sent with the following priority:
S61, send the data destined for vector blocks in the same row, in increasing order of vector block number;
S62, send the data destined for vector blocks in different rows and different columns, in increasing order of vector block number;
S63, send the data destined for vector blocks in the same column, in increasing order of vector block number;
S64, process the vector block data the sending core itself must handle, i.e. compute the partial solution of the current vector block from the received data of the preceding vector blocks;
S7, for the receiving core, i.e. the computing core in the data-receiving state, data are received with the following priority:
S71, receive the data it must process itself, i.e. the data from the preceding vector blocks used to solve the current vector block;
S72, forward the data that does not belong to its own processing, i.e. the data not needed for solving the current vector block;
S8, for the computing core, i.e. the computing core in the computing state, data are processed in the following order:
S81, process the data needed immediately, i.e. the data whose row indices in the preceding vector blocks correspond to non-zero column indices of the current vector block;
S82, buffer the data not needed for the moment into a buffer area in the slave core's local store.
Further improvements of the above technical scheme are as follows:
1. In the above scheme, the partial updates in S2 are performed concurrently, while the self-updates are performed serially.
2. In the above scheme, the data in step S21 include the solved values of the preceding vector blocks and the corresponding position information.
3. In the above scheme, a successor vector block can begin its calculation only after all the vector blocks it depends on have finished their calculation and the elements required for its calculation have been received.
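Step S1's conversion of element-level dependencies into block-level dependencies can be illustrated with a short Python sketch (illustrative only; the names and the dependency encoding are assumptions, not from the patent):

```python
def block_dependencies(elem_deps, block_size):
    """Lift element dependencies (i depends on the j's in elem_deps[i])
    to dependencies between equally sized vector blocks (step S1).
    Dependencies that stay inside a single block disappear: they are
    handled by that block's serial self-update."""
    bdeps = {}
    for i, js in elem_deps.items():
        bi = i // block_size
        for j in js:
            bj = j // block_size
            if bj != bi:
                bdeps.setdefault(bi, set()).add(bj)
    return bdeps

# Elements 0..7 in blocks of 2: block 1 needs block 0, block 2 needs
# block 1, block 3 needs block 0; the 5 -> 4 edge is internal to block 2.
print(block_dependencies({3: [0], 5: [2, 4], 7: [1]}, 2))
```

Coarsening the dependency graph this way is what makes the pipelined block-by-block schedule of the following steps possible.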
Owing to the application of the above technical scheme, the invention has the following advantages over the prior art:
For the depth dependence problems common in scientific computing (such as sparse lower triangular solves), this method abandons the traditional hierarchical-parallel approach, reduces the data preprocessing of the hierarchical parallel algorithm, realizes locally parallel pipelined operation based on instruction sequences and message sequences, improves the cooperation of intra-many-core communication, avoids the communication deadlock caused by intra-many-core register communication, saves a large amount of extra communication and synchronization overhead, and greatly improves the efficiency of intra-many-core communication and of the optimization.
Drawings
FIG. 1 is a flow chart of the pipeline calculation of the present invention;
FIG. 2 is a diagram of the structure of the instruction sequence and message sequence of the present invention;
FIG. 3 is a flow chart of the depth dependence problem parallel method of the present invention.
Detailed Description
The invention is further described below with reference to an embodiment:
Embodiment: a depth dependence problem parallel method guided by instruction sequences and message sequences, oriented to unstructured grids, comprising the following steps:
S1, evenly partition the solution vector into blocks according to the computational load balance among the slave cores, number each vector block, and convert the dependencies among elements of the solution vector into dependencies among the vector blocks;
S2, completing the calculation of one vector block is called an update operation, and an update operation requires the following three steps:
S21, partial update: the current vector block receives, via register communication, the data sent by the preceding vector blocks; the initial vector block can be solved directly;
S22, self-update: the current vector block uses the data transmitted from the preceding vector blocks to solve its own elements, i.e. the part of the solution vector corresponding to this block;
S23, update completion: the current vector block sends its solved elements to the successor vector blocks that depend on them;
S3, each computing core of the many-core processor calculates its corresponding vector blocks in turn, repeating the update operation of S2, so that the whole computation runs as a pipeline;
S4, create an instruction sequence for each computing core based on the time order, the instruction sequence being executed strictly in that order; the many-core communication model based on instruction sequences and message sequences is as follows:
S41, create the instruction sequence based on the time order, so that the operation instructions of the update operations execute in time order; the instruction sequence records the numbers of row sends and receives, column sends and receives, and forwards;
S42, create a message sequence based on the instruction sequence, so that the message processing corresponding to each operation instruction executes strictly in the order of the instruction sequence; the message sequence comprises the row send and receive messages, the column send and receive messages, and the forwarded messages;
S5, the vector blocks complete their calculation and update in turn; at any moment only one computing core (slave core) is in the data-sending state, while the other computing cores are in the data-receiving or computing state;
S6, for the sending core, i.e. the computing core in the data-sending state, data are sent with the following priority:
S61, send the data destined for vector blocks in the same row, in increasing order of vector block number;
S62, send the data destined for vector blocks in different rows and different columns, in increasing order of vector block number;
S63, send the data destined for vector blocks in the same column, in increasing order of vector block number;
S64, process the vector block data the sending core itself must handle, i.e. compute the partial solution of the current vector block from the received data of the preceding vector blocks;
S7, for the receiving core, i.e. the computing core in the data-receiving state, data are received with the following priority:
S71, receive the data it must process itself, i.e. the data from the preceding vector blocks used to solve the current vector block;
S72, forward the data that does not belong to its own processing, i.e. the data not needed for solving the current vector block;
S8, for the computing core, i.e. the computing core in the computing state, data are processed in the following order:
S81, process the data needed immediately, i.e. the data whose row indices in the preceding vector blocks correspond to non-zero column indices of the current vector block;
S82, buffer the data not needed for the moment into a buffer area in the slave core's local store.
The partial updates in S2 above are performed concurrently, while the self-updates are performed serially.
The data in step S21 above include the solved values of the preceding vector blocks and the corresponding position information.
A successor vector block can begin its calculation only after all the vector blocks it depends on have finished their calculation and the elements required for its calculation have been received.
The above aspects of the invention are further explained as follows:
1. Locally parallel pipelined operation
To reduce the data preprocessing of the hierarchical parallel algorithm, the solution vector is evenly partitioned into blocks, and the dependencies among its elements are converted into dependencies among vector blocks: a successor vector block can begin its calculation only after the preceding vector blocks it depends on have finished their calculation and it has received the elements required. We call the completion of one vector block's calculation an update operation. As shown in fig. 1, one update operation is completed in three steps:
Step 1, partial update: receive the data sent by the preceding blocks;
Step 2, self-update: calculate and update the elements within the block;
Step 3, update completion: send the solved elements of the vector block to the successor blocks that depend on them.
Each computing core of the many-core processor calculates its corresponding vector blocks in turn, repeating steps 1-3, so that the whole computation runs as a pipeline. Note that the partial updates are performed concurrently while the self-updates are performed serially; the corresponding pseudo code is as follows:
Algorithm 1: description of the pipelined, locally parallel algorithm
……
for all pi, where 0 ≤ i < p do
    block_num = pi;                // initialization
    while (block_num < block_size)
        Download_block(block_num); // load the current block
        Recv_data();               // receive data from preceding blocks
        Self_update();             // self-update
        Send_data();               // send to successor blocks
        block_num += 64;           // process the next block
    endwhile
endfor
……
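For concreteness, the three-step update operation behind Algorithm 1 can be modeled as a serial Python sketch of a blocked lower triangular solve (an illustrative reconstruction under the assumption of dense storage; it models the data flow of partial update, self-update and update completion, not the many-core communication itself):

```python
def blocked_lower_solve(L, b, bs):
    """Solve L x = b (L lower triangular, given as lists of rows) block
    by block, mirroring the update operation: a partial update folds in
    data from the preceding blocks, then a serial self-update solves the
    block's own triangular part."""
    n = len(b)
    x = [0.0] * n
    for s in range(0, n, bs):
        e = min(s + bs, n)
        # Partial update: apply contributions of already solved blocks x[:s]
        rhs = [b[k] - sum(L[k][j] * x[j] for j in range(s)) for k in range(s, e)]
        # Self-update: solve within the block, serially (deep dependence)
        for k in range(s, e):
            x[k] = (rhs[k - s] - sum(L[k][j] * x[j] for j in range(s, k))) / L[k][k]
        # Update completion: x[s:e] is now available to successor blocks
    return x

L = [[2, 0, 0, 0],
     [1, 3, 0, 0],
     [0, 2, 4, 0],
     [5, 0, 1, 2]]
b = [2, 4, 10, 9]
print(blocked_lower_solve(L, b, 2))  # [1.0, 1.0, 2.0, 1.0]
```

In the many-core version each block's partial update runs concurrently on its own core as the needed data arrive, which is what turns the loop above into a pipeline.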
2. Implementation of the instruction sequence and message sequence
The key to the optimization algorithm is how to implement the communication among vector blocks, i.e. intra-many-core communication, efficiently. In the SW26010 many-core processor, intra-many-core communication uses registers; owing to the random, discrete structure of unstructured-grid sparse matrices, many-to-many communication may occur inside the many-core array, so communication rings arise with high probability, and because the number of register buffers is limited, communication deadlock results. During the calculation and update of each vector block, however, the structure of the sparse matrix does not change, only the values in it do, so the number and content of intra-many-core communications can be predicted, and communication deadlock is avoided by converting random communication into ordered communication.
To realize ordered communication, a stream of instructions, i.e. an instruction sequence, is designed for each computing core; the computing cores execute the related operations in strict order according to the given instruction stream, which avoids communication deadlock and saves extra communication and synchronization overhead.
The many-core communication model based on instruction sequences and message sequences is as follows:
1. Create an instruction sequence based on the time order, ensuring that the operation instructions execute strictly in that order.
2. Create a message sequence based on the instruction sequence, ensuring that the message processing corresponding to each instruction executes strictly in the instruction order.
Because of the pipelined operation, the vector blocks complete their calculation and update in turn, so at any moment only one computing core is in the data-sending state while the other computing cores are receiving or computing. For the sending core, data are sent with the following priority:
a) send the data destined for x-vector blocks in the same row, in increasing order of block number;
b) send the data destined for x-vector blocks in different rows and different columns, in increasing order of block number;
c) send the data destined for x-vector blocks in the same column, in increasing order of block number;
d) process the x-vector block data the sending core itself must handle.
For the receiving core, data are received with the following priority:
a) receive the data it must process itself;
b) forward the data that does not belong to its own processing.
For the computing core, data are processed in the following order:
a) process the data needed immediately;
b) buffer the data not needed for the moment.
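The sending-core priorities a) to d) amount to a two-level sort key; a minimal Python sketch (illustrative only, with an assumed (row, column, block number) encoding of the destination blocks):

```python
def send_order(targets, me_row, me_col):
    """Order outgoing messages per the sending-core priorities: same-row
    targets first, then different-row-and-different-column, then
    same-column; within each group, increasing block number."""
    def prio(t):
        row, col, num = t
        if row == me_row:
            group = 0          # a) same row
        elif col != me_col:
            group = 1          # b) different row and different column
        else:
            group = 2          # c) same column
        return (group, num)
    return sorted(targets, key=prio)

targets = [(1, 1, 7), (0, 2, 3), (2, 1, 5), (0, 1, 9), (1, 0, 2)]
print(send_order(targets, me_row=0, me_col=1))
```

Because every core derives the same total order from the same precomputed information, no two cores ever disagree about which message comes next.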
The structure of the instruction sequence and the message sequence is shown in fig. 2; the pseudo code is as follows:
Algorithm 2: description of the parallel optimization algorithm based on instruction sequence and message sequence
for all pi, where 0 ≤ i < p do
    Read_instruct();  // read the instruction sequence into the inst array
    Read_info();      // read, per the instructions, the message sequence into the info array
    Read_data();      // read the data to be processed, per the instructions
    for block ← 0 to block_size do
        flag = decide which block the pipeline has reached
        if flag == sending core:
            size = inst[i].send_row_size;
            for (i = 0; i < size; i++) {
                sendr(info[i].length, info[i].id); }   // send row messages and messages to be forwarded
            size += inst[i].send_col_size;
            for (; i < size; i++) {
                sendc(info[i].length, info[i].id); }   // send column messages
            size += inst[i].self_handle_size;
            for (; i < size; i++) {
                Self_handle(); }                       // process the data sent to itself
        if flag == same row as the sending core:
            size = inst[i].recv_row_size;
            for (i = 0; i < size; i++) {
                recvr(info[i].length); }               // receive row messages
            size += inst[i].send_col_size;
            for (; i < size; i++) {
                recvr(info[i].length);                 // receive a message to be forwarded
                sendc(info[i].length, info[i].id); }   // forward it as a column message
        else:
            size = inst[i].recv_col_size;
            for (i = 0; i < size; i++) {
                recvc(info[i].length); }               // receive column messages
        ALLSYN;                                        // hardware synchronization
    endfor
endfor
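The claim behind Algorithm 2, namely that replaying a precomputed instruction sequence turns random many-to-many communication into an ordered, deadlock-free exchange, can be checked with a tiny Python simulator (a conceptual illustration; real register communication on the SW26010 differs in detail, and all names are assumed):

```python
from collections import deque

def replay(instructions):
    """Replay a global sequence of ("send", channel) / ("recv", channel)
    operations and return the peak number of in-flight messages. Because
    every core follows the same precomputed order, a recv never waits on
    a message that a blocked sender still has to produce, and buffer
    occupancy stays bounded."""
    bufs = {}
    peak = 0
    for op, ch in instructions:
        q = bufs.setdefault(ch, deque())
        if op == "send":
            q.append(ch)
        else:
            assert q, f"recv on empty channel {ch}: inconsistent sequence"
            q.popleft()
        peak = max(peak, sum(len(b) for b in bufs.values()))
    return peak

seq = [("send", (0, 1)), ("recv", (0, 1)),   # row message, consumed at once
       ("send", (1, 2)), ("send", (1, 3)),   # two column messages in flight
       ("recv", (1, 2)), ("recv", (1, 3))]
print(replay(seq))
```

If the replay completes without an empty-channel assertion and the peak stays within the register-buffer capacity, the schedule is free of the communication rings described above.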
Through the above optimization, the communication deadlock problem is avoided and the efficiency of intra-many-core communication is greatly improved. Tests on actual applications show that, at grid scales of 100,000 to 1,000,000 cells, the method achieves an average speedup of more than 3x, and at most 4.68x, over the serial algorithm running on the master core.
With this depth dependence problem parallel method guided by instruction sequences and message sequences, for the depth dependence problems common in scientific computing (such as sparse lower triangular solves), the traditional hierarchical-parallel approach is abandoned, the data preprocessing of the hierarchical parallel algorithm is reduced, locally parallel pipelined operation is realized based on instruction sequences and message sequences, the cooperation of intra-many-core communication is improved, the communication deadlock caused by intra-many-core register communication is avoided, a large amount of extra communication and synchronization overhead is saved, and the efficiency of intra-many-core communication and of the optimization is greatly improved.
To facilitate a better understanding of the invention, the terms used herein are briefly explained as follows:
Depth dependence problem: in the solution of some unstructured grid problems (in English, depth dependency), a later solution depends on earlier solutions having been solved, for example the common sparse lower triangular system. Suppose element xi of the x vector depends on xj and the distance between i and j is large, i.e. i-j >> 1; we call this a deep dependency relationship, in which case the corresponding data can only be accessed with fine-grained discrete memory operations.
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (4)

1. A depth dependence problem parallel method guided by instruction sequences and message sequences, characterized in that it is oriented to unstructured grids and comprises the following steps:
s1, evenly dividing the solution vector according to blocks according to the load balance calculated among the secondary cores, determining the number of each vector block, and converting the dependency among elements in the solution vector into the dependency among the vector blocks;
s2, the completion of the calculation of a vector chunk is referred to as an update operation, and the update operation requires the following three steps:
s21, partial updating: the current vector block receives data sent by the previous vector block through register communication, wherein the initial vector block can be directly solved;
s22, self-updating: the current vector block uses the data transmitted by the previous vector block to solve the elements in the current vector block, namely the solution vector corresponding to the vector block;
s23, updating is completed: the current vector block sends the solved elements to the dependent subsequent vector block;
s3, sequentially calculating corresponding vector blocks by each calculation core in the many cores, and repeating the updating operation step of S2 to realize the calculation of the whole pipeline operation;
s4, creating an instruction sequence for each computing core based on the time sequence, wherein the instruction sequence is strictly executed according to the time sequence, and the many-core communication model based on the instruction sequence and the message sequence is as follows:
s41, creating an instruction sequence based on the time sequence to enable the operation instruction of the updating operation to be executed according to the time sequence, wherein the instruction sequence comprises row sending and receiving times, column sending and receiving times and forwarding times;
s42, creating a message sequence based on the instruction sequence, and strictly executing the message processing corresponding to the operation instruction according to the order of the instruction sequence, wherein the message sequence comprises row sending and receiving messages, column sending and receiving messages and forwarding messages;
S5, the vector blocks complete their computation and updates in turn; at any given moment only one computing core is in the data-sending state, while the other computing cores are in the data-receiving state or the computing state;
S6, for the sending core, i.e., the computing core in the data-sending state, data are sent with the following priority:
S61, first send the data destined for vector blocks in the same row as the current vector block, in order of increasing vector block number;
S62, then send the data destined for vector blocks in a different row and a different column, in order of increasing vector block number;
S63, then send the data destined for vector blocks in the same column as the current vector block, in order of increasing vector block number;
S64, finally process the vector block data that the sending core itself must handle, i.e., compute the partial solution of the current vector block from the received data of the preceding vector blocks;
S7, for the receiving core, i.e., the computing core in the data-receiving state, data are received with the following priority:
S71, first receive the data that the core itself must process, i.e., the data from the preceding vector blocks used to compute the solution of the current vector block;
S72, then forward the data that does not belong to the core's own processing, i.e., data not needed for solving the current vector block;
S8, for the computing core, i.e., the computing core in the computing state, data are processed in the following order:
S81, first process the immediately needed data, i.e., data whose row number in the preceding vector block corresponds to a non-zero column number of the current vector block;
S82, then buffer the temporarily unneeded data in a buffer area, i.e., a buffer on the slave (computing) core.
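As a rough illustration of the pipelined update in steps S2 and S3 (a serial simulation, not the patented many-core implementation), the depth-dependent problem can be taken to be a blocked forward substitution on a lower-triangular system L x = b: each vector block first applies the contributions of already-solved predecessor blocks (the partial update), then solves its own elements (the self-update), and finally makes them available to successor blocks. All names below are hypothetical.

```python
# Illustrative sketch only: serial simulation of the block-pipelined
# update of steps S2-S3 for a lower-triangular solve L x = b,
# partitioned into row blocks of size block_size.

def blocked_forward_solve(L, b, block_size):
    n = len(b)
    x = [0.0] * n
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        for i in range(start, end):
            # S21 (partial update): apply data "sent" by preceding blocks
            s = sum(L[i][j] * x[j] for j in range(start))
            # S22 (self-update): finish with elements solved in this block
            s += sum(L[i][j] * x[j] for j in range(start, i))
            x[i] = (b[i] - s) / L[i][i]
        # S23: the many-core version would now send x[start:end] to the
        # dependent successor blocks; here they simply remain in x.
    return x

L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 0.0],
     [0.5, 1.0, 2.0, 0.0],
     [1.0, 0.0, 1.0, 4.0]]
b = [2.0, 5.0, 5.0, 9.0]
x = blocked_forward_solve(L, b, 2)
```

In the pipelined version of step S3, each block's inner loop would run on a different computing core, starting as soon as the predecessor data arrives.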
2. The depth dependence problem parallel method based on instruction sequence and message sequence guidance according to claim 1, wherein the partial update in S2 is performed concurrently and the self-update is performed serially.
3. The depth dependence problem parallel method based on instruction sequence and message sequence guidance according to claim 1, wherein the data in step S21 comprise the solved values of the preceding vector blocks and the corresponding position information.
4. The depth dependence problem parallel method based on instruction sequence and message sequence guidance according to claim 1, wherein a vector block cannot begin its computation until all the vector blocks preceding it have completed their computation and the elements required for its own computation have been received.
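The receive-side priorities of steps S7 and S8 can be sketched as a message-dispatch loop: data addressed to another core is forwarded, immediately needed data is consumed, and temporarily unneeded data is parked in a buffer. This is an illustrative sketch only; the message fields ("dest", "row", "value") and all names are hypothetical, not the patented implementation.

```python
# Illustrative sketch of S71/S72 (receive vs. forward) and S81/S82
# (process now vs. buffer) on a single receiving/computing core.
from collections import deque

def handle_messages(my_block, needed_rows, inbox, forward):
    """Dispatch incoming messages by the priorities of S7-S8:
    forward data not addressed to this core (S72), process data whose
    row matches a non-zero column of the current block (S81), and
    buffer the rest for later use (S82)."""
    buffered = deque()
    consumed = []
    for msg in inbox:
        if msg["dest"] != my_block:
            forward(msg)            # S72: not ours -- forward it onward
        elif msg["row"] in needed_rows:
            consumed.append(msg)    # S81: immediately needed data
        else:
            buffered.append(msg)    # S82: park in the slave-core buffer
    return consumed, buffered

inbox = [
    {"dest": 3, "row": 0, "value": 1.0},   # needed by this core now
    {"dest": 2, "row": 5, "value": 2.0},   # belongs to another core
    {"dest": 3, "row": 7, "value": 3.0},   # ours, but not needed yet
]
forwarded = []
consumed, buffered = handle_messages(3, {0, 1}, inbox, forwarded.append)
```

On real many-core hardware the forwarding and buffering would use on-chip register communication and the slave core's local store rather than Python lists.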
CN201910879931.6A 2019-09-18 2019-09-18 Depth dependence problem parallel method based on instruction sequence and message sequence guidance Withdrawn CN112527394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910879931.6A CN112527394A (en) 2019-09-18 2019-09-18 Depth dependence problem parallel method based on instruction sequence and message sequence guidance

Publications (1)

Publication Number Publication Date
CN112527394A true CN112527394A (en) 2021-03-19

Family

ID=74974946


Similar Documents

Publication Publication Date Title
CN108509270B (en) High-performance parallel implementation method of K-means algorithm on domestic Shenwei 26010 many-core processor
TW202014939A (en) Modifying machine learning models to improve locality
US20190138922A1 (en) Apparatus and methods for forward propagation in neural networks supporting discrete data
CN112862088A (en) Distributed deep learning method based on pipeline annular parameter communication
CN103699442B (en) Under MapReduce Computational frames can iterative data processing method
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
CN112231630B (en) Sparse matrix solving method based on FPGA parallel acceleration
CN104052495A (en) Low density parity check code hierarchical decoding architecture for reducing hardware buffer
CN117435855B (en) Method for performing convolution operation, electronic device, and storage medium
CN105022631A (en) Scientific calculation-orientated floating-point data parallel lossless compression method
CN110086602A (en) The Fast implementation of SM3 cryptographic Hash algorithms based on GPU
CN110620587A (en) Polarization code BP decoding unit based on different data type transmission
CN114492753A (en) Sparse accelerator applied to on-chip training
CN117827463A (en) Method, apparatus and storage medium for performing attention calculations
CN110135067B (en) Helicopter flow field overlapping mixed grid parallel method under double time step method
CN112527394A (en) Depth dependence problem parallel method based on instruction sequence and message sequence guidance
CN110008436B (en) Fast Fourier transform method, system and storage medium based on data stream architecture
CN111832144B (en) Full-amplitude quantum computing simulation method
CN112446004B (en) Non-structural grid DILU preconditioned sub-many-core parallel optimization method
CA3187339A1 (en) Reducing resources in quantum circuits
CN111723246B (en) Data processing method, device and storage medium
CN107368287B (en) Acceleration system, acceleration device and acceleration method for cyclic dependence of data stream structure
JP6961950B2 (en) Storage method, storage device and storage program
KR20220100030A (en) Pattern-Based Cache Block Compression
US20190073584A1 (en) Apparatus and methods for forward propagation in neural networks supporting discrete data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210319