CN111966405B - Polar code high-speed parallel decoding method based on GPU - Google Patents

Polar code high-speed parallel decoding method based on GPU

Info

Publication number
CN111966405B
CN111966405B (application CN202010629868.3A)
Authority
CN
China
Prior art keywords
thread
stage
local
parallel
blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010629868.3A
Other languages
Chinese (zh)
Other versions
CN111966405A (en)
Inventor
李舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Innovation Research Institute of Beihang University
Original Assignee
Hangzhou Innovation Research Institute of Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Innovation Research Institute of Beihang University filed Critical Hangzhou Innovation Research Institute of Beihang University
Priority to CN202010629868.3A priority Critical patent/CN111966405B/en
Publication of CN111966405A publication Critical patent/CN111966405A/en
Application granted granted Critical
Publication of CN111966405B publication Critical patent/CN111966405B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/13Linear codes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00Arrangements for detecting or preventing errors in the information received
    • H04L1/004Arrangements for detecting or preventing errors in the information received by using forward error control
    • H04L1/0056Systems characterized by the type of code used
    • H04L1/0057Block codes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a GPU-based Polar code high-speed parallel decoding method in which the whole decoding process is divided into three stages: an initialization stage, a decoding stage and a result-return stage. Specifically: Step 1: host initialization; Step 2: GPU initialization; Step 3: the decoding kernel function performs a number of loop iterations, the maximum number of loops being preset by the program; Step 4: thread No. 0 of every thread block of the factor graph p_good obtains the decoding result by inversely permuting Local_L[][0] + Local_R[][0] in the shared memory; Step 5: the host transfers the decoding result from the GPU back to host memory. The method contains three levels of parallelism: among multiple subgraphs (factor graphs), among multiple thread blocks, and among multiple threads within a thread block. In addition, the method minimizes the kernel launch overhead and improves memory-access efficiency and running speed.

Description

Polar code high-speed parallel decoding method based on GPU
Technical Field
The invention belongs to the technical field of communication, and relates to a Polar code high-speed parallel decoding method based on a Graphics Processing Unit (GPU).
Background
Polar codes, proposed by Erdal Arikan in 2008 (reference [1]: Erdal Arikan, "Channel Polarization: A Method for Constructing Capacity-Achieving Codes", IEEE ISIT 2008), are the only channel codes that can currently be rigorously proven to achieve the Shannon limit. Polar codes have been officially adopted by the 5G standardization organization. Decoding methods for Polar codes fall into two families: successive-cancellation-based methods and belief-propagation-based methods. Successive-cancellation-based methods require little computation, but the algorithm is inherently serial, so the decoding delay is large. For belief-propagation-based methods, a belief propagation list algorithm, i.e., an iterative algorithm over multiple permuted factor graphs, is generally adopted to guarantee the error-correction performance of Polar code decoding; this makes the computation heavy, but the belief propagation list algorithm has great potential for parallel implementation.
On the other hand, GPU technology has developed rapidly in recent years, and a commercial-grade GPU card can have over 4000 cores for parallel processing, which provides a cost-effective hardware basis for parallel computing.
Disclosure of Invention
The invention aims to provide a GPU-based Polar code high-speed parallel decoding method to realize low-delay and high-throughput decoding.
The invention provides a Polar code high-speed parallel decoding method based on a GPU. The method comprises three levels of parallelism and can fully utilize the core resources of the GPU. The invention also designs an efficient distributed storage method, which improves memory-access efficiency and running speed. The whole decoding process can be divided into three stages: an initialization stage, a decoding stage and a result-return stage. The initialization stage comprises the following steps 1 and 2, the decoding stage comprises the following steps 3 and 4, and the following step 5 is the result-return stage.
Step 1: host initialization. This comprises, in order: allocating memory space for the information bit flags, the factor graph permutation and inverse permutation information, the signals received by the receiver and the decoding result, i.e., the log-likelihood ratios of the source bits; initializing information and variables; storing the received signals; and computing the log-likelihood ratios of the coded bits.
Step 2: GPU initialization. This comprises, in order: allocating the global memory of the GPU, the host sending data to the GPU, starting the parallel decoding threads of the GPU, the GPU allocating shared memory, initializing the shared memory, and assigning values to the shared-memory arrays from the global memory.
Step 3: the decoding kernel function performs a number of loop iterations, the maximum number of loops being preset by the program. Each loop comprises: the L1 stage, exchanging thread-block shared memory between the L1 and L2 stages, the L2 stage, the R1 stage, exchanging thread-block shared memory between the R1 and R2 stages, the R2 stage, and checking the loop-termination condition. If some factor graph meets the early-termination condition during the loop, or the maximum number of loops has been reached, the variable p_good is set, the loop terminates and execution jumps to step 4.
Step 4: for thread No. 0 of all thread blocks of the factor graph p_good, i.e., the threads ((p_good, b), 0), where b = 0, 1, ..., N1-1, N1 = 2^(n1) being the number of thread blocks responsible for one factor graph and N the code length of the Polar code, the decoding result is obtained by inversely permuting Local_L[][0] + Local_R[][0] in the shared memory.
Step 5: the host transfers the decoding result from the GPU back to host memory.
Each loop of the loop iteration of step 3 comprises the following steps:
Step 3.1: the first stage of the leftward iteration, the L1 stage, includes iterations of levels n-1, ..., n-n1, where n = log2(N) and n1 = log2(N1);
Step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory;
Step 3.3: the second stage of the leftward iteration, the L2 stage, includes iterations of levels n-n1-1, ..., 0;
Step 3.4: the first stage of the rightward iteration, the R1 stage, includes iterations of levels 0, ..., n-n1-1;
Step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory;
Step 3.6: the second stage of the rightward iteration, the R2 stage, includes iterations of levels n-n1, ..., n-1;
Step 3.7: judge whether some factor graph meets the early-termination condition or the maximum number of loops has been reached, and set the variable p_good.
The L1, L2, R1 and R2 stages in step 3 each contain the following three levels of parallelism:
The first level is parallelism among multiple factor graphs; each factor graph is handled by N1 thread blocks. Because different factor graphs have no data dependency during the iterations, the thread blocks of different factor graphs naturally run in parallel.
The second level is parallelism among the multiple thread blocks of the same factor graph. When the code length N of the Polar code is large, the core resources and shared-memory resources of one streaming multiprocessor on the GPU cannot support fully parallelizing a factor graph. For this reason, in the present invention the iteration of one factor graph is handled by N1 thread blocks. The invention divides the leftward propagation and the rightward propagation of each iteration into two stages each, i.e., four stages in total. Leftward propagation includes iterations of levels n-1, ..., 0 and is divided into two stages: the first stage covers levels n-1 down to n-n1 and is called the L1 stage; the second stage covers levels n-n1-1 down to 0 and is called the L2 stage. Rightward propagation runs in the opposite direction and includes iterations of levels 0, 1, ..., n-1, divided into two stages: the first stage covers levels 0 to n-n1-1 and is called the R1 stage; the second stage covers levels n-n1 to n-1 and is called the R2 stage.
The third level is multi-thread parallelism within the same thread block. The computation of each thread block at each level can be divided into N/N1/2 = 2^(n-n1-1) subtasks with no data dependency between them; these subtasks are divided into min(T, 2^(n-n1-1)) groups, each group of subtasks is handled by one thread in the thread block, and all threads execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
The work division for the second level (parallelism among the thread blocks of the same factor graph) and the third level (multi-thread parallelism, i.e., the subtasks, within the same thread block) is as follows:
(1) in the L1 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the L1 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the L1 stage;
(2) in the L2 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the L2 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the L2 stage;
(3) in the R1 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the R1 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the R1 stage;
(4) in the R2 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the R2 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the R2 stage.
in step 2, when the GPU is initialized, the L and R arrays used in the decoding process are distributively stored in the shared memory of each thread block, that is, in a complete cycle process, the shared memory is exchanged for 2 times only by the thread blocks existing in the global, and all other operations can use the shared memory in the thread blocks. The method comprises the following specific steps:
the storage space on the GPU mainly includes a global memory and a shared memory within a thread block (referred to as an intra-block shared memory or a shared memory). The access speed of the shared memory is high, but the capacity is relatively small; the global memory space is large, and the access speed is low.
The main data of Polar code belief propagation decoding are matrices L and R, both of which are N (N + 1). When the code length N is large, the storage space required by L and R is large. In order to improve the memory access speed, the invention divides L and R in the calculation process, and stores the L and R in two-dimensional arrays of Local _ L and Local _ R in a shared memory in blocks of N thread blocks, wherein the first dimension and the second dimension of the Local _ L are respectively N/N1 and N +1, and the size of the Local _ R is the same as that of the Local _ L. The matrix L and R are distributively stored in the shared memory in each thread block by the following method:
(1) For 0 <= j <= n-n1, L[b*(N/N1)+d2*N1+d1][j] and R[b*(N/N1)+d2*N1+d1][j] are stored in Local_L[d2*N1+d1][j] and Local_R[d2*N1+d1][j] in the shared memory of the b-th thread block, where b = 0, 1, ..., N1-1; d2 = 0, 1, ..., N/(N1*N1)-1; d1 = 0, 1, ..., N1-1;
(2) For n-n1 <= j <= n, L[b*(N/N1)+d2*N1+d1][j] and R[b*(N/N1)+d2*N1+d1][j] are stored in Local_L[d2*N1+b][j+1] and Local_R[d2*N1+b][j+1] in the shared memory of the d1-th thread block, where b = 0, 1, ..., N1-1; d2 = 0, 1, ..., N/(N1*N1)-1; d1 = 0, 1, ..., N1-1.
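As an illustration of the mapping in (1) and (2), the following is a minimal CUDA-style sketch; the helper names map_low_levels and map_high_levels and the struct Loc are illustrative and not taken from the patent, and the code assumes N = 2^n, N1 = 2^(n1) and that N1*N1 divides N, as in the notation above.

struct Loc { int block; int row; int col; };

// Hypothetical helper: levels 0 <= j <= n-n1. Global row b*(N/N1) + d2*N1 + d1 is held
// by thread block b at shared-memory row d2*N1 + d1, column j.
__host__ __device__ Loc map_low_levels(int global_row, int j, int N, int N1) {
    int b = global_row / (N / N1);
    int r = global_row % (N / N1);   // r = d2*N1 + d1
    return Loc{ b, r, j };
}

// Hypothetical helper: levels n-n1 <= j <= n. Global row b*(N/N1) + d2*N1 + d1 is held
// by thread block d1 at shared-memory row d2*N1 + b, column j+1 (one-column shift).
__host__ __device__ Loc map_high_levels(int global_row, int j, int N, int N1) {
    int b  = global_row / (N / N1);
    int r  = global_row % (N / N1);
    int d2 = r / N1;
    int d1 = r % N1;
    return Loc{ d1, d2 * N1 + b, j + 1 };
}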
With this distributed storage method, the distribution of L and R over the thread blocks is the same at every level (s = n-1, ..., n-n1) of the L1 stage, so during the level iterations of the L1 stage each thread block only needs its intra-block shared memory and no data needs to be exchanged between thread blocks. Similarly, the level iterations within the L2, R1 and R2 stages do not require data exchange between thread blocks either. The distributed storage schemes of L1 and L2 differ, so data must be exchanged between thread blocks through the global memory between the end of the L1 stage and the start of the L2 stage; the storage schemes of L2 and R1 are the same, so no exchange through the global memory is needed between the end of the L2 stage and the start of the R1 stage. Likewise, data must be exchanged through the global memory between the end of the R1 stage and the start of the R2 stage, while no exchange is needed between the end of the R2 stage and the L1 stage of the next iteration. Therefore, in one complete loop (i.e., 4 stages and 2n level iterations in total), the thread blocks exchange shared-memory data through the global memory only 2 times, and all other operations use the intra-block shared memory.
In order to improve the locality of access, the GPU global memory allocation described in step 2 is optimized as follows:
(1) The global memory contains data common to all factor graphs and data private to each factor graph. The global memory used by each factor graph is stored contiguously. The global memory used by the p-th factor graph is a structure graph_info[p], whose members include the factor graph permutation array, the inverse permutation array, and the global memory space swap used by the thread blocks to exchange shared-memory data.
(2) graph_info[p].swap, the buffer through which the thread blocks exchange shared-memory data, is stored contiguously in the order in which the thread blocks read it, i.e., when each thread block reads from graph_info[p].swap, its address space is contiguous.
The invention has the following advantages and positive effects:
(1) The parallel decoding of the invention contains three levels of parallelism. The first level is parallelism among multiple factor graphs, each factor graph being handled by several thread blocks; because different factor graphs have no data dependency during the iterations, the thread blocks of different factor graphs naturally run in parallel. The second level is parallelism among the thread blocks of the same factor graph: the iteration of one factor graph is handled by N1 thread blocks, and the work division designed by the invention guarantees that there is no data dependency between thread blocks within the same stage (i.e., L1, L2, R1 or R2), so the N1 thread blocks can execute in parallel. The third level is multi-thread parallelism within the same thread block: the invention divides the iteration of one thread block at each level into N/(2*N1) mutually independent subtasks, distributes them to the threads of the thread block for parallel execution, and performs intra-block synchronization after each thread has executed all subtasks it is responsible for at that level. The method can fully utilize the parallel core resources of the GPU and has high computational efficiency. The synchronization overhead is small: in one complete loop (i.e., 4 stages and 2n level iterations), only 2 synchronizations between thread blocks are needed, and the remaining 2n-2 levels all use the intra-block synchronization mechanism. In addition, the whole decoding process on the GPU uses a single kernel function, which minimizes the kernel launch overhead.
(2) The invention stores the main iteration data distributively in the shared memory of each thread block, which improves memory-access efficiency and running speed. In one complete loop (i.e., 4 stages and 2n level iterations), the thread blocks exchange shared-memory data through the global memory only 2 times, and all other operations use the intra-block shared memory. Moreover, in the global memory space used by the invention, the private data of the same factor graph are stored contiguously, and the global memory used to exchange shared-memory data between thread blocks is stored contiguously in the order in which the thread blocks read it, which optimizes storage locality and improves memory-access efficiency and running speed.
Drawings
FIG. 1 is a flowchart of a GPU-based Polar code high-speed parallel decoding method according to the invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
The invention provides a GPU-based Polar code high-speed parallel decoding method that adopts a belief-propagation-list-based decoding algorithm. The decoding method comprises three stages: an initialization stage, a decoding stage, and a result-return stage. The initialization stage includes the following steps 1 and 2, the decoding stage includes the following steps 3 and 4, and the following step 5 is the result-return stage. The whole decoding process is shown in FIG. 1 and specifically includes the following steps:
step 1: and (4) initializing the host. Sequentially comprises the following steps: allocating memory space for information bit marks, factor graph permutation and inverse permutation information, signals received by a receiver, decoding results, namely log-likelihood ratios of source bits (step 1.1), initializing information and variables (step 1.2), storing the received signals and calculating the log-likelihood ratios of coded bits (step 1.3); the method comprises the following specific steps:
step 1.1: allocating a host memory space, sequentially comprising: the information bit flag array InfoBitFlags, the signal array y received by the receiver, and the log-likelihood ratio array cLLR of the coded bits.
Step 1.2: the host initializes the information bit flag array InfoBitFlags, the factor graph permutation array Perm, and calculates the inverse permutation array InvPerm according to the factor graph permutation array.
Step 1.3: the host stores the signal received by the receiver in the array y and the signal-to-noise ratio of the channel in the variable SNR. The host then computes the coded-bit log-likelihood ratio array cLLR from the received signal y and the signal-to-noise ratio SNR.
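The patent does not spell out the formula that maps y and SNR to cLLR. As a minimal sketch, assuming unit-energy BPSK over a real AWGN channel and SNR given as Es/N0 in linear scale (both assumptions, not taken from the patent), the host-side computation of step 1.3 could look like:

// Hypothetical host-side computation of the coded-bit LLRs (step 1.3).
void compute_coded_bit_llrs(const float* y, float snr_linear, float* cLLR, int N) {
    // Noise variance under the assumed Es/N0 convention (assumption).
    float sigma2 = 1.0f / (2.0f * snr_linear);
    for (int i = 0; i < N; ++i)
        cLLR[i] = 2.0f * y[i] / sigma2;   // channel LLR of each coded bit
}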
Step 2: and initializing the GPU. Sequentially comprises the following steps: distributing global memory of the GPU, sending data to the GPU by the host, starting a parallel decoding thread of the GPU, distributing shared memory by the GPU, initializing the shared memory, and assigning values to the array of the shared memory according to the global memory; the specific process is as follows:
step 2.1: and distributing the global memory on the GPU. Including common data for all factor graphs and proprietary data for each factor graph. The common data of all the factor graphs comprises a code bit log-likelihood ratio array cLLR, an information bit flag array InfoBitFlags and a decoding result, namely a source bit log-likelihood ratio array uLLR. The proprietary data of each factor graph is stored continuously, the proprietary data of the p-th factor graph is stored in a structure graph _ info [ p ], and the members of the proprietary data comprise: the system comprises a factor graph permutation array Perm, an inverse permutation array InvPerm and a global memory space swap for a thread block exchange shared memory.
Step 2.2: and sending an information bit flag array, a factor graph permutation array, a reverse permutation array and a coding bit log-likelihood ratio array of a host memory to the GPU, wherein the factor graph permutation array and the reverse permutation array are factor graph exclusive data and are stored in a structure graph _ info [ p ].
Step 2.3: the host starts parallel decoding threads of the GPU, and the number of the thread blocks is P × N1, each thread block comprises T threads, wherein T is equal to the number of cores contained in each streaming multiprocessor. All threads execute the same decoding kernel function, and the threads are distinguished by thread indexes, wherein the index of each thread is ((p, b), t), the (p, b) is the thread block index, and the t is the thread index in the thread block.
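A minimal launch sketch of step 2.3 follows; the kernel name decode_kernel, its argument list and the d_-prefixed device pointers are illustrative, not taken from the patent. Inside the kernel the indices can be recovered as p = blockIdx.x / N1, b = blockIdx.x % N1 and t = threadIdx.x.

// Hypothetical launch of the single decoding kernel: P*N1 thread blocks, T threads each.
dim3 grid(P * N1);
dim3 block(T);
decode_kernel<<<grid, block>>>(d_graph_info, d_cLLR, d_InfoBitFlags, d_uLLR,
                               N, N1, P, max_iterations);
cudaDeviceSynchronize();

If a grid-wide barrier is used to realize the inter-block synchronization of steps 3.2 and 3.5 (see the sketch there), the kernel must instead be started with cudaLaunchCooperativeKernel.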
Step 2.4: the GPU threads allocate the intra-block shared memory, which comprises the two-dimensional arrays Local_L[N/N1][n+2] and Local_R[N/N1][n+2]; all elements are initialized to 0.
Step 2.5: thread No. 0 of each thread block on the GPU, i.e., thread ((p, b), 0) (p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1), assigns values to Local_L[][n+1] and Local_R[][0] according to the information bit flags and the coded-bit log-likelihood ratios.
In the program this step can be implemented with one loop, with the following flow:
(1) The loop index variable is d, d = 0, 1, ..., N/N1-1.
(2) Compute dd = (d % N1) * N1 + (d / N1).
(3) The global memory address corresponding to Local_L[d][n+1] in the local (shared) memory is dd*N1+b, and its index in the original factor graph (i.e., the factor graph before permutation) is graph_info[p].InvPerm[dd*N1+b], so the value of cLLR[graph_info[p].InvPerm[dd*N1+b]] is assigned to Local_L[d][n+1].
(4) The global memory address corresponding to Local_R[d][0] in the local (shared) memory is b*N1+d, and its index in the original factor graph (i.e., the factor graph before permutation) is graph_info[p].InvPerm[b*N1+d]. If the information bit flag InfoBitFlags[graph_info[p].InvPerm[b*N1+d]] is 0, Local_R[d][0] is set to 1e+30.
An example procedure is given in the original as a code-listing image, which is not reproduced here.
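Since the original listing is not reproduced, the following is a minimal CUDA sketch of the flow (1)-(4) above, executed inside the decoding kernel by thread No. 0 of thread block (p, b); the variables N, N1, n, p, b, Local_L, Local_R, graph_info, cLLR and InfoBitFlags are assumed to be in scope, and the index arithmetic follows the description above.

// Hypothetical in-kernel initialization of the shared-memory arrays (step 2.5).
if (threadIdx.x == 0) {
    for (int d = 0; d < N / N1; ++d) {
        int dd = (d % N1) * N1 + (d / N1);
        // Rightmost column of Local_L receives the coded-bit LLR of the permuted position.
        Local_L[d][n + 1] = cLLR[graph_info[p].InvPerm[dd * N1 + b]];
        // Column 0 of Local_R marks frozen (non-information) bits with a very large LLR.
        if (InfoBitFlags[graph_info[p].InvPerm[b * N1 + d]] == 0)
            Local_R[d][0] = 1e+30f;
    }
}
__syncthreads();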
There are P*2^(n1) threads with index ((p, b), 0), p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1, and these P*2^(n1) threads can execute in parallel.
Step 3: the decoding kernel function performs a number of loop iterations, the maximum number of loops being preset by the program. Each loop includes the L1 stage (step 3.1), exchanging thread-block shared memory between the L1 and L2 stages (step 3.2), the L2 stage (step 3.3), the R1 stage (step 3.4), exchanging thread-block shared memory between the R1 and R2 stages (step 3.5), the R2 stage (step 3.6), and judging whether some factor graph satisfies the early-termination condition or the maximum number of loops has been reached and setting the variable p_good (step 3.7).
Step 3.1: the first stage of the leftward iteration, the L1 stage, with level numbers s = n-1, ..., n-n1.
Each level contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0, 1, ..., P-1) is handled by the thread blocks indexed (p, 0), ..., (p, 2^(n1)-1). The thread blocks of different factor graphs are independent of each other and can run in parallel.
(2) Parallelism among the thread blocks of the same factor graph: each factor graph is computed by 2^(n1) thread blocks. The set of first-dimension indices of L and R used by the b-th thread block (b = 0, 1, ..., 2^(n1)-1) is L1_Block(b) = { ia*2^(n1) + b : ia = 0, ..., 2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0, 1, ..., 2^(n-n1-1)-1) uses, at level s, the set of first-dimension indices of L and R
L1_Task(b, i, s) = { floor(i/2^(s-n1))*2^(s+1) + a*2^s + (i % 2^(s-n1))*2^(n1) + b : a = 0, 1 };
therefore there is no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group of subtasks is handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the i-th subtask of thread block (p, b) at level s needs to compute:
L(up, s) = f(L(down, s+1) + R(down, s), L(up, s+1))
L(down, s) = L(down, s+1) + f(L(up, s+1), R(up, s))
where
up = floor(i/2^(s-n1))*2^(s+1) + (i % 2^(s-n1))*2^(n1) + b
down = floor(i/2^(s-n1))*2^(s+1) + 2^s + (i % 2^(s-n1))*2^(n1) + b
According to the distributed storage scheme of the present invention, in the L1 stage the row indices of up and down in the shared memory are Local_up and Local_down, and the column indices corresponding to levels s and s+1 are s+1 and s+2, respectively, where Local_up = floor(i'/2^j)*2^(j+1) + (i' % 2^j), Local_down = floor(i'/2^j)*2^(j+1) + 2^j + (i' % 2^j), i' = (i % 2^(n-2n1))*2^(n-2n1) + floor(i/2^(n-2n1)), i' = 0, 1, ..., 2^(n-n1-1)-1, j = s-n+n1, and Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_L[Local_up][s+1] = f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_L[Local_up][s+2]);
Local_L[Local_down][s+1] = Local_L[Local_down][s+2] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]);
The function f is defined as f(x, y) = 2*tanh^(-1)(tanh(x/2)*tanh(y/2)); in actual computation it is generally approximated by f(x, y) ≈ 0.9375*sgn(x)*sgn(y)*min(|x|, |y|).
In the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) The index variable of the outer loop is s, s = n-1, n-2, ..., n-n1 (note that in this loop s is decremented by 1 each time).
(2) Compute j = s - n + n1.
(3) The index variable of the inner loop is i, i = t, t+T, ..., t + floor((2^(n-n1-1)-1-t)/T)*T, where t is the thread index within the thread block.
(4) The binary representation of i contains n-n1-1 bits; the low j bits of i, i.e., i % 2^j, are assigned to the variable i_LSB, and the high n-n1-1-j bits of i, i.e., floor(i/2^j), are assigned to the variable i_MSB.
(5) Compute the shared-memory addresses Local_up and Local_down, where Local_up = (i_MSB << (j+1)) + i_LSB and Local_down = Local_up + (1 << j).
(6) Compute Local_L[Local_up][s+1] and Local_L[Local_down][s+1]: the former is f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_L[Local_up][s+2]), and the latter is Local_L[Local_down][s+2] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]).
(7) After the inner loop completes, __syncthreads() is called to synchronize the threads in the thread block.
An example procedure is given in the original as a code-listing image, which is not reproduced here.
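Since the original listing is not reproduced, the following is a minimal CUDA sketch of the flow (1)-(7) above, as executed by one thread of thread block (p, b); Local_L, Local_R, n, n1 and T are assumed to be in scope, the function name f_approx is illustrative, and the L2, R1 and R2 stages follow the same pattern with their own index formulas.

// Hypothetical scaled min-sum approximation of f(x, y) = 2*atanh(tanh(x/2)*tanh(y/2)).
__device__ __forceinline__ float f_approx(float x, float y) {
    float s = ((x < 0.0f) != (y < 0.0f)) ? -1.0f : 1.0f;   // sgn(x)*sgn(y)
    return 0.9375f * s * fminf(fabsf(x), fabsf(y));
}

// Hypothetical per-thread body of the L1 stage (levels n-1 down to n-n1).
const int t = threadIdx.x;
for (int s = n - 1; s >= n - n1; --s) {
    int j = s - n + n1;
    for (int i = t; i < (1 << (n - n1 - 1)); i += T) {
        int i_LSB = i & ((1 << j) - 1);               // low j bits of i
        int i_MSB = i >> j;                           // remaining high bits of i
        int local_up   = (i_MSB << (j + 1)) + i_LSB;
        int local_down = local_up + (1 << j);
        Local_L[local_up][s + 1] =
            f_approx(Local_L[local_down][s + 2] + Local_R[local_down][s + 1],
                     Local_L[local_up][s + 2]);
        Local_L[local_down][s + 1] =
            Local_L[local_down][s + 2] +
            f_approx(Local_L[local_up][s + 2], Local_R[local_up][s + 1]);
    }
    __syncthreads();                                  // intra-block synchronization per level
}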
Step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory, i.e., the thread-block shared memory is exchanged between the L1 and L2 stages. The details are as follows:
Step 3.2.1: thread No. 0 of each thread block, i.e., thread ((p, b), 0) (p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1), writes Local_L[d2*2^(n1)+d1][n-n1+1] in the shared memory of thread block (p, b) to graph_info[p].swap[d1*2^(n-n1) + d2*2^(n1) + b] in the global memory, where d1 = 0, 1, ..., 2^(n1)-1; d2 = 0, 1, ..., 2^(n-2n1)-1. There are P*2^(n1) threads numbered ((p, b), 0), p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1. Since the addresses to which the threads write swap do not overlap, these P*2^(n1) threads can execute in parallel.
Step 3.2.2: the 2^(n1) thread blocks of each factor graph perform synchronization between thread blocks. No synchronization is required between thread blocks of different factor graphs.
Step 3.2.3: thread No. 0 of thread block (p, d1), i.e., thread ((p, d1), 0), writes graph_info[p].swap[d1*2^(n-n1) + d2*2^(n1) + b] in the global memory to Local_L[d2*2^(n1)+b][n-n1] in the shared memory of thread block (p, d1), where b = 0, 1, ..., 2^(n1)-1; d1 = 0, 1, ..., 2^(n1)-1; d2 = 0, 1, ..., 2^(n-2n1)-1. There are P*2^(n1) threads numbered ((p, d1), 0), p = 0, 1, ..., P-1; d1 = 0, 1, ..., 2^(n1)-1, and these P*2^(n1) threads can execute in parallel.
Step 3.3: the second stage of the left iteration, stage L2, has a stage number s-n 1-1.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0, 1, ..., P-1) is handled by the thread blocks numbered (p, 0), ..., (p, 2^(n1)-1). The thread blocks of different factor graphs are independent of each other and can run in parallel.
(2) Parallelism among the thread blocks of the same factor graph: each factor graph is computed by 2^(n1) thread blocks. The set of first-dimension indices of L and R used by the b-th thread block (b = 0, 1, ..., 2^(n1)-1) is L2_Block(b) = { b*2^(n-n1) + ia : ia = 0, ..., 2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0, 1, ..., 2^(n-n1-1)-1) uses, at level s, the set of first-dimension indices of L and R
L2_Task(b, i, s) = { b*2^(n-n1) + floor(i/2^s)*2^(s+1) + a*2^s + (i % 2^s) : a = 0, 1 };
therefore there is no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group of subtasks is handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the i-th subtask of thread block (p, b) at level s needs to compute:
L(up, s) = f(L(down, s+1) + R(down, s), L(up, s+1))
L(down, s) = L(down, s+1) + f(L(up, s+1), R(up, s))
where
up = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + (i % 2^s)
down = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + 2^s + (i % 2^s)
According to the distributed storage scheme of the present invention, in the L2 stage the row indices of up and down in the shared memory are Local_up and Local_down, and the column indices corresponding to levels s and s+1 are s and s+1, respectively, where Local_up = floor(i/2^s)*2^(s+1) + (i % 2^s), Local_down = floor(i/2^s)*2^(s+1) + 2^s + (i % 2^s), and Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_L[Local_up][s] = f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_L[Local_up][s+1]);
Local_L[Local_down][s] = Local_L[Local_down][s+1] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) The index variable of the outer loop is s, s = n-n1-1, n-n1-2, ..., 0 (note that in this loop s is decremented by 1 each time).
(2) The index variable of the inner loop is i, i = t, t+T, ..., t + floor((2^(n-n1-1)-1-t)/T)*T, where t is the thread index within the thread block.
(3) The binary representation of i contains n-n1-1 bits; the low s bits of i, i.e., i % 2^s, are assigned to the variable i_LSB, and the high n-n1-1-s bits of i, i.e., floor(i/2^s), are assigned to the variable i_MSB.
(4) Compute the shared-memory addresses Local_up and Local_down, where Local_up = (i_MSB << (s+1)) + i_LSB and Local_down = Local_up + (1 << s).
(5) Compute Local_L[Local_up][s] and Local_L[Local_down][s]: the former is f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_L[Local_up][s+1]), and the latter is Local_L[Local_down][s+1] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]).
(6) After the inner loop completes, __syncthreads() is called to synchronize the threads in the thread block.
An example procedure is given in the original as a code-listing image, which is not reproduced here.
Step 3.4: the first stage of the rightward iteration, the R1 stage, with level numbers s = 0, ..., n-n1-1.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0, 1, ..., P-1) is handled by the thread blocks numbered (p, 0), ..., (p, 2^(n1)-1). The thread blocks of different factor graphs are independent of each other and can run in parallel.
(2) Parallelism among the thread blocks of the same factor graph: each factor graph is computed by 2^(n1) thread blocks. The set of first-dimension indices of L and R used by the b-th thread block (b = 0, 1, ..., 2^(n1)-1) is R1_Block(b) = { b*2^(n-n1) + ia : ia = 0, ..., 2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0, 1, ..., 2^(n-n1-1)-1) uses, at level s, the set of first-dimension indices of L and R
R1_Task(b, i, s) = { b*2^(n-n1) + floor(i/2^s)*2^(s+1) + a*2^s + (i % 2^s) : a = 0, 1 };
therefore there is no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group of subtasks is handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the i-th subtask of thread block (p, b) at level s needs to compute:
R(up, s+1) = f(L(down, s+1) + R(down, s), R(up, s))
R(down, s+1) = R(down, s) + f(L(up, s+1), R(up, s))
where
up = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + (i % 2^s)
down = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + 2^s + (i % 2^s)
According to the distributed storage scheme of the present invention, in the R1 stage the row indices of up and down in the shared memory are Local_up and Local_down, and the column indices corresponding to levels s and s+1 are s and s+1, respectively, where Local_up = floor(i/2^s)*2^(s+1) + (i % 2^s), Local_down = floor(i/2^s)*2^(s+1) + 2^s + (i % 2^s), and Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_R[Local_up][s+1] = f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_R[Local_up][s]);
Local_R[Local_down][s+1] = Local_R[Local_down][s] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) The index variable of the outer loop is s, s = 0, 1, ..., n-n1-1.
(2) The index variable of the inner loop is i, i = t, t+T, ..., t + floor((2^(n-n1-1)-1-t)/T)*T, where t is the thread index within the thread block.
(3) The binary representation of i contains n-n1-1 bits; the low s bits of i, i.e., i % 2^s, are assigned to the variable i_LSB, and the high n-n1-1-s bits of i, i.e., floor(i/2^s), are assigned to the variable i_MSB.
(4) Compute the shared-memory addresses Local_up and Local_down, where Local_up = (i_MSB << (s+1)) + i_LSB and Local_down = Local_up + (1 << s).
(5) Compute Local_R[Local_up][s+1] and Local_R[Local_down][s+1]: the former is f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_R[Local_up][s]), and the latter is Local_R[Local_down][s] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]).
(6) After the inner loop completes, __syncthreads() is called to synchronize the threads in the thread block.
An example procedure is given in the original as a code-listing image, which is not reproduced here.
Step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory, i.e., the thread-block shared memory is exchanged between the R1 and R2 stages. The details are as follows:
Step 3.5.1: thread No. 0 of each thread block, i.e., thread ((p, b), 0) (p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1), writes Local_R[d2*2^(n1)+d1][n-n1] in the shared memory of thread block (p, b) to graph_info[p].swap[d1*2^(n-n1) + d2*2^(n1) + b] in the global memory, where d1 = 0, 1, ..., 2^(n1)-1; d2 = 0, 1, ..., 2^(n-2n1)-1. There are P*2^(n1) threads numbered ((p, b), 0), p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1. Since the addresses to which the threads write swap do not overlap, these P*2^(n1) threads can execute in parallel.
Step 3.5.2: the 2^(n1) thread blocks of each factor graph perform synchronization between thread blocks. No synchronization is required between thread blocks of different factor graphs.
Step 3.5.3: thread No. 0 of thread block (p, d1), i.e., thread ((p, d1), 0), writes graph_info[p].swap[d1*2^(n-n1) + d2*2^(n1) + b] in the global memory to Local_R[d2*2^(n1)+b][n-n1+1] in the shared memory of thread block (p, d1), where b = 0, 1, ..., 2^(n1)-1; d1 = 0, 1, ..., 2^(n1)-1; d2 = 0, 1, ..., 2^(n-2n1)-1. There are P*2^(n1) threads numbered ((p, d1), 0), p = 0, 1, ..., P-1; d1 = 0, 1, ..., 2^(n1)-1, and these P*2^(n1) threads can execute in parallel.
Step 3.6: the second stage of the right iteration, stage R2, stage number s-n 1.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0, 1, ..., P-1) is handled by the thread blocks numbered (p, 0), ..., (p, 2^(n1)-1). The thread blocks of different factor graphs are independent of each other and can run in parallel.
(2) Parallelism among the thread blocks of the same factor graph: each factor graph is computed by 2^(n1) thread blocks. The set of first-dimension indices of L and R used by the b-th thread block (b = 0, 1, ..., 2^(n1)-1) is R2_Block(b) = { ia*2^(n1) + b : ia = 0, ..., 2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0, 1, ..., 2^(n-n1-1)-1) uses, at level s, the set of first-dimension indices of L and R
R2_Task(b, i, s) = { floor(i/2^(s-n1))*2^(s+1) + a*2^s + (i % 2^(s-n1))*2^(n1) + b : a = 0, 1 };
therefore there is no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group of subtasks is handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the i-th subtask of thread block (p, b) at level s needs to compute:
R(up, s+1) = f(L(down, s+1) + R(down, s), R(up, s))
R(down, s+1) = R(down, s) + f(L(up, s+1), R(up, s))
where
up = floor(i/2^(s-n1))*2^(s+1) + (i % 2^(s-n1))*2^(n1) + b
down = floor(i/2^(s-n1))*2^(s+1) + 2^s + (i % 2^(s-n1))*2^(n1) + b
According to the distributed storage scheme of the present invention, in the R2 stage the row indices of up and down in the shared memory are Local_up and Local_down, and the column indices corresponding to levels s and s+1 are s+1 and s+2, respectively, where Local_up = floor(i'/2^j)*2^(j+1) + (i' % 2^j), Local_down = floor(i'/2^j)*2^(j+1) + 2^j + (i' % 2^j), i' = (i % 2^(n-2n1))*2^(n-2n1) + floor(i/2^(n-2n1)), i' = 0, 1, ..., 2^(n-n1-1)-1, j = s-n+n1, and Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_R[Local_up][s+2] = f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_R[Local_up][s+1]);
Local_R[Local_down][s+2] = Local_R[Local_down][s+1] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) The index variable of the outer loop is s, s = n-n1, n-n1+1, ..., n-1.
(2) Compute j = s - n + n1.
(3) The index variable of the inner loop is i, i = t, t+T, ..., t + floor((2^(n-n1-1)-1-t)/T)*T, where t is the thread index within the thread block.
(4) The binary representation of i contains n-n1-1 bits; the low j bits of i, i.e., i % 2^j, are assigned to the variable i_LSB, and the high n-n1-1-j bits of i, i.e., floor(i/2^j), are assigned to the variable i_MSB.
(5) Compute the shared-memory addresses Local_up and Local_down, where Local_up = (i_MSB << (j+1)) + i_LSB and Local_down = Local_up + (1 << j).
(6) Compute Local_R[Local_up][s+2] and Local_R[Local_down][s+2]: the former is f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_R[Local_up][s+1]), and the latter is Local_R[Local_down][s+1] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]).
(7) After the inner loop completes, __syncthreads() is called to synchronize the threads in the thread block.
An example procedure is given in the original as a code-listing image, which is not reproduced here.
Step 3.7: judge whether the iteration result of each factor graph meets the early-termination condition. If at least one factor graph satisfies the condition, the number p of that factor graph is recorded in the variable p_good (if several factor graphs satisfy the condition, any one of them may be recorded), the loop terminates and execution jumps to step 4. Otherwise, i.e., no factor graph satisfies the condition, judge whether the preset maximum number of loops has been reached: if so, set p_good = 0 (corresponding to the first factor graph), terminate the loop and jump to step 4; if not, continue the loop, i.e., jump back to step 3.1. Various early-termination conditions may be used, such as no improvement between iterations or passing an additional CRC check. The maximum number of loops may be chosen the same as in a serial implementation of Polar code belief propagation decoding, typically between 50 and 200.
Step 4: thread No. 0 of every thread block of the factor graph p_good, i.e., the threads ((p_good, b), 0), where b = 0, 1, ..., N1-1, inversely permutes Local_L[][0] + Local_R[][0] in the shared memory and stores the decoding result in uLLR. There are N1 threads indexed ((p_good, b), 0), and they can execute in parallel.
In the program, this step can be implemented by a recirculation, and the specific flow is as follows:
(1) The loop index variable is d, d = 0, 1, ..., N/N1-1.
(2) The global memory address corresponding to Local_L[d][0] and Local_R[d][0] in the local (shared) memory is b*N1+d, and its index in the original factor graph (i.e., the factor graph before permutation) is graph_info[p_good].InvPerm[b*N1+d]. Local_L[d][0] and Local_R[d][0] are added, and the result is stored in uLLR[graph_info[p_good].InvPerm[b*N1+d]].
An example procedure is as follows:
// Thread ((p_good, b), 0): write back the decoding result for the rows handled by this thread block.
for (d = 0; d < N/N1; d++)
    uLLR[graph_info[p_good].InvPerm[b*N1 + d]] =
        Local_L[d][0] + Local_R[d][0];
and 5: the host transmits the decoded result, i.e., the log-likelihood ratios of the source bits, llr back from the GPU to the host.

Claims (6)

1. A Polar code high-speed parallel decoding method based on a GPU is characterized in that: the whole decoding process is divided into three stages: the method comprises an initialization stage, a decoding stage and a result returning stage, wherein the initialization stage comprises the following steps 1 and 2, the decoding stage comprises the following steps 3 and 4, and the following step 5 is the result returning stage:
step 1: host initialization
It comprises, in order: allocating memory space for the information bit flags, the factor graph permutation and inverse permutation information, the signals received by the receiver and the decoding result, i.e., the log-likelihood ratios of the source bits; initializing information and variables; storing the received signals; and computing the log-likelihood ratios of the coded bits;
step 2: GPU initialization
Sequentially comprises the following steps: distributing global memory of the GPU, sending data to the GPU by the host, starting a parallel decoding thread of the GPU, distributing shared memory by the GPU, initializing the shared memory, and assigning values to an array of the shared memory according to the global memory;
and step 3: the decoding kernel function performs a plurality of loop iterations, and the maximum loop number is preset by the program
Each loop comprises the following steps: the L1 stage, exchanging thread-block shared memory between the L1 and L2 stages, the L2 stage, the R1 stage, exchanging thread-block shared memory between the R1 and R2 stages, the R2 stage, and judging the loop-termination condition: if some factor graph meets the early-termination condition during the loop, or the maximum number of loops has been reached, the variable p_good is set, the loop terminates and execution jumps to step 4;
and 4, step 4: thread No. 0 of all thread blocks of the factor graph p_good, i.e., the threads ((p_good, b), 0), where b = 0, 1, ..., N1-1, N1 being the number of thread blocks responsible for one factor graph and N the code length of the Polar code, inversely permutes Local_L[][0] + Local_R[][0] in the shared memory, and the result is used as the decoding result;
and 5: the host transmits the decoded result from the GPU back to the host.
2. The method of claim 1, wherein in step 2, when the GPU is initialized, the L and R arrays used in the decoding process are stored distributively in the shared memory of each thread block, so that in one complete loop the thread blocks exchange shared-memory data through the global memory only 2 times, and all other operations use the shared memory within the thread blocks.
3. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 1, wherein the allocating of the global memory in step 2 specifically includes: the global memory used by the same factor graph is stored contiguously, and the global memory space used to exchange shared-memory data between thread blocks is stored contiguously in the order in which the thread blocks read it, that is, when each thread block reads from the exchange space, the addresses it reads are contiguous.
4. The method of claim 1, wherein each loop of the loop iteration of step 3 comprises the following steps:
step 3.1: the first stage of the leftward iteration, i.e., the L1 stage, including iterations of levels n-1, ..., n-n1, where n = log2(N) and n1 = log2(N1);
step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory;
step 3.3: the second stage of the leftward iteration, i.e., the L2 stage, including iterations of levels n-n1-1, ..., 0;
step 3.4: the first stage of the rightward iteration, i.e., the R1 stage, including iterations of levels 0, 1, ..., n-n1-1;
step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory;
step 3.6: the second stage of the rightward iteration, i.e., the R2 stage, including iterations of levels n-n1, ..., n-1;
step 3.7: judging whether some factor graph meets the early-termination condition or the maximum number of loops has been reached, and setting the variable p_good.
5. The method of claim 4, wherein the L1, L2, R1 and R2 stages in step 3 each comprise three levels of parallelism:
the first level is parallelism among multiple factor graphs, each factor graph being handled by N1 thread blocks; the thread blocks of each factor graph are independent, and the thread blocks of different factor graphs run in parallel;
the second level is parallelism among the thread blocks of the same factor graph; each factor graph is computed by N1 thread blocks, different thread blocks have no data dependency, and they run in parallel;
the third level is multi-thread parallelism within the same thread block; the computation of each thread block at each level is divided into N/(2*N1) subtasks with no data dependency between them, these subtasks are divided into min(T, N/(2*N1)) groups, where T is the number of cores contained in each streaming multiprocessor on the GPU, each group of subtasks is handled by one thread of the thread block, and all threads execute in parallel; after each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
6. The method of claim 5, wherein the work division for the parallelism among the thread blocks of the same factor graph in the second level and the multi-thread parallelism within the same thread block in the third level is as follows:
(1) in the L1 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the L1 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the L1 stage;
(2) in the L2 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the L2 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the L2 stage;
(3) in the R1 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the R1 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the R1 stage;
(4) in the R2 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the R2 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the R2 stage.
CN202010629868.3A 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU Active CN111966405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010629868.3A CN111966405B (en) 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010629868.3A CN111966405B (en) 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU

Publications (2)

Publication Number Publication Date
CN111966405A CN111966405A (en) 2020-11-20
CN111966405B true CN111966405B (en) 2022-07-26

Family

ID=73361314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010629868.3A Active CN111966405B (en) 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU

Country Status (1)

Country Link
CN (1) CN111966405B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113014270B (en) * 2021-02-22 2022-08-05 上海大学 Partially folded polarization code decoder with configurable code length

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814039A (en) * 2010-02-02 2010-08-25 北京航空航天大学 GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
CN105843590A (en) * 2016-04-08 2016-08-10 深圳航天科技创新研究院 Parallel pre-decoding method and system for instruction sets
CN108462495A (en) * 2018-04-03 2018-08-28 北京航空航天大学 A kind of multielement LDPC code high-speed parallel decoder and its interpretation method based on GPU
CN111026444A (en) * 2019-11-21 2020-04-17 中国航空工业集团公司西安航空计算技术研究所 GPU parallel array SIMT instruction processing model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9362956B2 (en) * 2013-01-23 2016-06-07 Samsung Electronics Co., Ltd. Method and system for encoding and decoding data using concatenated polar codes
CN107506828B (en) * 2016-01-20 2020-11-03 中科寒武纪科技股份有限公司 Artificial neural network computing device and method for sparse connection
GB2567507B (en) * 2018-04-05 2019-10-02 Imagination Tech Ltd Texture filtering with dynamic scheduling

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814039A (en) * 2010-02-02 2010-08-25 北京航空航天大学 GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
CN105843590A (en) * 2016-04-08 2016-08-10 深圳航天科技创新研究院 Parallel pre-decoding method and system for instruction sets
CN108462495A (en) * 2018-04-03 2018-08-28 北京航空航天大学 A kind of multielement LDPC code high-speed parallel decoder and its interpretation method based on GPU
CN111026444A (en) * 2019-11-21 2020-04-17 中国航空工业集团公司西安航空计算技术研究所 GPU parallel array SIMT instruction processing model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and performance analysis of a parallel concatenated Polar code structure; Pan Xiaofei (潘小飞); Communication Technology (《通信技术》); 2016-02-10; full text *

Also Published As

Publication number Publication date
CN111966405A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN111213125B (en) Efficient direct convolution using SIMD instructions
TWI406176B (en) Preparing instruction groups for a processor having multiple issue ports
CN107622020B (en) Data storage method, access method and device
TWI442222B (en) Flash memory device and method for managing a flash memory device
US20120151182A1 (en) Performing Function Calls Using Single Instruction Multiple Data (SIMD) Registers
CN105049061A (en) Advanced calculation-based high-dimensional polarization code decoder and polarization code decoding method
KR20180021850A (en) Mapping an instruction block to an instruction window based on block size
US7895417B2 (en) Select-and-insert instruction within data processing systems
US7428630B2 (en) Processor adapted to receive different instruction sets
CN111966405B (en) Polar code high-speed parallel decoding method based on GPU
US8539462B2 (en) Method for allocating registers for a processor based on cycle information
CN111860805B (en) Fractal calculation device and method, integrated circuit and board card
WO2020181670A1 (en) Control flow optimization in graphics processing unit
US11429299B2 (en) System and method for managing conversion of low-locality data into high-locality data
Qi et al. Implementation of accelerated BCH decoders on GPU
CN116158029A (en) Polarization code decoder and method for polarization code decoding
US20150205606A1 (en) Tree-based thread management
US8745339B2 (en) Multi-core system and method for processing data in parallel in multi-core system
CN111966404B (en) GPU-based regular sparse code division multiple access SCMA high-speed parallel decoding method
Fuentes-Alventosa et al. Cuvle: Variable-length encoding on cuda
US9672042B2 (en) Processing system and method of instruction set encoding space utilization
CN107861834A (en) A kind of method based on wrong pre-detection skill upgrading solid state hard disc reading performance
WO2022208173A2 (en) Vectorizing a loop
US20210042111A1 (en) Efficient encoding of high fanout communications
CN117934258A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant