CN111966405A - Polar code high-speed parallel decoding method based on GPU - Google Patents

Polar code high-speed parallel decoding method based on GPU

Info

Publication number
CN111966405A
CN111966405A (application CN202010629868.3A)
Authority
CN
China
Prior art keywords
stage
thread
parallel
local
blocks
Prior art date
Legal status
Granted
Application number
CN202010629868.3A
Other languages
Chinese (zh)
Other versions
CN111966405B (en)
Inventor
Li Shu (李舒)
Current Assignee
Hangzhou Innovation Research Institute of Beihang University
Original Assignee
Hangzhou Innovation Research Institute of Beihang University
Priority date
Filing date
Publication date
Application filed by Hangzhou Innovation Research Institute of Beihang University
Priority to CN202010629868.3A
Publication of CN111966405A
Application granted
Publication of CN111966405B
Active
Anticipated expiration

Classifications

    • G06F9/38 — Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/5016 — Allocation of resources to service a request, the resource being the memory
    • G06F9/544 — Interprogram communication: buffers, shared memory, pipes
    • H03M13/13 — Error detection or forward error correction by redundancy in data representation using block codes; linear codes
    • H04L1/0057 — Arrangements for detecting or preventing errors in the information received using forward error control; block codes


Abstract

The invention discloses a GPU-based Polar code high-speed parallel decoding method in which the whole decoding process is divided into three stages: an initialization stage, a decoding stage and a result-returning stage. Specifically: step 1: the host is initialized; step 2: the GPU is initialized; step 3: the decoding kernel function performs a number of loop iterations, the maximum number of loops being preset by the program; step 4: thread 0 of every thread block of the factor graph p_good computes Local_L[][0] + Local_R[][0] in the shared memory and, after inverse permutation, obtains the decoding result; step 5: the host transfers the decoding result from the GPU back to the host. The method contains three levels of parallelism, namely parallelism among multiple factor graphs, among multiple thread blocks, and among multiple threads. In addition, the method minimizes the kernel-launch overhead and improves memory-access efficiency and running speed.

Description

Polar code high-speed parallel decoding method based on GPU
Technical Field
The invention belongs to the technical field of communication, and relates to a Polar code high-speed parallel decoding method based on a Graphics Processing Unit (GPU).
Background
Polar codes, proposed by Erdal Arikan in 2008 (reference [1]: Erdal Arikan, "Channel Polarization: A Method for Constructing Capacity-Achieving Codes", IEEE ISIT 2008), are currently the only channel codes that can be rigorously proven to achieve the Shannon limit. Polar codes have been officially adopted by the 5G standardization organization. Decoding methods for Polar codes can be divided into two types: methods based on successive cancellation and methods based on belief propagation. The successive-cancellation-based methods require little computation, but the algorithm is inherently serial, so the decoding delay is large. For the belief-propagation-based methods, in order to guarantee the error-correction performance of Polar code decoding, a belief-propagation list algorithm, i.e. an iterative algorithm based on multiple permuted factor graphs, is generally adopted; this decoding method requires a large amount of computation, but the belief-propagation list algorithm has the potential for parallel implementation.
On the other hand, GPU technology has developed rapidly in recent years; a commercial-grade GPU card can have over 4000 cores for parallel processing, which provides a cost-effective hardware basis for parallel computing.
Disclosure of Invention
The invention aims to provide a GPU-based Polar code high-speed parallel decoding method to realize low-delay and high-throughput decoding.
The invention provides a Polar code high-speed parallel decoding method based on a GPU. The method contains three levels of parallelism and can fully utilize the core resources of the GPU. The invention also designs an efficient distributed storage method, which improves memory-access efficiency and running speed. The whole decoding process can be divided into three stages: an initialization stage, a decoding stage and a result-returning stage. The initialization stage comprises the following Steps 1 and 2, the decoding stage comprises the following Steps 3 and 4, and the following Step 5 is the result-returning stage.
Step 1: host initialization. This comprises, in order: allocating memory space for the information bit flags, the factor graph permutation and inverse permutation information, the signals received by the receiver, and the decoding result, i.e. the log-likelihood ratios of the source bits; initializing information and variables; storing the received signals and computing the log-likelihood ratios of the coded bits.
Step 2: GPU initialization. This comprises, in order: allocating the GPU global memory; the host sending data to the GPU; starting the GPU parallel decoding threads; the GPU allocating shared memory; initializing the shared memory; and assigning values to the shared-memory arrays from the global memory.
Step 3: the decoding kernel function performs a number of loop iterations; the maximum number of loops is preset by the program. Each loop comprises: the L1 stage, exchanging thread-block shared memory between the L1 and L2 stages, the L2 stage, the R1 stage, exchanging thread-block shared memory between the R1 and R2 stages, the R2 stage, and checking the loop-termination condition. If a factor graph satisfies the early-termination condition during the loop, or the maximum number of loops has been reached, the variable p_good is set, the loop terminates and execution jumps to Step 4.
Step 4: for thread 0 of every thread block of the factor graph p_good, i.e. thread ((p_good, b), 0), where b = 0,1,...,N1-1, N1 is the number of thread blocks assigned to each factor graph and N is the code length of the Polar code, Local_L[][0] + Local_R[][0] in the shared memory, after inverse permutation, is taken as the decoding result.
Step 5: the host transfers the decoding result from the GPU back to the host.
Each loop of the Step-3 loop iteration comprises the following steps:
step 3.1: the first stage of the leftward iteration, the L1 stage, comprises the level-(n-1), ..., level-(n-n1) iterations, where n = log2(N) and n1 = log2(N1);
step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory;
step 3.3: the second stage of the leftward iteration, the L2 stage, comprises the level-(n-n1-1), ..., level-0 iterations;
step 3.4: the first stage of the rightward iteration, the R1 stage, comprises the level-0, ..., level-(n-n1-1) iterations;
step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory;
step 3.6: the second stage of the rightward iteration, the R2 stage, comprises the level-(n-n1), ..., level-(n-1) iterations;
step 3.7: determining whether a factor graph satisfies the early-termination condition or the maximum number of loops has been reached, and setting the variable p_good.
Wherein, the stages L1, L2, R1 and R2 in step 3 each include three levels of parallelism as follows:
the first level is parallelism between multiple factor graphs, each factor graph being responsible for N1 thread blocks. Because different factor graphs have no data dependency in the iterative process, the thread blocks of different factor graphs can naturally run in parallel.
The second level is parallelism among the multiple thread blocks of the same factor graph. When the code length N of the Polar code is large, the core resources and shared-memory resources of a single streaming multiprocessor on the GPU cannot support fully parallelizing one factor graph. For this reason, in the present invention, N1 thread blocks are responsible for the iteration of one factor graph. The invention divides the leftward propagation and the rightward propagation of each iteration into two stages each, i.e. four stages in total. Leftward propagation comprises the iterations of levels n-1, ..., 0 and is split into two stages: the first stage, from level n-1 to level n-n1, is called the L1 stage; the second stage, from level n-n1-1 to level 0, is called the L2 stage. Rightward propagation proceeds in the opposite direction and comprises the iterations of levels 0, 1, ..., n-1, also split into two stages: the first stage, from level 0 to level n-n1-1, is called the R1 stage; the second stage, from level n-n1 to level n-1, is called the R2 stage.
The third level is multi-thread parallelism within the same thread block. The computation of each thread block at each level can be divided into N/N1/2 = 2^(n-n1-1) subtasks with no data dependency between them; these subtasks are divided into min(T, 2^(n-n1-1)) groups, each group being handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
The details of the work division between the second level (parallelism among the thread blocks of the same factor graph) and the third level (multi-thread parallelism within one thread block, i.e. the multiple subtasks) are as follows:
(1) In the L1 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the L1 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the L1 stage the multiple threads of the same thread block execute in parallel;
(2) In the L2 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the L2 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the L2 stage the multiple threads of the same thread block execute in parallel;
(3) In the R1 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the R1 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the R1 stage the multiple threads of the same thread block execute in parallel;
(4) In the R2 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the R2 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the R2 stage the multiple threads of the same thread block execute in parallel.
In Step 2, when the GPU is initialized, the L and R arrays used in the decoding process are stored distributively in the shared memory of each thread block; that is, in one complete loop the thread blocks need to exchange shared-memory data through the global memory only 2 times, and all other operations can use the shared memory within the thread blocks. The details are as follows:
the storage space on the GPU mainly includes a global memory and a shared memory within a thread block (referred to as an intra-block shared memory or a shared memory). The access speed of the shared memory is high, but the capacity is relatively small; the global memory space is large, and the access speed is low.
The main data of Polar code belief-propagation decoding are the matrices L and R, both of size N × (n+1). When the code length N is large, the memory space required for L and R is large. To improve memory-access speed, the invention partitions L and R during the computation and stores them, in blocks, in the two-dimensional arrays Local_L and Local_R in the shared memory of the N1 thread blocks, where the first and second dimensions of Local_L are N/N1 and n+2 respectively, and Local_R has the same size as Local_L. The matrices L and R are stored distributively in the shared memory of the thread blocks as follows:
(1) For 0 <= j <= n-n1, L_{b*(N/N1)+d2*N1+d1, j} and R_{b*(N/N1)+d2*N1+d1, j} are stored in the shared memory of the b-th thread block as Local_L[d2*N1+d1][j] and Local_R[d2*N1+d1][j], where b = 0,1,...,N1-1; d2 = 0,1,...,N/(N1*N1)-1; d1 = 0,1,...,N1-1;
(2) For n-n1 <= j <= n, L_{b*(N/N1)+d2*N1+d1, j} and R_{b*(N/N1)+d2*N1+d1, j} are stored in the shared memory of the d1-th thread block as Local_L[d2*N1+b][j+1] and Local_R[d2*N1+b][j+1], where b = 0,1,...,N1-1; d2 = 0,1,...,N/(N1*N1)-1; d1 = 0,1,...,N1-1.
According to this distributed storage method, at every level s = n-1, ..., n-n1 of the L1 stage the scheme by which L and R are distributed over the thread blocks is the same, so in every level iteration of the L1 stage each thread block only needs to use the shared memory within the block, and no data needs to be exchanged between thread blocks. Similarly, the iterations within the L2, R1 and R2 stages also do not require data exchange between thread blocks. The distributed storage schemes of the L1 and L2 stages differ, so data must be exchanged between thread blocks through the global memory between the L1 stage and the L2 stage; the distributed storage schemes of the L2 and R1 stages are the same, so no data exchange through the global memory is needed between the L2 stage and the R1 stage. Similarly, data exchange through the global memory is needed between the end of the R1 stage and the start of the R2 stage, and no such exchange is needed between the end of the R2 stage and the start of the L1 stage of the next iteration. Therefore, in one complete loop (comprising 4 stages and 2n levels of iteration), the thread blocks need to exchange shared-memory data through the global memory only 2 times, and all other operations can use the shared memory within the thread blocks.
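For illustration only, the mapping between a row/column of the full matrices L, R and its location in the distributed shared-memory storage described above can be written as the following helper functions. This is a minimal CUDA C++ sketch; the struct and function names are not taken from the patent.

// Hypothetical helpers describing the distributed-storage mapping of L/R.
// Scheme 1 is item (1) above (columns 0..n-n1, used by the L2/R1 stages);
// scheme 2 is item (2) (columns n-n1..n, used by the L1/R2 stages).
// N is the code length and N1 the number of thread blocks per factor graph.
struct Placement { int block; int local_row; int local_col; };

__host__ __device__ Placement map_scheme1(int row, int j, int N, int N1) {
    Placement p;
    p.block     = row / (N / N1);     // b
    p.local_row = row % (N / N1);     // d2*N1 + d1
    p.local_col = j;                  // Local_L[d2*N1+d1][j]
    return p;
}

__host__ __device__ Placement map_scheme2(int row, int j, int N, int N1) {
    Placement p;
    int b  = row / (N / N1);
    int r  = row % (N / N1);
    int d2 = r / N1;
    int d1 = r % N1;
    p.block     = d1;                 // stored in the d1-th thread block
    p.local_row = d2 * N1 + b;        // Local_L[d2*N1+b][j+1]
    p.local_col = j + 1;
    return p;
}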
In order to improve the locality of access, the GPU global memory allocation described in step 2 is optimized as follows:
(1) The global memory includes data common to all factor graphs and data private to each factor graph. The global memory used by each factor graph is stored contiguously. The global memory used by the p-th factor graph is a structure graph_info[p], whose members include: the factor graph permutation array, the inverse permutation array, and the global memory space used by the thread blocks to exchange shared memory.
(2) graph_info[p].swap, used for exchanging shared memory between thread blocks, is stored contiguously in the order in which the thread blocks read it, i.e. when each thread block reads from graph_info[p].swap, the addresses it reads are contiguous.
The invention has the advantages and positive effects that:
(1) The parallel decoding of the invention contains three levels of parallelism. The first level is parallelism among multiple factor graphs, each factor graph being handled by several thread blocks; because different factor graphs have no data dependency during the iterations, the thread blocks of different factor graphs naturally run in parallel. The second level is parallelism among the multiple thread blocks of the same factor graph: in the invention the iteration of one factor graph is handled by N1 thread blocks, and the work division designed by the invention ensures that there is no data dependency between thread blocks within the same stage (i.e. L1, L2, R1 and R2), so the N1 thread blocks can execute in parallel. The third level is multi-thread parallelism within the same thread block: the invention divides the iteration of one thread block at each level into N/(2*N1) mutually independent subtasks, distributes them to the multiple threads of the thread block for parallel execution, and performs intra-block synchronization after every thread has finished the subtasks it is responsible for at that level. The method can fully utilize the parallel core resources of the GPU and has high computational efficiency. The synchronization overhead of the invention is small: in one complete loop (i.e. 4 stages and 2n levels of iteration), synchronization between thread blocks is needed only 2 times, and the remaining 2n-2 levels all use the intra-block synchronization mechanism. In addition, the whole decoding process on the GPU uses a single kernel function, which minimizes the kernel-launch overhead.
(2) The invention stores the main data of the iteration distributively in the shared memory of each thread block, which improves memory-access efficiency and running speed. In one complete loop (i.e. 4 stages and 2n levels of iteration), the thread blocks need to exchange shared-memory data through the global memory only 2 times, and all other operations can use the shared memory within the thread blocks. In addition, in the global memory used by the invention, the private data of the same factor graph are stored contiguously, and the global memory space used for exchanging shared memory between thread blocks is stored contiguously in the order in which the thread blocks read it, which optimizes storage locality and improves memory-access efficiency and running speed.
Drawings
FIG. 1 is a flowchart of a GPU-based Polar code high-speed parallel decoding method according to the invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
The invention provides a GPU-based Polar code high-speed parallel decoding method, which adopts a decoding algorithm based on a belief-propagation list. The decoding method comprises three stages: an initialization stage, a decoding stage and a result-returning stage. The initialization stage includes the following Steps 1 and 2, the decoding stage includes the following Steps 3 and 4, and the following Step 5 is the result-returning stage. The whole decoding process is shown in FIG. 1 and specifically includes the following steps:
Step 1: host initialization. This comprises, in order: allocating memory space for the information bit flags, the factor graph permutation and inverse permutation information, the signals received by the receiver, and the decoding result, i.e. the log-likelihood ratios of the source bits (step 1.1); initializing information and variables (step 1.2); storing the received signals and computing the log-likelihood ratios of the coded bits (step 1.3). The details are as follows:
Step 1.1: allocate host memory space, in order: the information bit flag array InfoBitFlags, the signal array y received by the receiver, and the coded-bit log-likelihood ratio array cLLR.
Step 1.2: the host initializes the information bit flag array InfoBitFlags and the factor graph permutation array Perm, and computes the inverse permutation array InvPerm from the factor graph permutation array.
Step 1.3: the host stores the signal received by the receiver in the array y and the signal-to-noise ratio of the channel in the variable SNR. The host computes the coded-bit log-likelihood ratio array cLLR = y × SNR from the received signal and the signal-to-noise ratio.
Step 2: GPU initialization. This comprises, in order: allocating the GPU global memory; the host sending data to the GPU; starting the GPU parallel decoding threads; the GPU allocating shared memory; initializing the shared memory; and assigning values to the shared-memory arrays from the global memory. The specific procedure is as follows:
Step 2.1: allocate the global memory on the GPU. It includes data common to all factor graphs and data private to each factor graph. The data common to all factor graphs comprise the coded-bit log-likelihood ratio array cLLR, the information bit flag array InfoBitFlags and the decoding result, i.e. the source-bit log-likelihood ratio array uLLR. The private data of each factor graph are stored contiguously; the private data of the p-th factor graph are stored in the structure graph_info[p], whose members include: the factor graph permutation array Perm, the inverse permutation array InvPerm, and the global memory space swap used by the thread blocks to exchange shared memory.
Step 2.2: send the information bit flag array, the factor graph permutation array, the inverse permutation array and the coded-bit log-likelihood ratio array from the host memory to the GPU; the factor graph permutation and inverse permutation arrays are factor-graph-private data and are stored in the structure graph_info[p].
Step 2.3: the host launches the GPU parallel decoding threads. The number of thread blocks is P × N1 and each thread block contains T threads, where T equals the number of cores in each streaming multiprocessor. All threads execute the same decoding kernel function and are distinguished by their indices; the index of each thread is ((p, b), t), where (p, b) is the thread block index and t is the thread index within the block.
Step 2.4: the GPU threads allocate shared memory within each thread block, comprising the two-dimensional arrays Local_L[N/N1][n+2] and Local_R[N/N1][n+2]; all elements are initialized to 0.
Step 2.5: thread 0 of each thread block on the GPU, i.e. thread ((p, b), 0) (p = 0,1,...,P-1; b = 0,1,...,2^n1-1), assigns values to Local_L[][n+1] and Local_R[][0] according to the information bit flags and the coded-bit log-likelihood ratios.
In the program, this step can be implemented with a single loop; the specific flow is as follows:
(1) The loop index variable is d, d = 0,1,...,N/N1-1.
(2) Compute dd = (d % N1) * N1 + (d / N1).
(3) The row index in the full matrices corresponding to Local_L[d][n+1] in the local memory is dd*N1+b, and its index in the original factor graph (i.e. the factor graph before permutation) is graph_info[p].InvPerm[dd*N1+b]; therefore the value cLLR[graph_info[p].InvPerm[dd*N1+b]] is assigned to Local_L[d][n+1].
(4) The row index in the full matrices corresponding to Local_R[d][0] in the local memory is b*N1+d, and its index in the original factor graph (i.e. the factor graph before permutation) is graph_info[p].InvPerm[b*N1+d]. If the information bit flag InfoBitFlags[graph_info[p].InvPerm[b*N1+d]] is 0, Local_R[d][0] is set to 1e+30.
An example procedure is as follows:
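The listing referred to above is published only as an image; the following is a minimal CUDA-style sketch of the flow in items (1)-(4) of Step 2.5. The flattened shared-array layout (row stride n+2), the helper name init_shared_llr and the exact signature are assumptions for illustration, not taken from the patent.

// Minimal sketch of Step 2.5 as a device-side helper, executed per block (p, b).
// Local_L/Local_R are the block's shared arrays in flattened form; InvPerm
// belongs to graph_info[p].
__device__ void init_shared_llr(float* Local_L, float* Local_R,
                                const float* cLLR, const int* InfoBitFlags,
                                const int* InvPerm,
                                int N, int N1, int n, int b) {
    if (threadIdx.x == 0) {
        for (int d = 0; d < N / N1; d++) {
            int dd = (d % N1) * N1 + (d / N1);
            // column n+1 holds the channel LLRs (distributed-storage scheme (2))
            Local_L[d * (n + 2) + (n + 1)] = cLLR[InvPerm[dd * N1 + b]];
            // column 0 holds the prior: frozen bits get a very large LLR
            if (InfoBitFlags[InvPerm[b * N1 + d]] == 0)
                Local_R[d * (n + 2) + 0] = 1e30f;
        }
    }
    __syncthreads();
}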
the index is ((P, b),0), P ═ 0,1,. ang., P-1; b is 0,1, 2n1-1 threads share P2n1P2 ofn1Threads may be executed in parallel.
Step 3: the decoding kernel function performs a number of loop iterations; the maximum number of loops is preset by the program. Each loop comprises the L1 stage (step 3.1), exchanging thread-block shared memory between the L1 and L2 stages (step 3.2), the L2 stage (step 3.3), the R1 stage (step 3.4), exchanging thread-block shared memory between the R1 and R2 stages (step 3.5), the R2 stage (step 3.6), and determining whether a factor graph satisfies the early-termination condition or the maximum number of loops has been reached and setting the variable p_good (step 3.7).
Step 3.1: the first stage of the leftward iteration, the L1 stage, with level numbers s = n-1, ..., n-n1.
Each level contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0,1,...,P-1) is handled by the thread blocks indexed (p,0), ..., (p,2^n1-1). The thread blocks of different factor graphs are independent and can run in parallel;
(2) Parallelism among the multiple thread blocks of the same factor graph: each factor graph is computed by 2^n1 thread blocks, and the b-th thread block (b = 0,1,...,2^n1-1) uses the first-dimension index set of L and R given by L1_Block(b) = { ia*2^n1 + b : ia = 0,...,2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0,1,...,2^(n-n1-1)-1) uses, at level s, the first-dimension index set of L and R
L1_Task(b,i,s) = { floor(i/2^(s-n1))*2^(s+1) + a*2^s + (i%2^(s-n1))*2^n1 + b : a = 0,1 };
there is therefore no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group being handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the ith subtask of a thread block (p, b) at the s-th level requires the computation of:
L_{up,s} = f(L_{down,s+1} + R_{down,s}, L_{up,s+1})
L_{down,s} = L_{down,s+1} + f(L_{up,s+1}, R_{up,s})
where
up = floor(i/2^(s-n1))*2^(s+1) + (i%2^(s-n1))*2^n1 + b
down = floor(i/2^(s-n1))*2^(s+1) + 2^s + (i%2^(s-n1))*2^n1 + b
According to the distributed storage scheme of the invention, in the L1 stage the shared-memory indices of rows up and down are Local_up and Local_down respectively, and the shared-memory column indices of levels s and s+1 are s+1 and s+2 respectively, where Local_up = floor(i'/2^j)*2^(j+1) + (i'%2^j), Local_down = floor(i'/2^j)*2^(j+1) + 2^j + (i'%2^j), i' = (i%2^(n-2n1))*2^(n-2n1) + floor(i/2^(n-2n1)), i' = 0,1,...,2^(n-n1-1)-1, j = s-n+n1; Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_L[Local_up][s+1] = f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_L[Local_up][s+2]);
Local_L[Local_down][s+1] = Local_L[Local_down][s+2] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]);
the function f is defined as: f (x, y) ═ 2tanh-1(tanh (x/2) tanh (y/2)), in practical calculations, it is usually approximated by f (x, y) ≈ 0.9375sgn (x) sgn (y) min (| x |, | y |).
In the program, this step can be implemented with two nested loops; the specific flow is as follows:
(1) The loop index variable of the outer loop is s, s = n-1, ..., n-n1. (Note that in this loop the index variable s is decremented by 1 each time.)
(2) Compute j = s - n + n1.
(3) The loop index variable of the inner loop is i, i = t, t+T, ..., t + floor((2^(n-n1-1)-1-t)/T)*T, where t is the thread index within the thread block.
(4) The binary representation of i has n-n1-1 bits. The lower j bits of i, i.e. i % 2^j, are assigned to the variable i_LSB; the upper n-n1-1-j bits of i, i.e. floor(i/2^j), are assigned to the variable i_MSB.
(5) The shared-memory addresses Local_up and Local_down are computed, where Local_up = (i_MSB << (j+1)) + i_LSB and Local_down = Local_up + (1 << j).
(6) Compute Local_L[Local_up][s+1] and Local_L[Local_down][s+1]: the former is f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_L[Local_up][s+2]), and the latter is Local_L[Local_down][s+2] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]).
(7) After the inner loop completes, __syncthreads() is called to synchronize the threads within the thread block.
An example procedure is as follows:
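The original listing is available only as an image; below is a minimal CUDA-style sketch of the two nested loops of the L1 stage described in items (1)-(7) above, reusing the flattened shared-array layout and the f_minsum helper assumed in the earlier sketches. The function name stage_L1 is illustrative.

// Sketch of the L1 stage (Step 3.1) for thread block (p, b); Local_L/Local_R
// are the block's shared arrays, flattened with row stride n+2.
__device__ void stage_L1(float* Local_L, float* Local_R, int n, int n1) {
    for (int s = n - 1; s >= n - n1; s--) {        // outer loop over levels
        int j = s - n + n1;
        int numTasks = 1 << (n - n1 - 1);          // 2^(n-n1-1) subtasks
        for (int i = threadIdx.x; i < numTasks; i += blockDim.x) {
            int i_LSB = i & ((1 << j) - 1);        // lower j bits of i
            int i_MSB = i >> j;                    // upper bits of i
            int up   = (i_MSB << (j + 1)) + i_LSB; // Local_up
            int down = up + (1 << j);              // Local_down
            float Lu2 = Local_L[up   * (n + 2) + s + 2];
            float Ld2 = Local_L[down * (n + 2) + s + 2];
            float Ru1 = Local_R[up   * (n + 2) + s + 1];
            float Rd1 = Local_R[down * (n + 2) + s + 1];
            Local_L[up   * (n + 2) + s + 1] = f_minsum(Ld2 + Rd1, Lu2);
            Local_L[down * (n + 2) + s + 1] = Ld2 + f_minsum(Lu2, Ru1);
        }
        __syncthreads();                           // intra-block synchronization
    }
}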
Step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory, i.e. the thread-block shared-memory exchange between the L1 and L2 stages. The specific steps are as follows:
Step 3.2.1: thread 0 of each thread block, i.e. thread ((p, b), 0) (p = 0,1,...,P-1; b = 0,1,...,2^n1-1), writes Local_L[d2*2^n1+d1][n-n1+1] of the shared memory of thread block (p, b) to graph_info[p].swap[d1*2^(n-n1)+d2*2^n1+b] in the global memory, where d1 = 0,1,...,2^n1-1; d2 = 0,1,...,2^(n-2n1)-1. There are P*2^n1 threads with index ((p, b), 0), p = 0,1,...,P-1; b = 0,1,...,2^n1-1; since the addresses written by different threads do not overlap, these P*2^n1 threads can execute in parallel.
Step 3.2.2: the 2^n1 thread blocks of each factor graph synchronize with one another. No synchronization is required between thread blocks of different factor graphs.
Step 3.2.3: thread 0 of thread block (p, d1), i.e. thread ((p, d1), 0), reads graph_info[p].swap[d1*2^(n-n1)+d2*2^n1+b] from the global memory and writes it to Local_L[d2*2^n1+b][n-n1] in the shared memory of thread block (p, d1), where b = 0,1,...,2^n1-1; d1 = 0,1,...,2^n1-1; d2 = 0,1,...,2^(n-2n1)-1. There are P*2^n1 threads with index ((p, d1), 0), p = 0,1,...,P-1; d1 = 0,1,...,2^n1-1, and they can execute in parallel.
Step 3.3: the second stage of the leftward iteration, the L2 stage, with level numbers s = n-n1-1, ..., 0.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0,1,...,P-1) is handled by the thread blocks indexed (p,0), ..., (p,2^n1-1). The thread blocks of different factor graphs are independent and can run in parallel;
(2) Parallelism among the multiple thread blocks of the same factor graph: each factor graph is computed by 2^n1 thread blocks, and the b-th thread block (b = 0,1,...,2^n1-1) uses the first-dimension index set of L and R given by L2_Block(b) = { b*2^(n-n1) + ia : ia = 0,...,2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0,1,...,2^(n-n1-1)-1) uses, at level s, the first-dimension index set of L and R
L2_Task(b,i,s) = { b*2^(n-n1) + floor(i/2^s)*2^(s+1) + a*2^s + (i%2^s) : a = 0,1 };
there is therefore no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group being handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the ith subtask of a thread block (p, b) at the s-th level requires the computation of:
L_{up,s} = f(L_{down,s+1} + R_{down,s}, L_{up,s+1})
L_{down,s} = L_{down,s+1} + f(L_{up,s+1}, R_{up,s})
where
up = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + (i%2^s)
down = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + 2^s + (i%2^s)
According to the distributed storage scheme of the invention, in the L2 stage the shared-memory indices of rows up and down are Local_up and Local_down respectively, and the shared-memory column indices of levels s and s+1 are s and s+1 respectively, where Local_up = floor(i/2^s)*2^(s+1) + (i%2^s), Local_down = floor(i/2^s)*2^(s+1) + 2^s + (i%2^s); Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_L[Local_up][s] = f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_L[Local_up][s+1]);
Local_L[Local_down][s] = Local_L[Local_down][s+1] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) the loop index variable for the first iteration is s, s-n 1, n-n1-1. (Note that in this loop, the loop index variable s is decremented by 1 each time.)
(2) The loop index variable for the second repeat loop is i, i ═ T, T + Tn-n1-1-1-t)/T)*T。
(3) The binary representation of i comprises n-n1-1 bits, i% being the lower s bits of i (2)s) Assigning variable i _ LSB; and the high (n-n1-1-s) bit of i, i.e. floor (i/(2)j) Is assigned to the variable i _ MSB.
(4) The shared memory addresses, Local _ up and Local _ down, are calculated, where Local _ up ═ (i _ MSB < (s +1)) + i _ LSB, and Local _ down ═ Local _ up + (1< < s).
(5) Calculating Local _ L [ Local _ up ] [ s ] and Local _ L [ Local _ down ] [ s ], wherein the former has a value of f (Local _ L [ Local _ down ] [ s +1] + Local _ R [ Local _ down ] [ s ], Local _ L [ Local _ up ] [ s +1]), and the latter has a value of Local _ L [ Local _ down ] [ s +1] -, L [ Local _ down ] [ Local _ s +1] +
f(L[Local_up][s+1],Local_R[Local_up][s])。
(6) After the second recirculation is completed, __ synchreads () is called to synchronize the threads within the thread block.
An example procedure is as follows:
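The original listing is available only as an image; below is a minimal CUDA-style sketch of the L2 stage loops described in items (1)-(6) above, under the same assumptions (flattened shared arrays, f_minsum helper) as the earlier sketches. The function name stage_L2 is illustrative.

// Sketch of the L2 stage (Step 3.3) for thread block (p, b).
__device__ void stage_L2(float* Local_L, float* Local_R, int n, int n1) {
    for (int s = n - n1 - 1; s >= 0; s--) {
        int numTasks = 1 << (n - n1 - 1);
        for (int i = threadIdx.x; i < numTasks; i += blockDim.x) {
            int i_LSB = i & ((1 << s) - 1);
            int i_MSB = i >> s;
            int up   = (i_MSB << (s + 1)) + i_LSB;
            int down = up + (1 << s);
            float Lu1 = Local_L[up   * (n + 2) + s + 1];
            float Ld1 = Local_L[down * (n + 2) + s + 1];
            float Ru0 = Local_R[up   * (n + 2) + s];
            float Rd0 = Local_R[down * (n + 2) + s];
            Local_L[up   * (n + 2) + s] = f_minsum(Ld1 + Rd0, Lu1);
            Local_L[down * (n + 2) + s] = Ld1 + f_minsum(Lu1, Ru0);
        }
        __syncthreads();
    }
}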
Step 3.4: the first stage of the rightward iteration, the R1 stage, with level numbers s = 0, ..., n-n1-1.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0,1,...,P-1) is handled by the thread blocks indexed (p,0), ..., (p,2^n1-1). The thread blocks of different factor graphs are independent and can run in parallel;
(2) Parallelism among the multiple thread blocks of the same factor graph: each factor graph is computed by 2^n1 thread blocks, and the b-th thread block (b = 0,1,...,2^n1-1) uses the first-dimension index set of L and R given by R1_Block(b) = { b*2^(n-n1) + ia : ia = 0,...,2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0,1,...,2^(n-n1-1)-1) uses, at level s, the first-dimension index set of L and R
R1_Task(b,i,s) = { b*2^(n-n1) + floor(i/2^s)*2^(s+1) + a*2^s + (i%2^s) : a = 0,1 };
there is therefore no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group being handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the ith subtask of a thread block (p, b) at the s-th level requires the computation of:
R_{up,s+1} = f(L_{down,s+1} + R_{down,s}, R_{up,s})
R_{down,s+1} = R_{down,s} + f(L_{up,s+1}, R_{up,s})
where
up = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + (i%2^s)
down = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + 2^s + (i%2^s)
According to the distributed storage scheme of the invention, in the R1 stage the shared-memory indices of rows up and down are Local_up and Local_down respectively, and the shared-memory column indices of levels s and s+1 are s and s+1 respectively, where Local_up = floor(i/2^s)*2^(s+1) + (i%2^s), Local_down = floor(i/2^s)*2^(s+1) + 2^s + (i%2^s); Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_R[Local_up][s+1] = f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_R[Local_up][s]);
Local_R[Local_down][s+1] = Local_R[Local_down][s] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) the loop index variable for the first iteration is s, s-0, 1, n-n1.
(2) The loop index variable for the second repeat loop is i, i ═ T, T + Tn-n1-1-1-t)/T)*T。
(3) The binary representation of i comprises n-n1-1 bits, i% being the lower s bits of i (2)s) Assigning variable i _ LSB; and the high (n-n1-1-s) bit of i, i.e. floor (i/(2)s) Is assigned to the variable i _ MSB.
(4) Calculating the shared memory addresses of Local _ up and Local _ down, wherein the Local _ up ═ i _ MSB < >
(s+1))+i_LSB,Local_down=Local_up+(1<<s)。
(5) Local _ L [ Local _ up ] [ s +1] and Local _ L [ Local _ down ] [ s +1] are calculated, where the former has a value of f (Local _ L [ Local _ down ] [ s +1] + Local _ R [ Local _ down ] [ s ], Local _ R [ Local _ up ] [ s ]), and the latter has a value of Local _ R [ Local _ down ] [ s ] + f (L [ Local _ up ] [ s +1], R [ Local _ up ] [ s ]).
(6) After the second recirculation is completed, __ synchreads () is called to synchronize the threads within the thread block.
An example procedure is as follows:
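The original listing is available only as an image; below is a minimal CUDA-style sketch of the R1 stage loops described in items (1)-(6) above, under the same assumptions as the earlier sketches. The function name stage_R1 is illustrative.

// Sketch of the R1 stage (Step 3.4) for thread block (p, b).
__device__ void stage_R1(float* Local_L, float* Local_R, int n, int n1) {
    for (int s = 0; s <= n - n1 - 1; s++) {
        int numTasks = 1 << (n - n1 - 1);
        for (int i = threadIdx.x; i < numTasks; i += blockDim.x) {
            int i_LSB = i & ((1 << s) - 1);
            int i_MSB = i >> s;
            int up   = (i_MSB << (s + 1)) + i_LSB;
            int down = up + (1 << s);
            float Lu1 = Local_L[up   * (n + 2) + s + 1];
            float Ld1 = Local_L[down * (n + 2) + s + 1];
            float Ru0 = Local_R[up   * (n + 2) + s];
            float Rd0 = Local_R[down * (n + 2) + s];
            Local_R[up   * (n + 2) + s + 1] = f_minsum(Ld1 + Rd0, Ru0);
            Local_R[down * (n + 2) + s + 1] = Rd0 + f_minsum(Lu1, Ru0);
        }
        __syncthreads();
    }
}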
Step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory, i.e. the thread-block shared-memory exchange between the R1 and R2 stages. The specific steps are as follows:
Step 3.5.1: thread 0 of each thread block, i.e. thread ((p, b), 0) (p = 0,1,...,P-1; b = 0,1,...,2^n1-1), writes Local_R[d2*2^n1+d1][n-n1] of the shared memory of thread block (p, b) to graph_info[p].swap[d1*2^(n-n1)+d2*2^n1+b] in the global memory, where d1 = 0,1,...,2^n1-1; d2 = 0,1,...,2^(n-2n1)-1. There are P*2^n1 threads with index ((p, b), 0); since the addresses written by different threads do not overlap, they can execute in parallel.
Step 3.5.2: the 2^n1 thread blocks of each factor graph synchronize with one another. No synchronization is required between thread blocks of different factor graphs.
Step 3.5.3: thread 0 of thread block (p, d1), i.e. thread ((p, d1), 0), reads graph_info[p].swap[d1*2^(n-n1)+d2*2^n1+b] from the global memory and writes it to Local_R[d2*2^n1+b][n-n1+1] in the shared memory of thread block (p, d1), where b = 0,1,...,2^n1-1; d1 = 0,1,...,2^n1-1; d2 = 0,1,...,2^(n-2n1)-1. There are P*2^n1 threads with index ((p, d1), 0), and they can execute in parallel.
Step 3.6: the second stage of the rightward iteration, the R2 stage, with level numbers s = n-n1, ..., n-1.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0,1,...,P-1) is handled by the thread blocks indexed (p,0), ..., (p,2^n1-1). The thread blocks of different factor graphs are independent and can run in parallel;
(2) Parallelism among the multiple thread blocks of the same factor graph: each factor graph is computed by 2^n1 thread blocks, and the b-th thread block (b = 0,1,...,2^n1-1) uses the first-dimension index set of L and R given by R2_Block(b) = { ia*2^n1 + b : ia = 0,...,2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0,1,...,2^(n-n1-1)-1) uses, at level s, the first-dimension index set of L and R
R2_Task(b,i,s) = { floor(i/2^(s-n1))*2^(s+1) + a*2^s + (i%2^(s-n1))*2^n1 + b : a = 0,1 };
there is therefore no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group being handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the ith subtask of a thread block (p, b) at the s-th level requires the computation of:
R_{up,s+1} = f(L_{down,s+1} + R_{down,s}, R_{up,s})
R_{down,s+1} = R_{down,s} + f(L_{up,s+1}, R_{up,s})
where
up = floor(i/2^(s-n1))*2^(s+1) + (i%2^(s-n1))*2^n1 + b
down = floor(i/2^(s-n1))*2^(s+1) + 2^s + (i%2^(s-n1))*2^n1 + b
According to the distributed storage scheme of the invention, in the R2 stage the shared-memory indices of rows up and down are Local_up and Local_down respectively, and the shared-memory column indices of levels s and s+1 are s+1 and s+2 respectively, where Local_up = floor(i'/2^j)*2^(j+1) + (i'%2^j), Local_down = floor(i'/2^j)*2^(j+1) + 2^j + (i'%2^j), i' = (i%2^(n-2n1))*2^(n-2n1) + floor(i/2^(n-2n1)), i' = 0,1,...,2^(n-n1-1)-1, j = s-n+n1; Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_R[Local_up][s+2] = f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_R[Local_up][s+1]);
Local_R[Local_down][s+2] = Local_R[Local_down][s+1] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) the loop index variable for the first iteration is s, s — n1.
(2) Calculate j-s-n + n1.
(3) The loop index variable for the second repeat loop is i, i ═ T, T + Tn-n1-1-1-t)/T)*T。
(4) The binary representation of i includes n-n1-1 bits, i% being the lower j bits of i (2)j) Assigning variable i _ LSB; and the high (n-n1-1-j) bit of i, i.e. floor (i/(2)j) Is assigned to the variable i _ MSB.
(5) The shared memory addresses, Local _ up and Local _ down, are calculated, where Local _ up ═ (i _ MSB < (j +1)) + i _ LSB, and Local _ down ═ Local _ up + (1< < j).
(6) Local _ L [ Local _ up ] [ s +2] and Local _ L [ Local _ down ] [ s +2] are calculated, wherein the former has a value of f (Local _ L [ Local _ down ] [ s +2] + Local _ R [ Local _ down ] [ s +1], Local _ R [ Local _ up ] [ s +1]), and the latter has a value of Local _ R [ Local _ down ] [ s +1] + f (L [ Local _ up ] [ s +2], Local _ R [ Local _ up ] [ s +1 ]).
(7) After the second recirculation is completed, __ synchreads () is called to synchronize the threads within the thread block.
An example procedure is as follows:
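The original listing is available only as an image; below is a minimal CUDA-style sketch of the R2 stage loops described in items (1)-(7) above, under the same assumptions as the earlier sketches. The function name stage_R2 is illustrative.

// Sketch of the R2 stage (Step 3.6) for thread block (p, b).
__device__ void stage_R2(float* Local_L, float* Local_R, int n, int n1) {
    for (int s = n - n1; s <= n - 1; s++) {
        int j = s - n + n1;
        int numTasks = 1 << (n - n1 - 1);
        for (int i = threadIdx.x; i < numTasks; i += blockDim.x) {
            int i_LSB = i & ((1 << j) - 1);
            int i_MSB = i >> j;
            int up   = (i_MSB << (j + 1)) + i_LSB;
            int down = up + (1 << j);
            float Lu2 = Local_L[up   * (n + 2) + s + 2];
            float Ld2 = Local_L[down * (n + 2) + s + 2];
            float Ru1 = Local_R[up   * (n + 2) + s + 1];
            float Rd1 = Local_R[down * (n + 2) + s + 1];
            Local_R[up   * (n + 2) + s + 2] = f_minsum(Ld2 + Rd1, Ru1);
            Local_R[down * (n + 2) + s + 2] = Rd1 + f_minsum(Lu2, Ru1);
        }
        __syncthreads();
    }
}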
Step 3.7: determine whether the iteration result of any factor graph satisfies the early-termination condition. If at least one factor graph satisfies the condition, the number p of that factor graph is recorded in the variable p_good (if several factor graphs satisfy the condition, any one of them is recorded), the loop terminates and execution jumps to Step 4. Otherwise, i.e. if no factor graph satisfies the condition, it is determined whether the preset maximum number of loops has been reached; if it has, p_good is set to 0 (corresponding to the first factor graph), the loop terminates and execution jumps to Step 4; if it has not, the loop continues, i.e. execution jumps back to Step 3.1. Various early-termination conditions are possible, e.g. no improvement between iterations or passing an additional CRC check. The maximum number of loops may be chosen the same as in a serial implementation of Polar code belief-propagation decoding, typically between 50 and 200.
Step 4: thread 0 of each thread block of the factor graph p_good, i.e. thread ((p_good, b), 0), where b = 0,1,...,N1-1, adds Local_L[][0] and Local_R[][0] in the shared memory and stores the result, after inverse permutation, into uLLR as the decoding result. There are N1 threads with index ((p_good, b), 0) and they can execute in parallel.
In the program, this step can be implemented with a single loop; the specific flow is as follows:
(1) The loop index variable is d, d = 0,1,...,N/N1-1.
(2) The row index in the full matrices corresponding to Local_L[d][0] and Local_R[d][0] in the local memory is b*N1+d, and its index in the original factor graph (i.e. the factor graph before permutation) is graph_info[p_good].InvPerm[b*N1+d]. Local_L[d][0] and Local_R[d][0] are added and the result is stored in uLLR[graph_info[p_good].InvPerm[b*N1+d]].
An example procedure is as follows:
for (d = 0; d < N/N1; d++)
    uLLR[graph_info[p_good].InvPerm[b*N1+d]] = Local_L[d][0] + Local_R[d][0];
and 5: the host transmits the decoded result, i.e., the log-likelihood ratios of the source bits, llr back from the GPU to the host.

Claims (6)

1. A Polar code high-speed parallel decoding method based on GPU is characterized in that: the whole decoding process can be divided into three stages: the method comprises an initialization stage, a decoding stage and a result returning stage, wherein the initialization stage comprises the following steps 1 and 2, the decoding stage comprises the following steps 3 and 4, and the following step 5 is the result returning stage:
step 1: host initialization
This comprises, in order: allocating memory space for the information bit flags, the factor graph permutation and inverse permutation information, the signals received by the receiver, and the decoding result, i.e. the log-likelihood ratios of the source bits; initializing information and variables; storing the received signals and computing the log-likelihood ratios of the coded bits;
step 2: GPU initialization
This comprises, in order: allocating the GPU global memory; the host sending data to the GPU; starting the GPU parallel decoding threads; the GPU allocating shared memory; initializing the shared memory; and assigning values to the shared-memory arrays from the global memory;
and step 3: the decoding kernel function performs a plurality of loop iterations, and the maximum loop number is preset by the program
Each loop comprises: the L1 stage, exchanging thread-block shared memory between the L1 and L2 stages, the L2 stage, the R1 stage, exchanging thread-block shared memory between the R1 and R2 stages, the R2 stage, and checking the loop-termination condition: if a factor graph satisfies the early-termination condition during the loop or the maximum number of loops has been reached, the variable p_good is set, the loop terminates and execution jumps to Step 4;
and 4, step 4: thread number 0, i.e., thread ((p _ good, b),0), for all thread blocks of the factor graph p _ good, where b is 0,1,.., N1-1,
N1 being the number of thread blocks assigned to each factor graph and N being the code length of the Polar code; Local_L[][0] + Local_R[][0] in the shared memory, after inverse permutation, gives the decoding result;
and 5: the host transmits the decoded result from the GPU back to the host.
2. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 1, wherein: in step 2, when the GPU is initialized, the L and R arrays used in the decoding process are stored distributively in the shared memory of each thread block; that is, in one complete loop the thread blocks need to exchange shared-memory data through the global memory only 2 times, and all other operations can use the shared memory within the thread blocks.
3. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 1, wherein: the allocation of the global memory in step 2 specifically comprises: the global memory used by the same factor graph is stored contiguously, and the global memory space used for exchanging shared memory between the thread blocks is stored contiguously in the order in which the thread blocks read it, that is, when each thread block reads from the exchange space, the addresses it reads are contiguous.
4. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 1, wherein: each loop of the step-3 loop iteration comprises the following steps:
step 3.1: the first stage of the leftward iteration, the L1 stage, comprises the level-(n-1), ..., level-(n-n1) iterations, where n1 = log2(N1);
step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory;
step 3.3: the second stage of the leftward iteration, the L2 stage, comprises the level-(n-n1-1), ..., level-0 iterations;
step 3.4: the first stage of the rightward iteration, the R1 stage, comprises the level-0, ..., level-(n-n1-1) iterations;
step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory;
step 3.6: the second stage of the rightward iteration, the R2 stage, comprises the level-(n-n1), ..., level-(n-1) iterations;
step 3.7: determining whether a factor graph satisfies the early-termination condition or the maximum number of loops has been reached, and setting the variable p_good.
5. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 4, wherein: the stages L1, L2, R1 and R2 in step 3 each include three levels of parallelism:
the first level is parallelism among multiple factor graphs, each factor graph being handled by N1 thread blocks; the thread blocks of each factor graph are independent, and the thread blocks of different factor graphs run in parallel;
the second level is parallelism among the multiple thread blocks of the same factor graph: each factor graph is computed by N1 thread blocks, and different thread blocks have no data dependency and run in parallel;
the third level is multi-thread parallelism within the same thread block: the computation of each thread block at each level can be divided into N/N1/2 = 2^(n-n1-1) subtasks with no data dependency between them; these subtasks are divided into min(T, 2^(n-n1-1)) groups, where T is the number of cores contained in each streaming multiprocessor of the GPU; each group of subtasks is handled by one thread of the thread block, and all threads can execute in parallel; after each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
6. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 5, wherein: the division details of the parallel of the same factor graph multithreading block in the second level and the multithreading parallel in the same threading block in the third level are as follows:
(1) In the L1 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the L1 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the L1 stage the multiple threads of the same thread block execute in parallel;
(2) In the L2 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the L2 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the L2 stage the multiple threads of the same thread block execute in parallel;
(3) In the R1 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the R1 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the R1 stage the multiple threads of the same thread block execute in parallel;
(4) In the R2 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the R2 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the R2 stage the multiple threads of the same thread block execute in parallel.
CN202010629868.3A 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU Active CN111966405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010629868.3A CN111966405B (en) 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU


Publications (2)

Publication Number Publication Date
CN111966405A (en) 2020-11-20
CN111966405B CN111966405B (en) 2022-07-26

Family

ID=73361314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010629868.3A Active CN111966405B (en) 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU

Country Status (1)

Country Link
CN (1) CN111966405B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113014270A (en) * 2021-02-22 2021-06-22 上海大学 Partially folded polarization code decoder with configurable code length


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814039A (en) * 2010-02-02 2010-08-25 北京航空航天大学 GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
US20140208183A1 (en) * 2013-01-23 2014-07-24 Samsung Electronics Co., Ltd. Method and system for encoding and decoding data using concatenated polar codes
CN107609642A (en) * 2016-01-20 2018-01-19 南京艾溪信息科技有限公司 Computing device and method
CN105843590A (en) * 2016-04-08 2016-08-10 深圳航天科技创新研究院 Parallel pre-decoding method and system for instruction sets
CN108462495A (en) * 2018-04-03 2018-08-28 北京航空航天大学 A kind of multielement LDPC code high-speed parallel decoder and its interpretation method based on GPU
US20190311520A1 (en) * 2018-04-05 2019-10-10 Imagination Technologies Limited Texture Filtering with Dynamic Scheduling in Computer Graphics
CN111026444A (en) * 2019-11-21 2020-04-17 中国航空工业集团公司西安航空计算技术研究所 GPU parallel array SIMT instruction processing model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Pan Xiaofei (潘小飞), "Design and Performance Analysis of a Parallel Concatenated Polar Code Structure" (Polar码并行级联结构设计及性能分析), Communications Technology (《通信技术》), 10 February 2016 (2016-02-10)


Also Published As

Publication number Publication date
CN111966405B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
CN111213125B (en) Efficient direct convolution using SIMD instructions
TWI442222B (en) Flash memory device and method for managing a flash memory device
WO2016210027A1 (en) Decoupled processor instruction window and operand buffer
KR20180021850A (en) Mapping an instruction block to an instruction window based on block size
US7895417B2 (en) Select-and-insert instruction within data processing systems
CN108920412B (en) Algorithm automatic tuning method for heterogeneous computer system structure
CN111860805B (en) Fractal calculation device and method, integrated circuit and board card
US20110078418A1 (en) Support for Non-Local Returns in Parallel Thread SIMD Engine
CN111966405B (en) Polar code high-speed parallel decoding method based on GPU
US8539462B2 (en) Method for allocating registers for a processor based on cycle information
US11429299B2 (en) System and method for managing conversion of low-locality data into high-locality data
Qi et al. Implementation of accelerated BCH decoders on GPU
US20180315484A1 (en) A method for operating a semiconductor memory
US9830161B2 (en) Tree-based thread management
US8745339B2 (en) Multi-core system and method for processing data in parallel in multi-core system
CN116158029A (en) Polarization code decoder and method for polarization code decoding
CN107861834A (en) A kind of method based on wrong pre-detection skill upgrading solid state hard disc reading performance
CN111966404B (en) GPU-based regular sparse code division multiple access SCMA high-speed parallel decoding method
CN107851048B (en) Intelligent encoding apparatus, method and computer program for memory
CN118353476B (en) LDPC code decoding method and device based on Davinci architecture
US9672042B2 (en) Processing system and method of instruction set encoding space utilization
US20240184554A1 (en) Vectorizing a loop
US20210042111A1 (en) Efficient encoding of high fanout communications
CN118550697A (en) Recommendation model reasoning acceleration system based on near-memory processing architecture
CN115480825A (en) Data processing method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant