CN111966405B - Polar code high-speed parallel decoding method based on GPU - Google Patents

Polar code high-speed parallel decoding method based on GPU

Info

Publication number
CN111966405B
CN111966405B (application CN202010629868.3A)
Authority
CN
China
Prior art keywords
thread
stage
local
parallel
blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010629868.3A
Other languages
Chinese (zh)
Other versions
CN111966405A (en)
Inventor
李舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Innovation Research Institute of Beihang University
Original Assignee
Hangzhou Innovation Research Institute of Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Innovation Research Institute of Beihang University filed Critical Hangzhou Innovation Research Institute of Beihang University
Priority to CN202010629868.3A priority Critical patent/CN111966405B/en
Publication of CN111966405A publication Critical patent/CN111966405A/en
Application granted granted Critical
Publication of CN111966405B publication Critical patent/CN111966405B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/13Linear codes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00Arrangements for detecting or preventing errors in the information received
    • H04L1/004Arrangements for detecting or preventing errors in the information received by using forward error control
    • H04L1/0056Systems characterized by the type of code used
    • H04L1/0057Block codes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a GPU-based Polar code high-speed parallel decoding method in which the whole decoding process is divided into three stages: an initialization stage, a decoding stage and a result-return stage. Specifically: Step 1: host initialization; Step 2: GPU initialization; Step 3: the decoding kernel function performs a number of loop iterations, the maximum number of loops being preset by the program; Step 4: thread No. 0 of every thread block of the factor graph p_good obtains the decoding result by inversely permuting Local_L[][0] + Local_R[][0] in the shared memory; Step 5: the host transfers the decoding result from the GPU back to host memory. The method contains three levels of parallelism: among multiple subgraphs (factor graphs), among multiple thread blocks, and among multiple threads within a thread block. In addition, the method minimizes the kernel launch overhead and improves memory-access efficiency and running speed.

Description

Polar code high-speed parallel decoding method based on GPU
Technical Field
The invention belongs to the technical field of communication, and relates to a Polar code high-speed parallel decoding method based on a Graphics Processing Unit (GPU).
Background
Polar codes, proposed by Erdal Arikan in 2008 (reference [1]: Erdal Arikan, "Channel Polarization: A Method for Constructing Capacity-Achieving Codes", IEEE ISIT 2008), are the only channel codes that can currently be rigorously proven to achieve the Shannon limit. Polar codes have been officially adopted by the 5G standardization organization. Decoding methods for Polar codes fall into two families: successive-cancellation-based methods and belief-propagation-based methods. Successive-cancellation-based methods require little computation, but the algorithm is inherently serial, so the decoding delay is large. For belief-propagation-based methods, a belief propagation list algorithm, i.e., an iterative algorithm over multiple permuted factor graphs, is generally adopted to guarantee the error-correction performance of Polar code decoding; this makes the computation heavy, but the belief propagation list algorithm has great potential for parallel implementation.
On the other hand, GPU technology has developed rapidly in recent years, and a commercial-grade GPU card can have over 4000 cores for parallel processing, which provides a cost-effective hardware basis for parallel computing.
Disclosure of Invention
The invention aims to provide a GPU-based Polar code high-speed parallel decoding method to realize low-delay and high-throughput decoding.
The invention provides a Polar code high-speed parallel decoding method based on a GPU. The method comprises three levels of parallelism and can fully utilize the core resources of the GPU. The invention also designs an efficient distributed storage method, which improves memory-access efficiency and running speed. The whole decoding process can be divided into three stages: an initialization stage, a decoding stage and a result-return stage. The initialization stage comprises the following steps 1 and 2, the decoding stage comprises the following steps 3 and 4, and the following step 5 is the result-return stage.
Step 1: host initialization. This comprises, in order: allocating memory space for the information bit flags, the factor graph permutation and inverse permutation information, the signals received by the receiver and the decoding result, i.e., the log-likelihood ratios of the source bits; initializing information and variables; storing the received signals; and computing the log-likelihood ratios of the coded bits.
Step 2: GPU initialization. This comprises, in order: allocating the global memory of the GPU, the host sending data to the GPU, starting the parallel decoding threads of the GPU, the GPU allocating shared memory, initializing the shared memory, and assigning values to the shared-memory arrays from the global memory.
Step 3: the decoding kernel function performs a number of loop iterations, the maximum number of loops being preset by the program. Each loop comprises: the L1 stage, exchanging thread-block shared memory between the L1 and L2 stages, the L2 stage, the R1 stage, exchanging thread-block shared memory between the R1 and R2 stages, the R2 stage, and checking the loop-termination condition. If some factor graph meets the early-termination condition during the loop, or the maximum number of loops has been reached, the variable p_good is set, the loop terminates and execution jumps to step 4.
Step 4: for thread No. 0 of all thread blocks of the factor graph p_good, i.e., the threads ((p_good, b), 0), where b = 0, 1, ..., N1-1, N1 = 2^(n1) being the number of thread blocks responsible for one factor graph and N the code length of the Polar code, the decoding result is obtained by inversely permuting Local_L[][0] + Local_R[][0] in the shared memory.
Step 5: the host transfers the decoding result from the GPU back to host memory.
Each loop of the loop iteration of step 3 comprises the following steps:
Step 3.1: the first stage of the leftward iteration, the L1 stage, includes iterations of levels n-1, ..., n-n1, where n = log2(N) and n1 = log2(N1);
Step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory;
Step 3.3: the second stage of the leftward iteration, the L2 stage, includes iterations of levels n-n1-1, ..., 0;
Step 3.4: the first stage of the rightward iteration, the R1 stage, includes iterations of levels 0, ..., n-n1-1;
Step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory;
Step 3.6: the second stage of the rightward iteration, the R2 stage, includes iterations of levels n-n1, ..., n-1;
Step 3.7: judge whether some factor graph meets the early-termination condition or the maximum number of loops has been reached, and set the variable p_good.
The L1, L2, R1 and R2 stages in step 3 each contain the following three levels of parallelism:
The first level is parallelism among multiple factor graphs; each factor graph is handled by N1 thread blocks. Because different factor graphs have no data dependency during the iterations, the thread blocks of different factor graphs naturally run in parallel.
The second level is parallelism among the multiple thread blocks of the same factor graph. When the code length N of the Polar code is large, the core resources and shared-memory resources of one streaming multiprocessor on the GPU cannot support fully parallelizing a factor graph. For this reason, in the present invention the iteration of one factor graph is handled by N1 thread blocks. The invention divides the leftward propagation and the rightward propagation of each iteration into two stages each, i.e., four stages in total. Leftward propagation includes iterations of levels n-1, ..., 0 and is divided into two stages: the first stage covers levels n-1 down to n-n1 and is called the L1 stage; the second stage covers levels n-n1-1 down to 0 and is called the L2 stage. Rightward propagation runs in the opposite direction and includes iterations of levels 0, 1, ..., n-1, divided into two stages: the first stage covers levels 0 to n-n1-1 and is called the R1 stage; the second stage covers levels n-n1 to n-1 and is called the R2 stage.
The third level is multi-thread parallelism within the same thread block. The computation of each thread block at each level can be divided into N/N1/2 = 2^(n-n1-1) subtasks with no data dependency between them; these subtasks are divided into min(T, 2^(n-n1-1)) groups, each group of subtasks is handled by one thread in the thread block, and all threads execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
The work division for the second level (parallelism among the thread blocks of the same factor graph) and the third level (multi-thread parallelism, i.e., the subtasks, within the same thread block) is as follows:
(1) in the L1 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the L1 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the L1 stage;
(2) in the L2 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the L2 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the L2 stage;
(3) in the R1 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the R1 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the R1 stage;
(4) in the R2 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the R2 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the R2 stage.
in step 2, when the GPU is initialized, the L and R arrays used in the decoding process are distributively stored in the shared memory of each thread block, that is, in a complete cycle process, the shared memory is exchanged for 2 times only by the thread blocks existing in the global, and all other operations can use the shared memory in the thread blocks. The method comprises the following specific steps:
the storage space on the GPU mainly includes a global memory and a shared memory within a thread block (referred to as an intra-block shared memory or a shared memory). The access speed of the shared memory is high, but the capacity is relatively small; the global memory space is large, and the access speed is low.
The main data of Polar code belief propagation decoding are matrices L and R, both of which are N (N + 1). When the code length N is large, the storage space required by L and R is large. In order to improve the memory access speed, the invention divides L and R in the calculation process, and stores the L and R in two-dimensional arrays of Local _ L and Local _ R in a shared memory in blocks of N thread blocks, wherein the first dimension and the second dimension of the Local _ L are respectively N/N1 and N +1, and the size of the Local _ R is the same as that of the Local _ L. The matrix L and R are distributively stored in the shared memory in each thread block by the following method:
(1) For 0 <= j <= n-n1, L[b*(N/N1)+d2*N1+d1][j] and R[b*(N/N1)+d2*N1+d1][j] are stored in Local_L[d2*N1+d1][j] and Local_R[d2*N1+d1][j] in the shared memory of the b-th thread block, where b = 0, 1, ..., N1-1; d2 = 0, 1, ..., N/(N1*N1)-1; d1 = 0, 1, ..., N1-1;
(2) For n-n1 <= j <= n, L[b*(N/N1)+d2*N1+d1][j] and R[b*(N/N1)+d2*N1+d1][j] are stored in Local_L[d2*N1+b][j+1] and Local_R[d2*N1+b][j+1] in the shared memory of the d1-th thread block, where b = 0, 1, ..., N1-1; d2 = 0, 1, ..., N/(N1*N1)-1; d1 = 0, 1, ..., N1-1.
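As an illustration of the mapping in (1) and (2), the following is a minimal CUDA-style sketch; the helper names map_low_levels and map_high_levels and the struct Loc are illustrative and not taken from the patent, and the code assumes N = 2^n, N1 = 2^(n1) and that N1*N1 divides N, as in the notation above.

struct Loc { int block; int row; int col; };

// Hypothetical helper: levels 0 <= j <= n-n1. Global row b*(N/N1) + d2*N1 + d1 is held
// by thread block b at shared-memory row d2*N1 + d1, column j.
__host__ __device__ Loc map_low_levels(int global_row, int j, int N, int N1) {
    int b = global_row / (N / N1);
    int r = global_row % (N / N1);   // r = d2*N1 + d1
    return Loc{ b, r, j };
}

// Hypothetical helper: levels n-n1 <= j <= n. Global row b*(N/N1) + d2*N1 + d1 is held
// by thread block d1 at shared-memory row d2*N1 + b, column j+1 (one-column shift).
__host__ __device__ Loc map_high_levels(int global_row, int j, int N, int N1) {
    int b  = global_row / (N / N1);
    int r  = global_row % (N / N1);
    int d2 = r / N1;
    int d1 = r % N1;
    return Loc{ d1, d2 * N1 + b, j + 1 };
}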
With this distributed storage method, the distribution of L and R over the thread blocks is the same at every level (s = n-1, ..., n-n1) of the L1 stage, so during the level iterations of the L1 stage each thread block only needs its intra-block shared memory and no data needs to be exchanged between thread blocks. Similarly, the level iterations within the L2, R1 and R2 stages do not require data exchange between thread blocks either. The distributed storage schemes of L1 and L2 differ, so data must be exchanged between thread blocks through the global memory between the end of the L1 stage and the start of the L2 stage; the storage schemes of L2 and R1 are the same, so no exchange through the global memory is needed between the end of the L2 stage and the start of the R1 stage. Likewise, data must be exchanged through the global memory between the end of the R1 stage and the start of the R2 stage, while no exchange is needed between the end of the R2 stage and the L1 stage of the next iteration. Therefore, in one complete loop (i.e., 4 stages and 2n level iterations in total), the thread blocks exchange shared-memory data through the global memory only 2 times, and all other operations use the intra-block shared memory.
In order to improve the locality of access, the GPU global memory allocation described in step 2 is optimized as follows:
(1) The global memory contains data common to all factor graphs and data private to each factor graph. The global memory used by each factor graph is stored contiguously. The global memory used by the p-th factor graph is a structure graph_info[p], whose members include the factor graph permutation array, the inverse permutation array, and the global memory space swap used by the thread blocks to exchange shared-memory data.
(2) graph_info[p].swap, the buffer through which the thread blocks exchange shared-memory data, is stored contiguously in the order in which the thread blocks read it, i.e., when each thread block reads from graph_info[p].swap, its address space is contiguous.
The invention has the following advantages and positive effects:
(1) The parallel decoding of the invention contains three levels of parallelism. The first level is parallelism among multiple factor graphs, each factor graph being handled by several thread blocks; because different factor graphs have no data dependency during the iterations, the thread blocks of different factor graphs naturally run in parallel. The second level is parallelism among the thread blocks of the same factor graph: the iteration of one factor graph is handled by N1 thread blocks, and the work division designed by the invention guarantees that there is no data dependency between thread blocks within the same stage (i.e., L1, L2, R1 or R2), so the N1 thread blocks can execute in parallel. The third level is multi-thread parallelism within the same thread block: the invention divides the iteration of one thread block at each level into N/(2*N1) mutually independent subtasks, distributes them to the threads of the thread block for parallel execution, and performs intra-block synchronization after each thread has executed all subtasks it is responsible for at that level. The method can fully utilize the parallel core resources of the GPU and has high computational efficiency. The synchronization overhead is small: in one complete loop (i.e., 4 stages and 2n level iterations), only 2 synchronizations between thread blocks are needed, and the remaining 2n-2 levels all use the intra-block synchronization mechanism. In addition, the whole decoding process on the GPU uses a single kernel function, which minimizes the kernel launch overhead.
(2) The invention stores the main iteration data distributively in the shared memory of each thread block, which improves memory-access efficiency and running speed. In one complete loop (i.e., 4 stages and 2n level iterations), the thread blocks exchange shared-memory data through the global memory only 2 times, and all other operations use the intra-block shared memory. Moreover, in the global memory space used by the invention, the private data of the same factor graph are stored contiguously, and the global memory used to exchange shared-memory data between thread blocks is stored contiguously in the order in which the thread blocks read it, which optimizes storage locality and improves memory-access efficiency and running speed.
Drawings
FIG. 1 is a flowchart of a GPU-based Polar code high-speed parallel decoding method according to the invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
The invention provides a GPU-based Polar code high-speed parallel decoding method that adopts a belief-propagation-list-based decoding algorithm. The decoding method comprises three stages: an initialization stage, a decoding stage, and a result-return stage. The initialization stage includes the following steps 1 and 2, the decoding stage includes the following steps 3 and 4, and the following step 5 is the result-return stage. The whole decoding process is shown in FIG. 1 and specifically includes the following steps:
step 1: and (4) initializing the host. Sequentially comprises the following steps: allocating memory space for information bit marks, factor graph permutation and inverse permutation information, signals received by a receiver, decoding results, namely log-likelihood ratios of source bits (step 1.1), initializing information and variables (step 1.2), storing the received signals and calculating the log-likelihood ratios of coded bits (step 1.3); the method comprises the following specific steps:
step 1.1: allocating a host memory space, sequentially comprising: the information bit flag array InfoBitFlags, the signal array y received by the receiver, and the log-likelihood ratio array cLLR of the coded bits.
Step 1.2: the host initializes the information bit flag array InfoBitFlags, the factor graph permutation array Perm, and calculates the inverse permutation array InvPerm according to the factor graph permutation array.
Step 1.3: the host stores the signal received by the receiver in the array y and the signal-to-noise ratio of the channel in the variable SNR. The host then computes the coded-bit log-likelihood ratio array cLLR from the received signal y and the signal-to-noise ratio SNR.
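The patent does not spell out the formula that maps y and SNR to cLLR. As a minimal sketch, assuming unit-energy BPSK over a real AWGN channel and SNR given as Es/N0 in linear scale (both assumptions, not taken from the patent), the host-side computation of step 1.3 could look like:

// Hypothetical host-side computation of the coded-bit LLRs (step 1.3).
void compute_coded_bit_llrs(const float* y, float snr_linear, float* cLLR, int N) {
    // Noise variance under the assumed Es/N0 convention (assumption).
    float sigma2 = 1.0f / (2.0f * snr_linear);
    for (int i = 0; i < N; ++i)
        cLLR[i] = 2.0f * y[i] / sigma2;   // channel LLR of each coded bit
}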
Step 2: and initializing the GPU. Sequentially comprises the following steps: distributing global memory of the GPU, sending data to the GPU by the host, starting a parallel decoding thread of the GPU, distributing shared memory by the GPU, initializing the shared memory, and assigning values to the array of the shared memory according to the global memory; the specific process is as follows:
step 2.1: and distributing the global memory on the GPU. Including common data for all factor graphs and proprietary data for each factor graph. The common data of all the factor graphs comprises a code bit log-likelihood ratio array cLLR, an information bit flag array InfoBitFlags and a decoding result, namely a source bit log-likelihood ratio array uLLR. The proprietary data of each factor graph is stored continuously, the proprietary data of the p-th factor graph is stored in a structure graph _ info [ p ], and the members of the proprietary data comprise: the system comprises a factor graph permutation array Perm, an inverse permutation array InvPerm and a global memory space swap for a thread block exchange shared memory.
Step 2.2: and sending an information bit flag array, a factor graph permutation array, a reverse permutation array and a coding bit log-likelihood ratio array of a host memory to the GPU, wherein the factor graph permutation array and the reverse permutation array are factor graph exclusive data and are stored in a structure graph _ info [ p ].
Step 2.3: the host starts parallel decoding threads of the GPU, and the number of the thread blocks is P × N1, each thread block comprises T threads, wherein T is equal to the number of cores contained in each streaming multiprocessor. All threads execute the same decoding kernel function, and the threads are distinguished by thread indexes, wherein the index of each thread is ((p, b), t), the (p, b) is the thread block index, and the t is the thread index in the thread block.
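A minimal launch sketch of step 2.3 follows; the kernel name decode_kernel, its argument list and the d_-prefixed device pointers are illustrative, not taken from the patent. Inside the kernel the indices can be recovered as p = blockIdx.x / N1, b = blockIdx.x % N1 and t = threadIdx.x.

// Hypothetical launch of the single decoding kernel: P*N1 thread blocks, T threads each.
dim3 grid(P * N1);
dim3 block(T);
decode_kernel<<<grid, block>>>(d_graph_info, d_cLLR, d_InfoBitFlags, d_uLLR,
                               N, N1, P, max_iterations);
cudaDeviceSynchronize();

If a grid-wide barrier is used to realize the inter-block synchronization of steps 3.2 and 3.5 (see the sketch there), the kernel must instead be started with cudaLaunchCooperativeKernel.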
Step 2.4: the GPU threads allocate the intra-block shared memory, which comprises the two-dimensional arrays Local_L[N/N1][n+2] and Local_R[N/N1][n+2]; all elements are initialized to 0.
Step 2.5: thread No. 0 of each thread block on the GPU, i.e., thread ((p, b), 0) (p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1), assigns values to Local_L[][n+1] and Local_R[][0] according to the information bit flags and the coded-bit log-likelihood ratios.
In the program this step can be implemented with one loop, with the following flow:
(1) The loop index variable is d, d = 0, 1, ..., N/N1-1.
(2) Compute dd = (d % N1) * N1 + (d / N1).
(3) The global memory address corresponding to Local_L[d][n+1] in the local (shared) memory is dd*N1+b, and its index in the original factor graph (i.e., the factor graph before permutation) is graph_info[p].InvPerm[dd*N1+b], so the value of cLLR[graph_info[p].InvPerm[dd*N1+b]] is assigned to Local_L[d][n+1].
(4) The global memory address corresponding to Local_R[d][0] in the local (shared) memory is b*N1+d, and its index in the original factor graph (i.e., the factor graph before permutation) is graph_info[p].InvPerm[b*N1+d]. If the information bit flag InfoBitFlags[graph_info[p].InvPerm[b*N1+d]] is 0, Local_R[d][0] is set to 1e+30.
An example procedure is given in the original as a code-listing image, which is not reproduced here.
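Since the original listing is not reproduced, the following is a minimal CUDA sketch of the flow (1)-(4) above, executed inside the decoding kernel by thread No. 0 of thread block (p, b); the variables N, N1, n, p, b, Local_L, Local_R, graph_info, cLLR and InfoBitFlags are assumed to be in scope, and the index arithmetic follows the description above.

// Hypothetical in-kernel initialization of the shared-memory arrays (step 2.5).
if (threadIdx.x == 0) {
    for (int d = 0; d < N / N1; ++d) {
        int dd = (d % N1) * N1 + (d / N1);
        // Rightmost column of Local_L receives the coded-bit LLR of the permuted position.
        Local_L[d][n + 1] = cLLR[graph_info[p].InvPerm[dd * N1 + b]];
        // Column 0 of Local_R marks frozen (non-information) bits with a very large LLR.
        if (InfoBitFlags[graph_info[p].InvPerm[b * N1 + d]] == 0)
            Local_R[d][0] = 1e+30f;
    }
}
__syncthreads();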
There are P*2^(n1) threads with index ((p, b), 0), p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1, and these P*2^(n1) threads can execute in parallel.
Step 3: the decoding kernel function performs a number of loop iterations, the maximum number of loops being preset by the program. Each loop includes the L1 stage (step 3.1), exchanging thread-block shared memory between the L1 and L2 stages (step 3.2), the L2 stage (step 3.3), the R1 stage (step 3.4), exchanging thread-block shared memory between the R1 and R2 stages (step 3.5), the R2 stage (step 3.6), and judging whether some factor graph satisfies the early-termination condition or the maximum number of loops has been reached and setting the variable p_good (step 3.7).
Step 3.1: the first stage of the leftward iteration, the L1 stage, with level numbers s = n-1, ..., n-n1.
Each level contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0, 1, ..., P-1) is handled by the thread blocks indexed (p, 0), ..., (p, 2^(n1)-1). The thread blocks of different factor graphs are independent of each other and can run in parallel.
(2) Parallelism among the thread blocks of the same factor graph: each factor graph is computed by 2^(n1) thread blocks. The set of first-dimension indices of L and R used by the b-th thread block (b = 0, 1, ..., 2^(n1)-1) is L1_Block(b) = { ia*2^(n1) + b : ia = 0, ..., 2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0, 1, ..., 2^(n-n1-1)-1) uses, at level s, the set of first-dimension indices of L and R
L1_Task(b, i, s) = { floor(i/2^(s-n1))*2^(s+1) + a*2^s + (i % 2^(s-n1))*2^(n1) + b : a = 0, 1 };
therefore there is no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group of subtasks is handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the i-th subtask of thread block (p, b) at level s needs to compute:
L(up, s) = f(L(down, s+1) + R(down, s), L(up, s+1))
L(down, s) = L(down, s+1) + f(L(up, s+1), R(up, s))
where
up = floor(i/2^(s-n1))*2^(s+1) + (i % 2^(s-n1))*2^(n1) + b
down = floor(i/2^(s-n1))*2^(s+1) + 2^s + (i % 2^(s-n1))*2^(n1) + b
According to the distributed storage scheme of the present invention, in the L1 stage the row indices of up and down in the shared memory are Local_up and Local_down, and the column indices corresponding to levels s and s+1 are s+1 and s+2, respectively, where Local_up = floor(i'/2^j)*2^(j+1) + (i' % 2^j), Local_down = floor(i'/2^j)*2^(j+1) + 2^j + (i' % 2^j), i' = (i % 2^(n-2n1))*2^(n-2n1) + floor(i/2^(n-2n1)), i' = 0, 1, ..., 2^(n-n1-1)-1, j = s-n+n1, and Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_L[Local_up][s+1] = f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_L[Local_up][s+2]);
Local_L[Local_down][s+1] = Local_L[Local_down][s+2] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]);
The function f is defined as f(x, y) = 2*tanh^(-1)(tanh(x/2)*tanh(y/2)); in actual computation it is generally approximated by f(x, y) ≈ 0.9375*sgn(x)*sgn(y)*min(|x|, |y|).
In the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) The index variable of the outer loop is s, s = n-1, n-2, ..., n-n1 (note that in this loop s is decremented by 1 each time).
(2) Compute j = s - n + n1.
(3) The index variable of the inner loop is i, i = t, t+T, ..., t + floor((2^(n-n1-1)-1-t)/T)*T, where t is the thread index within the thread block.
(4) The binary representation of i contains n-n1-1 bits; the low j bits of i, i.e., i % 2^j, are assigned to the variable i_LSB, and the high n-n1-1-j bits of i, i.e., floor(i/2^j), are assigned to the variable i_MSB.
(5) Compute the shared-memory addresses Local_up and Local_down, where Local_up = (i_MSB << (j+1)) + i_LSB and Local_down = Local_up + (1 << j).
(6) Compute Local_L[Local_up][s+1] and Local_L[Local_down][s+1]: the former is f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_L[Local_up][s+2]), and the latter is Local_L[Local_down][s+2] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]).
(7) After the inner loop completes, __syncthreads() is called to synchronize the threads in the thread block.
An example procedure is given in the original as a code-listing image, which is not reproduced here.
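Since the original listing is not reproduced, the following is a minimal CUDA sketch of the flow (1)-(7) above, as executed by one thread of thread block (p, b); Local_L, Local_R, n, n1 and T are assumed to be in scope, the function name f_approx is illustrative, and the L2, R1 and R2 stages follow the same pattern with their own index formulas.

// Hypothetical scaled min-sum approximation of f(x, y) = 2*atanh(tanh(x/2)*tanh(y/2)).
__device__ __forceinline__ float f_approx(float x, float y) {
    float s = ((x < 0.0f) != (y < 0.0f)) ? -1.0f : 1.0f;   // sgn(x)*sgn(y)
    return 0.9375f * s * fminf(fabsf(x), fabsf(y));
}

// Hypothetical per-thread body of the L1 stage (levels n-1 down to n-n1).
const int t = threadIdx.x;
for (int s = n - 1; s >= n - n1; --s) {
    int j = s - n + n1;
    for (int i = t; i < (1 << (n - n1 - 1)); i += T) {
        int i_LSB = i & ((1 << j) - 1);               // low j bits of i
        int i_MSB = i >> j;                           // remaining high bits of i
        int local_up   = (i_MSB << (j + 1)) + i_LSB;
        int local_down = local_up + (1 << j);
        Local_L[local_up][s + 1] =
            f_approx(Local_L[local_down][s + 2] + Local_R[local_down][s + 1],
                     Local_L[local_up][s + 2]);
        Local_L[local_down][s + 1] =
            Local_L[local_down][s + 2] +
            f_approx(Local_L[local_up][s + 2], Local_R[local_up][s + 1]);
    }
    __syncthreads();                                  // intra-block synchronization per level
}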
Step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory, i.e., the thread-block shared memory is exchanged between the L1 and L2 stages. The details are as follows:
Step 3.2.1: thread No. 0 of each thread block, i.e., thread ((p, b), 0) (p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1), writes Local_L[d2*2^(n1)+d1][n-n1+1] in the shared memory of thread block (p, b) to graph_info[p].swap[d1*2^(n-n1) + d2*2^(n1) + b] in the global memory, where d1 = 0, 1, ..., 2^(n1)-1; d2 = 0, 1, ..., 2^(n-2n1)-1. There are P*2^(n1) threads numbered ((p, b), 0), p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1. Since the addresses to which the threads write swap do not overlap, these P*2^(n1) threads can execute in parallel.
Step 3.2.2: the 2^(n1) thread blocks of each factor graph perform synchronization between thread blocks. No synchronization is required between thread blocks of different factor graphs.
Step 3.2.3: thread No. 0 of thread block (p, d1), i.e., thread ((p, d1), 0), writes graph_info[p].swap[d1*2^(n-n1) + d2*2^(n1) + b] in the global memory to Local_L[d2*2^(n1)+b][n-n1] in the shared memory of thread block (p, d1), where b = 0, 1, ..., 2^(n1)-1; d1 = 0, 1, ..., 2^(n1)-1; d2 = 0, 1, ..., 2^(n-2n1)-1. There are P*2^(n1) threads numbered ((p, d1), 0), p = 0, 1, ..., P-1; d1 = 0, 1, ..., 2^(n1)-1, and these P*2^(n1) threads can execute in parallel.
Step 3.3: the second stage of the left iteration, stage L2, has a stage number s-n 1-1.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0, 1, ..., P-1) is handled by the thread blocks numbered (p, 0), ..., (p, 2^(n1)-1). The thread blocks of different factor graphs are independent of each other and can run in parallel.
(2) Parallelism among the thread blocks of the same factor graph: each factor graph is computed by 2^(n1) thread blocks. The set of first-dimension indices of L and R used by the b-th thread block (b = 0, 1, ..., 2^(n1)-1) is L2_Block(b) = { b*2^(n-n1) + ia : ia = 0, ..., 2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0, 1, ..., 2^(n-n1-1)-1) uses, at level s, the set of first-dimension indices of L and R
L2_Task(b, i, s) = { b*2^(n-n1) + floor(i/2^s)*2^(s+1) + a*2^s + (i % 2^s) : a = 0, 1 };
therefore there is no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group of subtasks is handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the i-th subtask of thread block (p, b) at level s needs to compute:
L(up, s) = f(L(down, s+1) + R(down, s), L(up, s+1))
L(down, s) = L(down, s+1) + f(L(up, s+1), R(up, s))
where
up = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + (i % 2^s)
down = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + 2^s + (i % 2^s)
According to the distributed storage scheme of the present invention, in the L2 stage the row indices of up and down in the shared memory are Local_up and Local_down, and the column indices corresponding to levels s and s+1 are s and s+1, respectively, where Local_up = floor(i/2^s)*2^(s+1) + (i % 2^s), Local_down = floor(i/2^s)*2^(s+1) + 2^s + (i % 2^s), and Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_L[Local_up][s] = f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_L[Local_up][s+1]);
Local_L[Local_down][s] = Local_L[Local_down][s+1] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) The index variable of the outer loop is s, s = n-n1-1, n-n1-2, ..., 0 (note that in this loop s is decremented by 1 each time).
(2) The index variable of the inner loop is i, i = t, t+T, ..., t + floor((2^(n-n1-1)-1-t)/T)*T, where t is the thread index within the thread block.
(3) The binary representation of i contains n-n1-1 bits; the low s bits of i, i.e., i % 2^s, are assigned to the variable i_LSB, and the high n-n1-1-s bits of i, i.e., floor(i/2^s), are assigned to the variable i_MSB.
(4) Compute the shared-memory addresses Local_up and Local_down, where Local_up = (i_MSB << (s+1)) + i_LSB and Local_down = Local_up + (1 << s).
(5) Compute Local_L[Local_up][s] and Local_L[Local_down][s]: the former is f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_L[Local_up][s+1]), and the latter is Local_L[Local_down][s+1] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]).
(6) After the inner loop completes, __syncthreads() is called to synchronize the threads in the thread block.
An example procedure is given in the original as a code-listing image, which is not reproduced here.
Step 3.4: the first stage of the rightward iteration, the R1 stage, with level numbers s = 0, ..., n-n1-1.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0, 1, ..., P-1) is handled by the thread blocks numbered (p, 0), ..., (p, 2^(n1)-1). The thread blocks of different factor graphs are independent of each other and can run in parallel.
(2) Parallelism among the thread blocks of the same factor graph: each factor graph is computed by 2^(n1) thread blocks. The set of first-dimension indices of L and R used by the b-th thread block (b = 0, 1, ..., 2^(n1)-1) is R1_Block(b) = { b*2^(n-n1) + ia : ia = 0, ..., 2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0, 1, ..., 2^(n-n1-1)-1) uses, at level s, the set of first-dimension indices of L and R
R1_Task(b, i, s) = { b*2^(n-n1) + floor(i/2^s)*2^(s+1) + a*2^s + (i % 2^s) : a = 0, 1 };
therefore there is no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group of subtasks is handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the i-th subtask of thread block (p, b) at level s needs to compute:
R(up, s+1) = f(L(down, s+1) + R(down, s), R(up, s))
R(down, s+1) = R(down, s) + f(L(up, s+1), R(up, s))
where
up = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + (i % 2^s)
down = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + 2^s + (i % 2^s)
According to the distributed storage scheme of the present invention, in the R1 stage the row indices of up and down in the shared memory are Local_up and Local_down, and the column indices corresponding to levels s and s+1 are s and s+1, respectively, where Local_up = floor(i/2^s)*2^(s+1) + (i % 2^s), Local_down = floor(i/2^s)*2^(s+1) + 2^s + (i % 2^s), and Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_R[Local_up][s+1] = f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_R[Local_up][s]);
Local_R[Local_down][s+1] = Local_R[Local_down][s] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) The index variable of the outer loop is s, s = 0, 1, ..., n-n1-1.
(2) The index variable of the inner loop is i, i = t, t+T, ..., t + floor((2^(n-n1-1)-1-t)/T)*T, where t is the thread index within the thread block.
(3) The binary representation of i contains n-n1-1 bits; the low s bits of i, i.e., i % 2^s, are assigned to the variable i_LSB, and the high n-n1-1-s bits of i, i.e., floor(i/2^s), are assigned to the variable i_MSB.
(4) Compute the shared-memory addresses Local_up and Local_down, where Local_up = (i_MSB << (s+1)) + i_LSB and Local_down = Local_up + (1 << s).
(5) Compute Local_R[Local_up][s+1] and Local_R[Local_down][s+1]: the former is f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_R[Local_up][s]), and the latter is Local_R[Local_down][s] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]).
(6) After the inner loop completes, __syncthreads() is called to synchronize the threads in the thread block.
An example procedure is given in the original as a code-listing image, which is not reproduced here.
Step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory, i.e., the thread-block shared memory is exchanged between the R1 and R2 stages. The details are as follows:
Step 3.5.1: thread No. 0 of each thread block, i.e., thread ((p, b), 0) (p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1), writes Local_R[d2*2^(n1)+d1][n-n1] in the shared memory of thread block (p, b) to graph_info[p].swap[d1*2^(n-n1) + d2*2^(n1) + b] in the global memory, where d1 = 0, 1, ..., 2^(n1)-1; d2 = 0, 1, ..., 2^(n-2n1)-1. There are P*2^(n1) threads numbered ((p, b), 0), p = 0, 1, ..., P-1; b = 0, 1, ..., 2^(n1)-1. Since the addresses to which the threads write swap do not overlap, these P*2^(n1) threads can execute in parallel.
Step 3.5.2: the 2^(n1) thread blocks of each factor graph perform synchronization between thread blocks. No synchronization is required between thread blocks of different factor graphs.
Step 3.5.3: thread No. 0 of thread block (p, d1), i.e., thread ((p, d1), 0), writes graph_info[p].swap[d1*2^(n-n1) + d2*2^(n1) + b] in the global memory to Local_R[d2*2^(n1)+b][n-n1+1] in the shared memory of thread block (p, d1), where b = 0, 1, ..., 2^(n1)-1; d1 = 0, 1, ..., 2^(n1)-1; d2 = 0, 1, ..., 2^(n-2n1)-1. There are P*2^(n1) threads numbered ((p, d1), 0), p = 0, 1, ..., P-1; d1 = 0, 1, ..., 2^(n1)-1, and these P*2^(n1) threads can execute in parallel.
Step 3.6: the second stage of the right iteration, stage R2, stage number s-n 1.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0, 1, ..., P-1) is handled by the thread blocks numbered (p, 0), ..., (p, 2^(n1)-1). The thread blocks of different factor graphs are independent of each other and can run in parallel.
(2) Parallelism among the thread blocks of the same factor graph: each factor graph is computed by 2^(n1) thread blocks. The set of first-dimension indices of L and R used by the b-th thread block (b = 0, 1, ..., 2^(n1)-1) is R2_Block(b) = { ia*2^(n1) + b : ia = 0, ..., 2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0, 1, ..., 2^(n-n1-1)-1) uses, at level s, the set of first-dimension indices of L and R
R2_Task(b, i, s) = { floor(i/2^(s-n1))*2^(s+1) + a*2^s + (i % 2^(s-n1))*2^(n1) + b : a = 0, 1 };
therefore there is no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group of subtasks is handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the i-th subtask of thread block (p, b) at level s needs to compute:
R(up, s+1) = f(L(down, s+1) + R(down, s), R(up, s))
R(down, s+1) = R(down, s) + f(L(up, s+1), R(up, s))
where
up = floor(i/2^(s-n1))*2^(s+1) + (i % 2^(s-n1))*2^(n1) + b
down = floor(i/2^(s-n1))*2^(s+1) + 2^s + (i % 2^(s-n1))*2^(n1) + b
According to the distributed storage scheme of the present invention, in the R2 stage the row indices of up and down in the shared memory are Local_up and Local_down, and the column indices corresponding to levels s and s+1 are s+1 and s+2, respectively, where Local_up = floor(i'/2^j)*2^(j+1) + (i' % 2^j), Local_down = floor(i'/2^j)*2^(j+1) + 2^j + (i' % 2^j), i' = (i % 2^(n-2n1))*2^(n-2n1) + floor(i/2^(n-2n1)), i' = 0, 1, ..., 2^(n-n1-1)-1, j = s-n+n1, and Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_R[Local_up][s+2] = f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_R[Local_up][s+1]);
Local_R[Local_down][s+2] = Local_R[Local_down][s+1] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) The index variable of the outer loop is s, s = n-n1, n-n1+1, ..., n-1.
(2) Compute j = s - n + n1.
(3) The index variable of the inner loop is i, i = t, t+T, ..., t + floor((2^(n-n1-1)-1-t)/T)*T, where t is the thread index within the thread block.
(4) The binary representation of i contains n-n1-1 bits; the low j bits of i, i.e., i % 2^j, are assigned to the variable i_LSB, and the high n-n1-1-j bits of i, i.e., floor(i/2^j), are assigned to the variable i_MSB.
(5) Compute the shared-memory addresses Local_up and Local_down, where Local_up = (i_MSB << (j+1)) + i_LSB and Local_down = Local_up + (1 << j).
(6) Compute Local_R[Local_up][s+2] and Local_R[Local_down][s+2]: the former is f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_R[Local_up][s+1]), and the latter is Local_R[Local_down][s+1] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]).
(7) After the inner loop completes, __syncthreads() is called to synchronize the threads in the thread block.
An example procedure is given in the original as a code-listing image, which is not reproduced here.
Step 3.7: judge whether the iteration result of each factor graph meets the early-termination condition. If at least one factor graph satisfies the condition, the number p of that factor graph is recorded in the variable p_good (if several factor graphs satisfy the condition, any one of them may be recorded), the loop terminates and execution jumps to step 4. Otherwise, i.e., no factor graph satisfies the condition, judge whether the preset maximum number of loops has been reached: if so, set p_good = 0 (corresponding to the first factor graph), terminate the loop and jump to step 4; if not, continue the loop, i.e., jump back to step 3.1. Various early-termination conditions may be used, such as no improvement between iterations or passing an additional CRC check. The maximum number of loops may be chosen the same as in a serial implementation of Polar code belief propagation decoding, typically between 50 and 200.
Step 4: thread No. 0 of every thread block of the factor graph p_good, i.e., the threads ((p_good, b), 0), where b = 0, 1, ..., N1-1, inversely permutes Local_L[][0] + Local_R[][0] in the shared memory and stores the decoding result in uLLR. There are N1 threads indexed ((p_good, b), 0), and they can execute in parallel.
In the program, this step can be implemented by a recirculation, and the specific flow is as follows:
(1) The loop index variable is d, d = 0, 1, ..., N/N1-1.
(2) The global memory address corresponding to Local_L[d][0] and Local_R[d][0] in the local (shared) memory is b*N1+d, and its index in the original factor graph (i.e., the factor graph before permutation) is graph_info[p_good].InvPerm[b*N1+d]. Local_L[d][0] and Local_R[d][0] are added, and the result is stored in uLLR[graph_info[p_good].InvPerm[b*N1+d]].
An example procedure is as follows:
// Thread ((p_good, b), 0): write back the decoding result for the rows handled by this thread block.
for (d = 0; d < N/N1; d++)
    uLLR[graph_info[p_good].InvPerm[b*N1 + d]] =
        Local_L[d][0] + Local_R[d][0];
and 5: the host transmits the decoded result, i.e., the log-likelihood ratios of the source bits, llr back from the GPU to the host.

Claims (6)

1. A Polar code high-speed parallel decoding method based on a GPU is characterized in that: the whole decoding process is divided into three stages: the method comprises an initialization stage, a decoding stage and a result returning stage, wherein the initialization stage comprises the following steps 1 and 2, the decoding stage comprises the following steps 3 and 4, and the following step 5 is the result returning stage:
step 1: host initialization
It comprises, in order: allocating memory space for the information bit flags, the factor graph permutation and inverse permutation information, the signals received by the receiver and the decoding result, i.e., the log-likelihood ratios of the source bits; initializing information and variables; storing the received signals; and computing the log-likelihood ratios of the coded bits;
step 2: GPU initialization
Sequentially comprises the following steps: distributing global memory of the GPU, sending data to the GPU by the host, starting a parallel decoding thread of the GPU, distributing shared memory by the GPU, initializing the shared memory, and assigning values to an array of the shared memory according to the global memory;
and step 3: the decoding kernel function performs a plurality of loop iterations, and the maximum loop number is preset by the program
Each loop comprises the following steps: the L1 stage, exchanging thread-block shared memory between the L1 and L2 stages, the L2 stage, the R1 stage, exchanging thread-block shared memory between the R1 and R2 stages, the R2 stage, and judging the loop-termination condition: if some factor graph meets the early-termination condition during the loop, or the maximum number of loops has been reached, the variable p_good is set, the loop terminates and execution jumps to step 4;
and 4, step 4: thread No. 0 of all thread blocks of the factor graph p_good, i.e., the threads ((p_good, b), 0), where b = 0, 1, ..., N1-1, N1 being the number of thread blocks responsible for one factor graph and N the code length of the Polar code, inversely permutes Local_L[][0] + Local_R[][0] in the shared memory, and the result is used as the decoding result;
and 5: the host transmits the decoded result from the GPU back to the host.
2. The method of claim 1, wherein in step 2, when the GPU is initialized, the L and R arrays used in the decoding process are stored distributively in the shared memory of each thread block, so that in one complete loop the thread blocks exchange shared-memory data through the global memory only 2 times, and all other operations use the shared memory within the thread blocks.
3. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 1, wherein the allocating of the global memory in step 2 specifically includes: the global memory used by the same factor graph is stored contiguously, and the global memory space used to exchange shared-memory data between thread blocks is stored contiguously in the order in which the thread blocks read it, that is, when each thread block reads from the exchange space, the addresses it reads are contiguous.
4. The method of claim 1, wherein each loop of the loop iteration of step 3 comprises the following steps:
step 3.1: the first stage of the leftward iteration, i.e., the L1 stage, including iterations of levels n-1, ..., n-n1, where n = log2(N) and n1 = log2(N1);
step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory;
step 3.3: the second stage of the leftward iteration, i.e., the L2 stage, including iterations of levels n-n1-1, ..., 0;
step 3.4: the first stage of the rightward iteration, i.e., the R1 stage, including iterations of levels 0, 1, ..., n-n1-1;
step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory;
step 3.6: the second stage of the rightward iteration, i.e., the R2 stage, including iterations of levels n-n1, ..., n-1;
step 3.7: judging whether some factor graph meets the early-termination condition or the maximum number of loops has been reached, and setting the variable p_good.
5. The method of claim 4, wherein the L1, L2, R1 and R2 stages in step 3 each comprise three levels of parallelism:
the first level is parallelism among multiple factor graphs, each factor graph being handled by N1 thread blocks; the thread blocks of each factor graph are independent, and the thread blocks of different factor graphs run in parallel;
the second level is parallelism among the thread blocks of the same factor graph; each factor graph is computed by N1 thread blocks, different thread blocks have no data dependency, and they run in parallel;
the third level is multi-thread parallelism within the same thread block; the computation of each thread block at each level is divided into N/(2*N1) subtasks with no data dependency between them, these subtasks are divided into min(T, N/(2*N1)) groups, where T is the number of cores contained in each streaming multiprocessor on the GPU, each group of subtasks is handled by one thread of the thread block, and all threads execute in parallel; after each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
6. The method of claim 5, wherein the work division for the parallelism among the thread blocks of the same factor graph in the second level and the multi-thread parallelism within the same thread block in the third level is as follows:
(1) in the L1 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the L1 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the L1 stage;
(2) in the L2 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the L2 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the L2 stage;
(3) in the R1 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the R1 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the R1 stage;
(4) in the R2 stage, the data of different thread blocks have no dependency and all thread blocks run in parallel, i.e., the thread blocks of the same factor graph are parallel in the R2 stage; within the same thread block there is no data dependency among the subtasks of level s, and the subtasks are distributed to the threads of the thread block for parallel execution, i.e., the threads of the same thread block are parallel in the R2 stage.
CN202010629868.3A 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU Active CN111966405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010629868.3A CN111966405B (en) 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010629868.3A CN111966405B (en) 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU

Publications (2)

Publication Number Publication Date
CN111966405A CN111966405A (en) 2020-11-20
CN111966405B true CN111966405B (en) 2022-07-26

Family

ID=73361314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010629868.3A Active CN111966405B (en) 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU

Country Status (1)

Country Link
CN (1) CN111966405B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113014270B (en) * 2021-02-22 2022-08-05 上海大学 Partially folded polarization code decoder with configurable code length

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814039A (en) * 2010-02-02 2010-08-25 北京航空航天大学 GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
CN105843590A (en) * 2016-04-08 2016-08-10 深圳航天科技创新研究院 Parallel pre-decoding method and system for instruction sets
CN108462495A (en) * 2018-04-03 2018-08-28 北京航空航天大学 A kind of multielement LDPC code high-speed parallel decoder and its interpretation method based on GPU
CN111026444A (en) * 2019-11-21 2020-04-17 中国航空工业集团公司西安航空计算技术研究所 GPU parallel array SIMT instruction processing model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9362956B2 (en) * 2013-01-23 2016-06-07 Samsung Electronics Co., Ltd. Method and system for encoding and decoding data using concatenated polar codes
CN107506828B (en) * 2016-01-20 2020-11-03 中科寒武纪科技股份有限公司 Artificial neural network computing device and method for sparse connection
GB2567507B (en) * 2018-04-05 2019-10-02 Imagination Tech Ltd Texture filtering with dynamic scheduling

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814039A (en) * 2010-02-02 2010-08-25 北京航空航天大学 GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
CN105843590A (en) * 2016-04-08 2016-08-10 深圳航天科技创新研究院 Parallel pre-decoding method and system for instruction sets
CN108462495A (en) * 2018-04-03 2018-08-28 北京航空航天大学 A kind of multielement LDPC code high-speed parallel decoder and its interpretation method based on GPU
CN111026444A (en) * 2019-11-21 2020-04-17 中国航空工业集团公司西安航空计算技术研究所 GPU parallel array SIMT instruction processing model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and performance analysis of a parallel concatenated Polar code structure; Pan Xiaofei (潘小飞); Communication Technology (《通信技术》); 2016-02-10; full text *

Also Published As

Publication number Publication date
CN111966405A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN111213125B (en) Efficient direct convolution using SIMD instructions
TWI406176B (en) Preparing instruction groups for a processor having multiple issue ports
CN107622020B (en) Data storage method, access method and device
TWI442222B (en) Flash memory device and method for managing a flash memory device
US20120151182A1 (en) Performing Function Calls Using Single Instruction Multiple Data (SIMD) Registers
CN105049061A (en) Advanced calculation-based high-dimensional polarization code decoder and polarization code decoding method
KR20180021850A (en) Mapping an instruction block to an instruction window based on block size
US7895417B2 (en) Select-and-insert instruction within data processing systems
US7428630B2 (en) Processor adapted to receive different instruction sets
CN111966405B (en) Polar code high-speed parallel decoding method based on GPU
US8539462B2 (en) Method for allocating registers for a processor based on cycle information
CN111860805B (en) Fractal calculation device and method, integrated circuit and board card
WO2020181670A1 (en) Control flow optimization in graphics processing unit
US11429299B2 (en) System and method for managing conversion of low-locality data into high-locality data
Qi et al. Implementation of accelerated BCH decoders on GPU
CN116158029A (en) Polarization code decoder and method for polarization code decoding
US20150205606A1 (en) Tree-based thread management
US8745339B2 (en) Multi-core system and method for processing data in parallel in multi-core system
CN111966404B (en) GPU-based regular sparse code division multiple access SCMA high-speed parallel decoding method
Fuentes-Alventosa et al. Cuvle: Variable-length encoding on cuda
US9672042B2 (en) Processing system and method of instruction set encoding space utilization
CN107861834A (en) A kind of method based on wrong pre-detection skill upgrading solid state hard disc reading performance
WO2022208173A2 (en) Vectorizing a loop
US20210042111A1 (en) Efficient encoding of high fanout communications
CN117934258A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant