CN111966405A - Polar code high-speed parallel decoding method based on GPU - Google Patents

Polar code high-speed parallel decoding method based on GPU

Info

Publication number
CN111966405A
CN111966405A (application CN202010629868.3A)
Authority
CN
China
Prior art keywords
stage
thread
parallel
local
blocks
Prior art date
Legal status
Granted
Application number
CN202010629868.3A
Other languages
Chinese (zh)
Other versions
CN111966405B (en)
Inventor
Li Shu (李舒)
Current Assignee
Hangzhou Innovation Research Institute of Beihang University
Original Assignee
Hangzhou Innovation Research Institute of Beihang University
Priority date
Filing date
Publication date
Application filed by Hangzhou Innovation Research Institute of Beihang University
Priority to CN202010629868.3A
Publication of CN111966405A
Application granted
Publication of CN111966405B
Active
Anticipated expiration

Classifications

    • G06F9/38 — Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/5016 — Allocation of resources to service a request, the resource being the memory
    • G06F9/544 — Interprogram communication: buffers, shared memory, pipes
    • H03M13/13 — Error detection or forward error correction by redundancy in data representation using block codes; linear codes
    • H04L1/0057 — Arrangements for detecting or preventing errors in the information received using forward error control; block codes


Abstract

The invention discloses a GPU-based Polar code high-speed parallel decoding method in which the whole decoding process is divided into three stages: an initialization stage, a decoding stage and a result-returning stage. Specifically: step 1: the host is initialized; step 2: the GPU is initialized; step 3: the decoding kernel function performs a number of loop iterations, the maximum number of loops being preset by the program; step 4: thread 0 of every thread block of the factor graph p_good computes Local_L[][0] + Local_R[][0] in the shared memory and, after inverse permutation, obtains the decoding result; step 5: the host transfers the decoding result from the GPU back to the host. The method contains three levels of parallelism, namely parallelism among multiple factor graphs, among multiple thread blocks, and among multiple threads. In addition, the method minimizes the kernel-launch overhead and improves memory-access efficiency and running speed.

Description

Polar code high-speed parallel decoding method based on GPU
Technical Field
The invention belongs to the technical field of communication, and relates to a Polar code high-speed parallel decoding method based on a Graphics Processing Unit (GPU).
Background
Polar codes, proposed by Erdal Arikan in 2008 (reference [1]: Erdal Arikan, "Channel Polarization: A Method for Constructing Capacity-Achieving Codes", IEEE ISIT 2008), are currently the only channel codes that can be rigorously proven to achieve the Shannon limit. Polar codes have been officially adopted by the 5G standardization organization. Decoding methods for Polar codes can be divided into two types: methods based on successive cancellation and methods based on belief propagation. The successive-cancellation-based methods require little computation, but the algorithm is inherently serial, so the decoding delay is large. For the belief-propagation-based methods, in order to guarantee the error-correction performance of Polar code decoding, a belief-propagation list algorithm, i.e. an iterative algorithm based on multiple permuted factor graphs, is generally adopted; this decoding method requires a large amount of computation, but the belief-propagation list algorithm has the potential for parallel implementation.
On the other hand, GPU technology has developed rapidly in recent years; a commercial-grade GPU card can have over 4000 cores for parallel processing, which provides a cost-effective hardware basis for parallel computing.
Disclosure of Invention
The invention aims to provide a GPU-based Polar code high-speed parallel decoding method to realize low-delay and high-throughput decoding.
The invention provides a Polar code high-speed parallel decoding method based on a GPU. The method contains three levels of parallelism and can fully utilize the core resources of the GPU. The invention also designs an efficient distributed storage method, which improves memory-access efficiency and running speed. The whole decoding process can be divided into three stages: an initialization stage, a decoding stage and a result-returning stage. The initialization stage comprises the following Steps 1 and 2, the decoding stage comprises the following Steps 3 and 4, and the following Step 5 is the result-returning stage.
Step 1: host initialization. This comprises, in order: allocating memory space for the information bit flags, the factor graph permutation and inverse permutation information, the signals received by the receiver, and the decoding result, i.e. the log-likelihood ratios of the source bits; initializing information and variables; storing the received signals and computing the log-likelihood ratios of the coded bits.
Step 2: GPU initialization. This comprises, in order: allocating the GPU global memory; the host sending data to the GPU; starting the GPU parallel decoding threads; the GPU allocating shared memory; initializing the shared memory; and assigning values to the shared-memory arrays from the global memory.
Step 3: the decoding kernel function performs a number of loop iterations; the maximum number of loops is preset by the program. Each loop comprises: the L1 stage, exchanging thread-block shared memory between the L1 and L2 stages, the L2 stage, the R1 stage, exchanging thread-block shared memory between the R1 and R2 stages, the R2 stage, and checking the loop-termination condition. If a factor graph satisfies the early-termination condition during the loop, or the maximum number of loops has been reached, the variable p_good is set, the loop terminates and execution jumps to Step 4.
Step 4: for thread 0 of every thread block of the factor graph p_good, i.e. thread ((p_good, b), 0), where b = 0,1,...,N1-1, N1 is the number of thread blocks assigned to each factor graph and N is the code length of the Polar code, Local_L[][0] + Local_R[][0] in the shared memory, after inverse permutation, is taken as the decoding result.
Step 5: the host transfers the decoding result from the GPU back to the host.
Each loop of the Step-3 loop iteration comprises the following steps:
step 3.1: the first stage of the leftward iteration, the L1 stage, comprises the level-(n-1), ..., level-(n-n1) iterations, where n = log2(N) and n1 = log2(N1);
step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory;
step 3.3: the second stage of the leftward iteration, the L2 stage, comprises the level-(n-n1-1), ..., level-0 iterations;
step 3.4: the first stage of the rightward iteration, the R1 stage, comprises the level-0, ..., level-(n-n1-1) iterations;
step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory;
step 3.6: the second stage of the rightward iteration, the R2 stage, comprises the level-(n-n1), ..., level-(n-1) iterations;
step 3.7: determining whether a factor graph satisfies the early-termination condition or the maximum number of loops has been reached, and setting the variable p_good.
Wherein, the stages L1, L2, R1 and R2 in step 3 each include three levels of parallelism as follows:
the first level is parallelism between multiple factor graphs, each factor graph being responsible for N1 thread blocks. Because different factor graphs have no data dependency in the iterative process, the thread blocks of different factor graphs can naturally run in parallel.
The second level is parallelism among the multiple thread blocks of the same factor graph. When the code length N of the Polar code is large, the core resources and shared-memory resources of a single streaming multiprocessor on the GPU cannot support fully parallelizing one factor graph. For this reason, in the present invention, N1 thread blocks are responsible for the iteration of one factor graph. The invention divides the leftward propagation and the rightward propagation of each iteration into two stages each, i.e. four stages in total. Leftward propagation comprises the iterations of levels n-1, ..., 0 and is split into two stages: the first stage, from level n-1 to level n-n1, is called the L1 stage; the second stage, from level n-n1-1 to level 0, is called the L2 stage. Rightward propagation proceeds in the opposite direction and comprises the iterations of levels 0, 1, ..., n-1, also split into two stages: the first stage, from level 0 to level n-n1-1, is called the R1 stage; the second stage, from level n-n1 to level n-1, is called the R2 stage.
The third level is multi-thread parallelism within the same thread block. The computation of each thread block at each level can be divided into N/N1/2 = 2^(n-n1-1) subtasks with no data dependency between them; these subtasks are divided into min(T, 2^(n-n1-1)) groups, each group being handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
The details of the work division between the second level (parallelism among the thread blocks of the same factor graph) and the third level (multi-thread parallelism within one thread block, i.e. the multiple subtasks) are as follows:
(1) In the L1 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the L1 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the L1 stage the multiple threads of the same thread block execute in parallel;
(2) In the L2 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the L2 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the L2 stage the multiple threads of the same thread block execute in parallel;
(3) In the R1 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the R1 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the R1 stage the multiple threads of the same thread block execute in parallel;
(4) In the R2 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the R2 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the R2 stage the multiple threads of the same thread block execute in parallel.
In Step 2, when the GPU is initialized, the L and R arrays used in the decoding process are stored distributively in the shared memory of each thread block; that is, in one complete loop the thread blocks need to exchange shared-memory data through the global memory only 2 times, and all other operations can use the shared memory within the thread blocks. The details are as follows:
the storage space on the GPU mainly includes a global memory and a shared memory within a thread block (referred to as an intra-block shared memory or a shared memory). The access speed of the shared memory is high, but the capacity is relatively small; the global memory space is large, and the access speed is low.
The main data of Polar code belief-propagation decoding are the matrices L and R, both of size N × (n+1). When the code length N is large, the memory space required for L and R is large. To improve memory-access speed, the invention partitions L and R during the computation and stores them, in blocks, in the two-dimensional arrays Local_L and Local_R in the shared memory of the N1 thread blocks, where the first and second dimensions of Local_L are N/N1 and n+2 respectively, and Local_R has the same size as Local_L. The matrices L and R are stored distributively in the shared memory of the thread blocks as follows:
(1) For 0 <= j <= n-n1, L_{b*(N/N1)+d2*N1+d1, j} and R_{b*(N/N1)+d2*N1+d1, j} are stored in the shared memory of the b-th thread block as Local_L[d2*N1+d1][j] and Local_R[d2*N1+d1][j], where b = 0,1,...,N1-1; d2 = 0,1,...,N/(N1*N1)-1; d1 = 0,1,...,N1-1;
(2) For n-n1 <= j <= n, L_{b*(N/N1)+d2*N1+d1, j} and R_{b*(N/N1)+d2*N1+d1, j} are stored in the shared memory of the d1-th thread block as Local_L[d2*N1+b][j+1] and Local_R[d2*N1+b][j+1], where b = 0,1,...,N1-1; d2 = 0,1,...,N/(N1*N1)-1; d1 = 0,1,...,N1-1.
According to this distributed storage method, at every level s = n-1, ..., n-n1 of the L1 stage the scheme by which L and R are distributed over the thread blocks is the same, so in every level iteration of the L1 stage each thread block only needs to use the shared memory within the block, and no data needs to be exchanged between thread blocks. Similarly, the iterations within the L2, R1 and R2 stages also do not require data exchange between thread blocks. The distributed storage schemes of the L1 and L2 stages differ, so data must be exchanged between thread blocks through the global memory between the L1 stage and the L2 stage; the distributed storage schemes of the L2 and R1 stages are the same, so no data exchange through the global memory is needed between the L2 stage and the R1 stage. Similarly, data exchange through the global memory is needed between the end of the R1 stage and the start of the R2 stage, and no such exchange is needed between the end of the R2 stage and the start of the L1 stage of the next iteration. Therefore, in one complete loop (comprising 4 stages and 2n levels of iteration), the thread blocks need to exchange shared-memory data through the global memory only 2 times, and all other operations can use the shared memory within the thread blocks.
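For illustration only, the mapping between a row/column of the full matrices L, R and its location in the distributed shared-memory storage described above can be written as the following helper functions. This is a minimal CUDA C++ sketch; the struct and function names are not taken from the patent.

// Hypothetical helpers describing the distributed-storage mapping of L/R.
// Scheme 1 is item (1) above (columns 0..n-n1, used by the L2/R1 stages);
// scheme 2 is item (2) (columns n-n1..n, used by the L1/R2 stages).
// N is the code length and N1 the number of thread blocks per factor graph.
struct Placement { int block; int local_row; int local_col; };

__host__ __device__ Placement map_scheme1(int row, int j, int N, int N1) {
    Placement p;
    p.block     = row / (N / N1);     // b
    p.local_row = row % (N / N1);     // d2*N1 + d1
    p.local_col = j;                  // Local_L[d2*N1+d1][j]
    return p;
}

__host__ __device__ Placement map_scheme2(int row, int j, int N, int N1) {
    Placement p;
    int b  = row / (N / N1);
    int r  = row % (N / N1);
    int d2 = r / N1;
    int d1 = r % N1;
    p.block     = d1;                 // stored in the d1-th thread block
    p.local_row = d2 * N1 + b;        // Local_L[d2*N1+b][j+1]
    p.local_col = j + 1;
    return p;
}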
In order to improve the locality of access, the GPU global memory allocation described in step 2 is optimized as follows:
(1) The global memory includes data common to all factor graphs and data private to each factor graph. The global memory used by each factor graph is stored contiguously. The global memory used by the p-th factor graph is a structure graph_info[p], whose members include: the factor graph permutation array, the inverse permutation array, and the global memory space used by the thread blocks to exchange shared memory.
(2) graph_info[p].swap, used for exchanging shared memory between thread blocks, is stored contiguously in the order in which the thread blocks read it, i.e. when each thread block reads from graph_info[p].swap, the addresses it reads are contiguous.
The invention has the advantages and positive effects that:
(1) The parallel decoding of the invention contains three levels of parallelism. The first level is parallelism among multiple factor graphs, each factor graph being handled by several thread blocks; because different factor graphs have no data dependency during the iterations, the thread blocks of different factor graphs naturally run in parallel. The second level is parallelism among the multiple thread blocks of the same factor graph: in the invention the iteration of one factor graph is handled by N1 thread blocks, and the work division designed by the invention ensures that there is no data dependency between thread blocks within the same stage (i.e. L1, L2, R1 and R2), so the N1 thread blocks can execute in parallel. The third level is multi-thread parallelism within the same thread block: the invention divides the iteration of one thread block at each level into N/(2*N1) mutually independent subtasks, distributes them to the multiple threads of the thread block for parallel execution, and performs intra-block synchronization after every thread has finished the subtasks it is responsible for at that level. The method can fully utilize the parallel core resources of the GPU and has high computational efficiency. The synchronization overhead of the invention is small: in one complete loop (i.e. 4 stages and 2n levels of iteration), synchronization between thread blocks is needed only 2 times, and the remaining 2n-2 levels all use the intra-block synchronization mechanism. In addition, the whole decoding process on the GPU uses a single kernel function, which minimizes the kernel-launch overhead.
(2) The invention stores the main data of the iteration distributively in the shared memory of each thread block, which improves memory-access efficiency and running speed. In one complete loop (i.e. 4 stages and 2n levels of iteration), the thread blocks need to exchange shared-memory data through the global memory only 2 times, and all other operations can use the shared memory within the thread blocks. In addition, in the global memory used by the invention, the private data of the same factor graph are stored contiguously, and the global memory space used for exchanging shared memory between thread blocks is stored contiguously in the order in which the thread blocks read it, which optimizes storage locality and improves memory-access efficiency and running speed.
Drawings
FIG. 1 is a flowchart of a GPU-based Polar code high-speed parallel decoding method according to the invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
The invention provides a GPU-based Polar code high-speed parallel decoding method, which adopts a decoding algorithm based on a belief-propagation list. The decoding method comprises three stages: an initialization stage, a decoding stage and a result-returning stage. The initialization stage includes the following Steps 1 and 2, the decoding stage includes the following Steps 3 and 4, and the following Step 5 is the result-returning stage. The whole decoding process is shown in FIG. 1 and specifically includes the following steps:
Step 1: host initialization. This comprises, in order: allocating memory space for the information bit flags, the factor graph permutation and inverse permutation information, the signals received by the receiver, and the decoding result, i.e. the log-likelihood ratios of the source bits (step 1.1); initializing information and variables (step 1.2); storing the received signals and computing the log-likelihood ratios of the coded bits (step 1.3). The details are as follows:
Step 1.1: allocate host memory space, in order: the information bit flag array InfoBitFlags, the signal array y received by the receiver, and the coded-bit log-likelihood ratio array cLLR.
Step 1.2: the host initializes the information bit flag array InfoBitFlags and the factor graph permutation array Perm, and computes the inverse permutation array InvPerm from the factor graph permutation array.
Step 1.3: the host stores the signal received by the receiver in the array y and the signal-to-noise ratio of the channel in the variable SNR. The host computes the coded-bit log-likelihood ratio array cLLR = y × SNR from the received signal and the signal-to-noise ratio.
Step 2: GPU initialization. This comprises, in order: allocating the GPU global memory; the host sending data to the GPU; starting the GPU parallel decoding threads; the GPU allocating shared memory; initializing the shared memory; and assigning values to the shared-memory arrays from the global memory. The specific procedure is as follows:
Step 2.1: allocate the global memory on the GPU. It includes data common to all factor graphs and data private to each factor graph. The data common to all factor graphs comprise the coded-bit log-likelihood ratio array cLLR, the information bit flag array InfoBitFlags and the decoding result, i.e. the source-bit log-likelihood ratio array uLLR. The private data of each factor graph are stored contiguously; the private data of the p-th factor graph are stored in the structure graph_info[p], whose members include: the factor graph permutation array Perm, the inverse permutation array InvPerm, and the global memory space swap used by the thread blocks to exchange shared memory.
Step 2.2: send the information bit flag array, the factor graph permutation array, the inverse permutation array and the coded-bit log-likelihood ratio array from the host memory to the GPU; the factor graph permutation and inverse permutation arrays are factor-graph-private data and are stored in the structure graph_info[p].
Step 2.3: the host launches the GPU parallel decoding threads. The number of thread blocks is P × N1 and each thread block contains T threads, where T equals the number of cores in each streaming multiprocessor. All threads execute the same decoding kernel function and are distinguished by their indices; the index of each thread is ((p, b), t), where (p, b) is the thread block index and t is the thread index within the block.
Step 2.4: the GPU threads allocate shared memory within each thread block, comprising the two-dimensional arrays Local_L[N/N1][n+2] and Local_R[N/N1][n+2]; all elements are initialized to 0.
Step 2.5: thread 0 of each thread block on the GPU, i.e. thread ((p, b), 0) (p = 0,1,...,P-1; b = 0,1,...,2^n1-1), assigns values to Local_L[][n+1] and Local_R[][0] according to the information bit flags and the coded-bit log-likelihood ratios.
In the program, this step can be implemented with a single loop; the specific flow is as follows:
(1) The loop index variable is d, d = 0,1,...,N/N1-1.
(2) Compute dd = (d % N1) * N1 + (d / N1).
(3) The row index in the full matrices corresponding to Local_L[d][n+1] in the local memory is dd*N1+b, and its index in the original factor graph (i.e. the factor graph before permutation) is graph_info[p].InvPerm[dd*N1+b]; therefore the value cLLR[graph_info[p].InvPerm[dd*N1+b]] is assigned to Local_L[d][n+1].
(4) The row index in the full matrices corresponding to Local_R[d][0] in the local memory is b*N1+d, and its index in the original factor graph (i.e. the factor graph before permutation) is graph_info[p].InvPerm[b*N1+d]. If the information bit flag InfoBitFlags[graph_info[p].InvPerm[b*N1+d]] is 0, Local_R[d][0] is set to 1e+30.
An example procedure is as follows:
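The listing referred to above is published only as an image; the following is a minimal CUDA-style sketch of the flow in items (1)-(4) of Step 2.5. The flattened shared-array layout (row stride n+2), the helper name init_shared_llr and the exact signature are assumptions for illustration, not taken from the patent.

// Minimal sketch of Step 2.5 as a device-side helper, executed per block (p, b).
// Local_L/Local_R are the block's shared arrays in flattened form; InvPerm
// belongs to graph_info[p].
__device__ void init_shared_llr(float* Local_L, float* Local_R,
                                const float* cLLR, const int* InfoBitFlags,
                                const int* InvPerm,
                                int N, int N1, int n, int b) {
    if (threadIdx.x == 0) {
        for (int d = 0; d < N / N1; d++) {
            int dd = (d % N1) * N1 + (d / N1);
            // column n+1 holds the channel LLRs (distributed-storage scheme (2))
            Local_L[d * (n + 2) + (n + 1)] = cLLR[InvPerm[dd * N1 + b]];
            // column 0 holds the prior: frozen bits get a very large LLR
            if (InfoBitFlags[InvPerm[b * N1 + d]] == 0)
                Local_R[d * (n + 2) + 0] = 1e30f;
        }
    }
    __syncthreads();
}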
the index is ((P, b),0), P ═ 0,1,. ang., P-1; b is 0,1, 2n1-1 threads share P2n1P2 ofn1Threads may be executed in parallel.
Step 3: the decoding kernel function performs a number of loop iterations; the maximum number of loops is preset by the program. Each loop comprises the L1 stage (step 3.1), exchanging thread-block shared memory between the L1 and L2 stages (step 3.2), the L2 stage (step 3.3), the R1 stage (step 3.4), exchanging thread-block shared memory between the R1 and R2 stages (step 3.5), the R2 stage (step 3.6), and determining whether a factor graph satisfies the early-termination condition or the maximum number of loops has been reached and setting the variable p_good (step 3.7).
Step 3.1: the first stage of the leftward iteration, the L1 stage, with level numbers s = n-1, ..., n-n1.
Each level contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0,1,...,P-1) is handled by the thread blocks indexed (p,0), ..., (p,2^n1-1). The thread blocks of different factor graphs are independent and can run in parallel;
(2) Parallelism among the multiple thread blocks of the same factor graph: each factor graph is computed by 2^n1 thread blocks, and the b-th thread block (b = 0,1,...,2^n1-1) uses the first-dimension index set of L and R given by L1_Block(b) = { ia*2^n1 + b : ia = 0,...,2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0,1,...,2^(n-n1-1)-1) uses, at level s, the first-dimension index set of L and R
L1_Task(b,i,s) = { floor(i/2^(s-n1))*2^(s+1) + a*2^s + (i%2^(s-n1))*2^n1 + b : a = 0,1 };
there is therefore no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group being handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the ith subtask of a thread block (p, b) at the s-th level requires the computation of:
L_{up,s} = f(L_{down,s+1} + R_{down,s}, L_{up,s+1})
L_{down,s} = L_{down,s+1} + f(L_{up,s+1}, R_{up,s})
where
up = floor(i/2^(s-n1))*2^(s+1) + (i%2^(s-n1))*2^n1 + b
down = floor(i/2^(s-n1))*2^(s+1) + 2^s + (i%2^(s-n1))*2^n1 + b
According to the distributed storage scheme of the invention, in the L1 stage the shared-memory indices of rows up and down are Local_up and Local_down respectively, and the shared-memory column indices of levels s and s+1 are s+1 and s+2 respectively, where Local_up = floor(i'/2^j)*2^(j+1) + (i'%2^j), Local_down = floor(i'/2^j)*2^(j+1) + 2^j + (i'%2^j), i' = (i%2^(n-2n1))*2^(n-2n1) + floor(i/2^(n-2n1)), i' = 0,1,...,2^(n-n1-1)-1, j = s-n+n1; Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_L[Local_up][s+1] = f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_L[Local_up][s+2]);
Local_L[Local_down][s+1] = Local_L[Local_down][s+2] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]);
the function f is defined as: f (x, y) ═ 2tanh-1(tanh (x/2) tanh (y/2)), in practical calculations, it is usually approximated by f (x, y) ≈ 0.9375sgn (x) sgn (y) min (| x |, | y |).
In the program, this step can be implemented with two nested loops; the specific flow is as follows:
(1) The loop index variable of the outer loop is s, s = n-1, ..., n-n1. (Note that in this loop the index variable s is decremented by 1 each time.)
(2) Compute j = s - n + n1.
(3) The loop index variable of the inner loop is i, i = t, t+T, ..., t + floor((2^(n-n1-1)-1-t)/T)*T, where t is the thread index within the thread block.
(4) The binary representation of i has n-n1-1 bits. The lower j bits of i, i.e. i % 2^j, are assigned to the variable i_LSB; the upper n-n1-1-j bits of i, i.e. floor(i/2^j), are assigned to the variable i_MSB.
(5) The shared-memory addresses Local_up and Local_down are computed, where Local_up = (i_MSB << (j+1)) + i_LSB and Local_down = Local_up + (1 << j).
(6) Compute Local_L[Local_up][s+1] and Local_L[Local_down][s+1]: the former is f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_L[Local_up][s+2]), and the latter is Local_L[Local_down][s+2] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]).
(7) After the inner loop completes, __syncthreads() is called to synchronize the threads within the thread block.
An example procedure is as follows:
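The original listing is available only as an image; below is a minimal CUDA-style sketch of the two nested loops of the L1 stage described in items (1)-(7) above, reusing the flattened shared-array layout and the f_minsum helper assumed in the earlier sketches. The function name stage_L1 is illustrative.

// Sketch of the L1 stage (Step 3.1) for thread block (p, b); Local_L/Local_R
// are the block's shared arrays, flattened with row stride n+2.
__device__ void stage_L1(float* Local_L, float* Local_R, int n, int n1) {
    for (int s = n - 1; s >= n - n1; s--) {        // outer loop over levels
        int j = s - n + n1;
        int numTasks = 1 << (n - n1 - 1);          // 2^(n-n1-1) subtasks
        for (int i = threadIdx.x; i < numTasks; i += blockDim.x) {
            int i_LSB = i & ((1 << j) - 1);        // lower j bits of i
            int i_MSB = i >> j;                    // upper bits of i
            int up   = (i_MSB << (j + 1)) + i_LSB; // Local_up
            int down = up + (1 << j);              // Local_down
            float Lu2 = Local_L[up   * (n + 2) + s + 2];
            float Ld2 = Local_L[down * (n + 2) + s + 2];
            float Ru1 = Local_R[up   * (n + 2) + s + 1];
            float Rd1 = Local_R[down * (n + 2) + s + 1];
            Local_L[up   * (n + 2) + s + 1] = f_minsum(Ld2 + Rd1, Lu2);
            Local_L[down * (n + 2) + s + 1] = Ld2 + f_minsum(Lu2, Ru1);
        }
        __syncthreads();                           // intra-block synchronization
    }
}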
Step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory, i.e. the thread-block shared-memory exchange between the L1 and L2 stages. The specific steps are as follows:
Step 3.2.1: thread 0 of each thread block, i.e. thread ((p, b), 0) (p = 0,1,...,P-1; b = 0,1,...,2^n1-1), writes Local_L[d2*2^n1+d1][n-n1+1] of the shared memory of thread block (p, b) to graph_info[p].swap[d1*2^(n-n1)+d2*2^n1+b] in the global memory, where d1 = 0,1,...,2^n1-1; d2 = 0,1,...,2^(n-2n1)-1. There are P*2^n1 threads with index ((p, b), 0), p = 0,1,...,P-1; b = 0,1,...,2^n1-1; since the addresses written by different threads do not overlap, these P*2^n1 threads can execute in parallel.
Step 3.2.2: the 2^n1 thread blocks of each factor graph synchronize with one another. No synchronization is required between thread blocks of different factor graphs.
Step 3.2.3: thread 0 of thread block (p, d1), i.e. thread ((p, d1), 0), reads graph_info[p].swap[d1*2^(n-n1)+d2*2^n1+b] from the global memory and writes it to Local_L[d2*2^n1+b][n-n1] in the shared memory of thread block (p, d1), where b = 0,1,...,2^n1-1; d1 = 0,1,...,2^n1-1; d2 = 0,1,...,2^(n-2n1)-1. There are P*2^n1 threads with index ((p, d1), 0), p = 0,1,...,P-1; d1 = 0,1,...,2^n1-1, and they can execute in parallel.
Step 3.3: the second stage of the leftward iteration, the L2 stage, with level numbers s = n-n1-1, ..., 0.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0,1,...,P-1) is handled by the thread blocks indexed (p,0), ..., (p,2^n1-1). The thread blocks of different factor graphs are independent and can run in parallel;
(2) Parallelism among the multiple thread blocks of the same factor graph: each factor graph is computed by 2^n1 thread blocks, and the b-th thread block (b = 0,1,...,2^n1-1) uses the first-dimension index set of L and R given by L2_Block(b) = { b*2^(n-n1) + ia : ia = 0,...,2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0,1,...,2^(n-n1-1)-1) uses, at level s, the first-dimension index set of L and R
L2_Task(b,i,s) = { b*2^(n-n1) + floor(i/2^s)*2^(s+1) + a*2^s + (i%2^s) : a = 0,1 };
there is therefore no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group being handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the ith subtask of a thread block (p, b) at the s-th level requires the computation of:
L_{up,s} = f(L_{down,s+1} + R_{down,s}, L_{up,s+1})
L_{down,s} = L_{down,s+1} + f(L_{up,s+1}, R_{up,s})
where
up = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + (i%2^s)
down = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + 2^s + (i%2^s)
According to the distributed storage scheme of the invention, in the L2 stage the shared-memory indices of rows up and down are Local_up and Local_down respectively, and the shared-memory column indices of levels s and s+1 are s and s+1 respectively, where Local_up = floor(i/2^s)*2^(s+1) + (i%2^s), Local_down = floor(i/2^s)*2^(s+1) + 2^s + (i%2^s); Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_L[Local_up][s] = f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_L[Local_up][s+1]);
Local_L[Local_down][s] = Local_L[Local_down][s+1] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) the loop index variable for the first iteration is s, s-n 1, n-n1-1. (Note that in this loop, the loop index variable s is decremented by 1 each time.)
(2) The loop index variable for the second repeat loop is i, i ═ T, T + Tn-n1-1-1-t)/T)*T。
(3) The binary representation of i comprises n-n1-1 bits, i% being the lower s bits of i (2)s) Assigning variable i _ LSB; and the high (n-n1-1-s) bit of i, i.e. floor (i/(2)j) Is assigned to the variable i _ MSB.
(4) The shared memory addresses, Local _ up and Local _ down, are calculated, where Local _ up ═ (i _ MSB < (s +1)) + i _ LSB, and Local _ down ═ Local _ up + (1< < s).
(5) Calculating Local _ L [ Local _ up ] [ s ] and Local _ L [ Local _ down ] [ s ], wherein the former has a value of f (Local _ L [ Local _ down ] [ s +1] + Local _ R [ Local _ down ] [ s ], Local _ L [ Local _ up ] [ s +1]), and the latter has a value of Local _ L [ Local _ down ] [ s +1] -, L [ Local _ down ] [ Local _ s +1] +
f(L[Local_up][s+1],Local_R[Local_up][s])。
(6) After the second recirculation is completed, __ synchreads () is called to synchronize the threads within the thread block.
An example procedure is as follows:
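The original listing is available only as an image; below is a minimal CUDA-style sketch of the L2 stage loops described in items (1)-(6) above, under the same assumptions (flattened shared arrays, f_minsum helper) as the earlier sketches. The function name stage_L2 is illustrative.

// Sketch of the L2 stage (Step 3.3) for thread block (p, b).
__device__ void stage_L2(float* Local_L, float* Local_R, int n, int n1) {
    for (int s = n - n1 - 1; s >= 0; s--) {
        int numTasks = 1 << (n - n1 - 1);
        for (int i = threadIdx.x; i < numTasks; i += blockDim.x) {
            int i_LSB = i & ((1 << s) - 1);
            int i_MSB = i >> s;
            int up   = (i_MSB << (s + 1)) + i_LSB;
            int down = up + (1 << s);
            float Lu1 = Local_L[up   * (n + 2) + s + 1];
            float Ld1 = Local_L[down * (n + 2) + s + 1];
            float Ru0 = Local_R[up   * (n + 2) + s];
            float Rd0 = Local_R[down * (n + 2) + s];
            Local_L[up   * (n + 2) + s] = f_minsum(Ld1 + Rd0, Lu1);
            Local_L[down * (n + 2) + s] = Ld1 + f_minsum(Lu1, Ru0);
        }
        __syncthreads();
    }
}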
Step 3.4: the first stage of the rightward iteration, the R1 stage, with level numbers s = 0, ..., n-n1-1.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0,1,...,P-1) is handled by the thread blocks indexed (p,0), ..., (p,2^n1-1). The thread blocks of different factor graphs are independent and can run in parallel;
(2) Parallelism among the multiple thread blocks of the same factor graph: each factor graph is computed by 2^n1 thread blocks, and the b-th thread block (b = 0,1,...,2^n1-1) uses the first-dimension index set of L and R given by R1_Block(b) = { b*2^(n-n1) + ia : ia = 0,...,2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0,1,...,2^(n-n1-1)-1) uses, at level s, the first-dimension index set of L and R
R1_Task(b,i,s) = { b*2^(n-n1) + floor(i/2^s)*2^(s+1) + a*2^s + (i%2^s) : a = 0,1 };
there is therefore no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group being handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the ith subtask of a thread block (p, b) at the s-th level requires the computation of:
R_{up,s+1} = f(L_{down,s+1} + R_{down,s}, R_{up,s})
R_{down,s+1} = R_{down,s} + f(L_{up,s+1}, R_{up,s})
where
up = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + (i%2^s)
down = b*2^(n-n1) + floor(i/2^s)*2^(s+1) + 2^s + (i%2^s)
According to the distributed storage scheme of the invention, in the R1 stage the shared-memory indices of rows up and down are Local_up and Local_down respectively, and the shared-memory column indices of levels s and s+1 are s and s+1 respectively, where Local_up = floor(i/2^s)*2^(s+1) + (i%2^s), Local_down = floor(i/2^s)*2^(s+1) + 2^s + (i%2^s); Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_R[Local_up][s+1] = f(Local_L[Local_down][s+1] + Local_R[Local_down][s], Local_R[Local_up][s]);
Local_R[Local_down][s+1] = Local_R[Local_down][s] + f(Local_L[Local_up][s+1], Local_R[Local_up][s]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) the loop index variable for the first iteration is s, s-0, 1, n-n1.
(2) The loop index variable for the second repeat loop is i, i ═ T, T + Tn-n1-1-1-t)/T)*T。
(3) The binary representation of i comprises n-n1-1 bits, i% being the lower s bits of i (2)s) Assigning variable i _ LSB; and the high (n-n1-1-s) bit of i, i.e. floor (i/(2)s) Is assigned to the variable i _ MSB.
(4) Calculating the shared memory addresses of Local _ up and Local _ down, wherein the Local _ up ═ i _ MSB < >
(s+1))+i_LSB,Local_down=Local_up+(1<<s)。
(5) Local _ L [ Local _ up ] [ s +1] and Local _ L [ Local _ down ] [ s +1] are calculated, where the former has a value of f (Local _ L [ Local _ down ] [ s +1] + Local _ R [ Local _ down ] [ s ], Local _ R [ Local _ up ] [ s ]), and the latter has a value of Local _ R [ Local _ down ] [ s ] + f (L [ Local _ up ] [ s +1], R [ Local _ up ] [ s ]).
(6) After the second recirculation is completed, __ synchreads () is called to synchronize the threads within the thread block.
An example procedure is as follows:
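The original listing is available only as an image; below is a minimal CUDA-style sketch of the R1 stage loops described in items (1)-(6) above, under the same assumptions as the earlier sketches. The function name stage_R1 is illustrative.

// Sketch of the R1 stage (Step 3.4) for thread block (p, b).
__device__ void stage_R1(float* Local_L, float* Local_R, int n, int n1) {
    for (int s = 0; s <= n - n1 - 1; s++) {
        int numTasks = 1 << (n - n1 - 1);
        for (int i = threadIdx.x; i < numTasks; i += blockDim.x) {
            int i_LSB = i & ((1 << s) - 1);
            int i_MSB = i >> s;
            int up   = (i_MSB << (s + 1)) + i_LSB;
            int down = up + (1 << s);
            float Lu1 = Local_L[up   * (n + 2) + s + 1];
            float Ld1 = Local_L[down * (n + 2) + s + 1];
            float Ru0 = Local_R[up   * (n + 2) + s];
            float Rd0 = Local_R[down * (n + 2) + s];
            Local_R[up   * (n + 2) + s + 1] = f_minsum(Ld1 + Rd0, Ru0);
            Local_R[down * (n + 2) + s + 1] = Rd0 + f_minsum(Lu1, Ru0);
        }
        __syncthreads();
    }
}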
Step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory, i.e. the thread-block shared-memory exchange between the R1 and R2 stages. The specific steps are as follows:
Step 3.5.1: thread 0 of each thread block, i.e. thread ((p, b), 0) (p = 0,1,...,P-1; b = 0,1,...,2^n1-1), writes Local_R[d2*2^n1+d1][n-n1] of the shared memory of thread block (p, b) to graph_info[p].swap[d1*2^(n-n1)+d2*2^n1+b] in the global memory, where d1 = 0,1,...,2^n1-1; d2 = 0,1,...,2^(n-2n1)-1. There are P*2^n1 threads with index ((p, b), 0); since the addresses written by different threads do not overlap, they can execute in parallel.
Step 3.5.2: the 2^n1 thread blocks of each factor graph synchronize with one another. No synchronization is required between thread blocks of different factor graphs.
Step 3.5.3: thread 0 of thread block (p, d1), i.e. thread ((p, d1), 0), reads graph_info[p].swap[d1*2^(n-n1)+d2*2^n1+b] from the global memory and writes it to Local_R[d2*2^n1+b][n-n1+1] in the shared memory of thread block (p, d1), where b = 0,1,...,2^n1-1; d1 = 0,1,...,2^n1-1; d2 = 0,1,...,2^(n-2n1)-1. There are P*2^n1 threads with index ((p, d1), 0), and they can execute in parallel.
Step 3.6: the second stage of the rightward iteration, the R2 stage, with level numbers s = n-n1, ..., n-1.
Each level also contains three levels of parallelism:
(1) Parallelism among multiple factor graphs: the p-th factor graph (p = 0,1,...,P-1) is handled by the thread blocks indexed (p,0), ..., (p,2^n1-1). The thread blocks of different factor graphs are independent and can run in parallel;
(2) Parallelism among the multiple thread blocks of the same factor graph: each factor graph is computed by 2^n1 thread blocks, and the b-th thread block (b = 0,1,...,2^n1-1) uses the first-dimension index set of L and R given by R2_Block(b) = { ia*2^n1 + b : ia = 0,...,2^(n-n1)-1 }, so there is no data dependency between thread blocks and they can run in parallel.
(3) Multi-thread parallelism within a thread block of the same factor graph: the computation of each thread block at each level can be decomposed into N/N1/2 = 2^(n-n1-1) subtasks. The subtask numbered i (i = 0,1,...,2^(n-n1-1)-1) uses, at level s, the first-dimension index set of L and R
R2_Task(b,i,s) = { floor(i/2^(s-n1))*2^(s+1) + a*2^s + (i%2^(s-n1))*2^n1 + b : a = 0,1 };
there is therefore no data dependency between the subtasks of the same thread block, so these subtasks can be divided into min(T, 2^(n-n1-1)) groups, each group being handled by one thread of the thread block, and all threads can execute in parallel. After each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
According to the thread and subtask division scheme of the present invention, the ith subtask of a thread block (p, b) at the s-th level requires the computation of:
R_{up,s+1} = f(L_{down,s+1} + R_{down,s}, R_{up,s})
R_{down,s+1} = R_{down,s} + f(L_{up,s+1}, R_{up,s})
where
up = floor(i/2^(s-n1))*2^(s+1) + (i%2^(s-n1))*2^n1 + b
down = floor(i/2^(s-n1))*2^(s+1) + 2^s + (i%2^(s-n1))*2^n1 + b
According to the distributed storage scheme of the invention, in the R2 stage the shared-memory indices of rows up and down are Local_up and Local_down respectively, and the shared-memory column indices of levels s and s+1 are s+1 and s+2 respectively, where Local_up = floor(i'/2^j)*2^(j+1) + (i'%2^j), Local_down = floor(i'/2^j)*2^(j+1) + 2^j + (i'%2^j), i' = (i%2^(n-2n1))*2^(n-2n1) + floor(i/2^(n-2n1)), i' = 0,1,...,2^(n-n1-1)-1, j = s-n+n1; Local_up and Local_down are shared-memory addresses of thread block (p, b).
Therefore, the above calculation program can be expressed as:
Local_R[Local_up][s+2] = f(Local_L[Local_down][s+2] + Local_R[Local_down][s+1], Local_R[Local_up][s+1]);
Local_R[Local_down][s+2] = Local_R[Local_down][s+1] + f(Local_L[Local_up][s+2], Local_R[Local_up][s+1]);
in the program, the step can be realized by two cycles, and the specific flow is as follows:
(1) the loop index variable for the first iteration is s, s — n1.
(2) Calculate j-s-n + n1.
(3) The loop index variable for the second repeat loop is i, i ═ T, T + Tn-n1-1-1-t)/T)*T。
(4) The binary representation of i includes n-n1-1 bits, i% being the lower j bits of i (2)j) Assigning variable i _ LSB; and the high (n-n1-1-j) bit of i, i.e. floor (i/(2)j) Is assigned to the variable i _ MSB.
(5) The shared memory addresses, Local _ up and Local _ down, are calculated, where Local _ up ═ (i _ MSB < (j +1)) + i _ LSB, and Local _ down ═ Local _ up + (1< < j).
(6) Local _ L [ Local _ up ] [ s +2] and Local _ L [ Local _ down ] [ s +2] are calculated, wherein the former has a value of f (Local _ L [ Local _ down ] [ s +2] + Local _ R [ Local _ down ] [ s +1], Local _ R [ Local _ up ] [ s +1]), and the latter has a value of Local _ R [ Local _ down ] [ s +1] + f (L [ Local _ up ] [ s +2], Local _ R [ Local _ up ] [ s +1 ]).
(7) After the second recirculation is completed, __ synchreads () is called to synchronize the threads within the thread block.
An example procedure is as follows:
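The original listing is available only as an image; below is a minimal CUDA-style sketch of the R2 stage loops described in items (1)-(7) above, under the same assumptions as the earlier sketches. The function name stage_R2 is illustrative.

// Sketch of the R2 stage (Step 3.6) for thread block (p, b).
__device__ void stage_R2(float* Local_L, float* Local_R, int n, int n1) {
    for (int s = n - n1; s <= n - 1; s++) {
        int j = s - n + n1;
        int numTasks = 1 << (n - n1 - 1);
        for (int i = threadIdx.x; i < numTasks; i += blockDim.x) {
            int i_LSB = i & ((1 << j) - 1);
            int i_MSB = i >> j;
            int up   = (i_MSB << (j + 1)) + i_LSB;
            int down = up + (1 << j);
            float Lu2 = Local_L[up   * (n + 2) + s + 2];
            float Ld2 = Local_L[down * (n + 2) + s + 2];
            float Ru1 = Local_R[up   * (n + 2) + s + 1];
            float Rd1 = Local_R[down * (n + 2) + s + 1];
            Local_R[up   * (n + 2) + s + 2] = f_minsum(Ld2 + Rd1, Ru1);
            Local_R[down * (n + 2) + s + 2] = Rd1 + f_minsum(Lu2, Ru1);
        }
        __syncthreads();
    }
}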
Step 3.7: determine whether the iteration result of any factor graph satisfies the early-termination condition. If at least one factor graph satisfies the condition, the number p of that factor graph is recorded in the variable p_good (if several factor graphs satisfy the condition, any one of them is recorded), the loop terminates and execution jumps to Step 4. Otherwise, i.e. if no factor graph satisfies the condition, it is determined whether the preset maximum number of loops has been reached; if it has, p_good is set to 0 (corresponding to the first factor graph), the loop terminates and execution jumps to Step 4; if it has not, the loop continues, i.e. execution jumps back to Step 3.1. Various early-termination conditions are possible, e.g. no improvement between iterations or passing an additional CRC check. The maximum number of loops may be chosen the same as in a serial implementation of Polar code belief-propagation decoding, typically between 50 and 200.
Step 4: thread 0 of each thread block of the factor graph p_good, i.e. thread ((p_good, b), 0), where b = 0,1,...,N1-1, adds Local_L[][0] and Local_R[][0] in the shared memory and stores the result, after inverse permutation, into uLLR as the decoding result. There are N1 threads with index ((p_good, b), 0) and they can execute in parallel.
In the program, this step can be implemented with a single loop; the specific flow is as follows:
(1) The loop index variable is d, d = 0,1,...,N/N1-1.
(2) The row index in the full matrices corresponding to Local_L[d][0] and Local_R[d][0] in the local memory is b*N1+d, and its index in the original factor graph (i.e. the factor graph before permutation) is graph_info[p_good].InvPerm[b*N1+d]. Local_L[d][0] and Local_R[d][0] are added and the result is stored in uLLR[graph_info[p_good].InvPerm[b*N1+d]].
An example procedure is as follows:
for (d = 0; d < N/N1; d++)
    uLLR[graph_info[p_good].InvPerm[b*N1+d]] = Local_L[d][0] + Local_R[d][0];
and 5: the host transmits the decoded result, i.e., the log-likelihood ratios of the source bits, llr back from the GPU to the host.

Claims (6)

1. A Polar code high-speed parallel decoding method based on GPU is characterized in that: the whole decoding process can be divided into three stages: the method comprises an initialization stage, a decoding stage and a result returning stage, wherein the initialization stage comprises the following steps 1 and 2, the decoding stage comprises the following steps 3 and 4, and the following step 5 is the result returning stage:
step 1: host initialization
This comprises, in order: allocating memory space for the information bit flags, the factor graph permutation and inverse permutation information, the signals received by the receiver, and the decoding result, i.e. the log-likelihood ratios of the source bits; initializing information and variables; storing the received signals and computing the log-likelihood ratios of the coded bits;
step 2: GPU initialization
This comprises, in order: allocating the GPU global memory; the host sending data to the GPU; starting the GPU parallel decoding threads; the GPU allocating shared memory; initializing the shared memory; and assigning values to the shared-memory arrays from the global memory;
and step 3: the decoding kernel function performs a plurality of loop iterations, and the maximum loop number is preset by the program
Each loop comprises: the L1 stage, exchanging thread-block shared memory between the L1 and L2 stages, the L2 stage, the R1 stage, exchanging thread-block shared memory between the R1 and R2 stages, the R2 stage, and checking the loop-termination condition: if a factor graph satisfies the early-termination condition during the loop or the maximum number of loops has been reached, the variable p_good is set, the loop terminates and execution jumps to Step 4;
and 4, step 4: thread number 0, i.e., thread ((p _ good, b),0), for all thread blocks of the factor graph p _ good, where b is 0,1,.., N1-1,
N1 being the number of thread blocks assigned to each factor graph and N being the code length of the Polar code; Local_L[][0] + Local_R[][0] in the shared memory, after inverse permutation, gives the decoding result;
and 5: the host transmits the decoded result from the GPU back to the host.
2. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 1, wherein: in step 2, when the GPU is initialized, the L and R arrays used in the decoding process are stored distributively in the shared memory of each thread block; that is, in one complete loop the thread blocks need to exchange shared-memory data through the global memory only 2 times, and all other operations can use the shared memory within the thread blocks.
3. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 1, wherein: the allocation of the global memory in step 2 specifically comprises: the global memory used by the same factor graph is stored contiguously, and the global memory space used for exchanging shared memory between the thread blocks is stored contiguously in the order in which the thread blocks read it, that is, when each thread block reads from the exchange space, the addresses it reads are contiguous.
4. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 1, wherein: each loop of the step-3 loop iteration comprises the following steps:
step 3.1: the first stage of the leftward iteration, the L1 stage, comprises the level-(n-1), ..., level-(n-n1) iterations, where n1 = log2(N1);
step 3.2: the thread blocks of each factor graph exchange Local_L[][n-n1+1] in the shared memory through the global memory;
step 3.3: the second stage of the leftward iteration, the L2 stage, comprises the level-(n-n1-1), ..., level-0 iterations;
step 3.4: the first stage of the rightward iteration, the R1 stage, comprises the level-0, ..., level-(n-n1-1) iterations;
step 3.5: the thread blocks of each factor graph exchange Local_R[][n-n1] in the shared memory through the global memory;
step 3.6: the second stage of the rightward iteration, the R2 stage, comprises the level-(n-n1), ..., level-(n-1) iterations;
step 3.7: determining whether a factor graph satisfies the early-termination condition or the maximum number of loops has been reached, and setting the variable p_good.
5. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 4, wherein: the stages L1, L2, R1 and R2 in step 3 each include three levels of parallelism:
the first level is parallelism among multiple factor graphs, each factor graph being handled by N1 thread blocks; the thread blocks of each factor graph are independent, and the thread blocks of different factor graphs run in parallel;
the second level is parallelism among the multiple thread blocks of the same factor graph: each factor graph is computed by N1 thread blocks, and different thread blocks have no data dependency and run in parallel;
the third level is multi-thread parallelism within the same thread block: the computation of each thread block at each level can be divided into N/N1/2 = 2^(n-n1-1) subtasks with no data dependency between them; these subtasks are divided into min(T, 2^(n-n1-1)) groups, where T is the number of cores contained in each streaming multiprocessor of the GPU; each group of subtasks is handled by one thread of the thread block, and all threads can execute in parallel; after each thread completes the group of subtasks it is responsible for, thread synchronization within the thread block is performed.
6. The method for high-speed parallel decoding of Polar codes based on GPU as claimed in claim 5, wherein: the division details of the parallel of the same factor graph multithreading block in the second level and the multithreading parallel in the same threading block in the third level are as follows:
(1) In the L1 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the L1 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the L1 stage the multiple threads of the same thread block execute in parallel;
(2) In the L2 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the L2 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the L2 stage the multiple threads of the same thread block execute in parallel;
(3) In the R1 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the R1 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the R1 stage the multiple threads of the same thread block execute in parallel;
(4) In the R2 stage, the data of different thread blocks have no dependency, so the thread blocks can run in parallel, i.e. in the R2 stage the multiple thread blocks of the same factor graph run in parallel; within one thread block, the subtasks of level s have no data dependency among them and are distributed to the multiple threads of the thread block for parallel execution, i.e. in the R2 stage the multiple threads of the same thread block execute in parallel.
CN202010629868.3A 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU Active CN111966405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010629868.3A CN111966405B (en) 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU


Publications (2)

Publication Number Publication Date
CN111966405A (en) 2020-11-20
CN111966405B CN111966405B (en) 2022-07-26

Family

ID=73361314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010629868.3A Active CN111966405B (en) 2020-07-03 2020-07-03 Polar code high-speed parallel decoding method based on GPU

Country Status (1)

Country Link
CN (1) CN111966405B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113014270A (en) * 2021-02-22 2021-06-22 上海大学 Partially folded polarization code decoder with configurable code length


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814039A (en) * 2010-02-02 2010-08-25 北京航空航天大学 GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
US20140208183A1 (en) * 2013-01-23 2014-07-24 Samsung Electronics Co., Ltd. Method and system for encoding and decoding data using concatenated polar codes
CN107609642A (en) * 2016-01-20 2018-01-19 南京艾溪信息科技有限公司 Computing device and method
CN105843590A (en) * 2016-04-08 2016-08-10 深圳航天科技创新研究院 Parallel pre-decoding method and system for instruction sets
CN108462495A (en) * 2018-04-03 2018-08-28 北京航空航天大学 A kind of multielement LDPC code high-speed parallel decoder and its interpretation method based on GPU
US20190311520A1 (en) * 2018-04-05 2019-10-10 Imagination Technologies Limited Texture Filtering with Dynamic Scheduling in Computer Graphics
CN111026444A (en) * 2019-11-21 2020-04-17 中国航空工业集团公司西安航空计算技术研究所 GPU parallel array SIMT instruction processing model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Pan Xiaofei (潘小飞), "Design and Performance Analysis of a Parallel Concatenated Polar Code Structure" (Polar码并行级联结构设计及性能分析), Communications Technology (《通信技术》), 10 February 2016 (2016-02-10)


Also Published As

Publication number Publication date
CN111966405B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
CN111213125B (en) Efficient direct convolution using SIMD instructions
TWI442222B (en) Flash memory device and method for managing a flash memory device
WO2016210027A1 (en) Decoupled processor instruction window and operand buffer
KR20180021850A (en) Mapping an instruction block to an instruction window based on block size
US7895417B2 (en) Select-and-insert instruction within data processing systems
CN108920412B (en) Algorithm automatic tuning method for heterogeneous computer system structure
CN111860805B (en) Fractal calculation device and method, integrated circuit and board card
US20110078418A1 (en) Support for Non-Local Returns in Parallel Thread SIMD Engine
CN111966405B (en) Polar code high-speed parallel decoding method based on GPU
US8539462B2 (en) Method for allocating registers for a processor based on cycle information
US11429299B2 (en) System and method for managing conversion of low-locality data into high-locality data
Qi et al. Implementation of accelerated BCH decoders on GPU
US20180315484A1 (en) A method for operating a semiconductor memory
US9830161B2 (en) Tree-based thread management
US8745339B2 (en) Multi-core system and method for processing data in parallel in multi-core system
CN116158029A (en) Polarization code decoder and method for polarization code decoding
CN107861834A (en) A kind of method based on wrong pre-detection skill upgrading solid state hard disc reading performance
CN111966404B (en) GPU-based regular sparse code division multiple access SCMA high-speed parallel decoding method
CN107851048B (en) Intelligent encoding apparatus, method and computer program for memory
CN118353476B (en) LDPC code decoding method and device based on Davinci architecture
US9672042B2 (en) Processing system and method of instruction set encoding space utilization
US20240184554A1 (en) Vectorizing a loop
US20210042111A1 (en) Efficient encoding of high fanout communications
CN118550697A (en) Recommendation model reasoning acceleration system based on near-memory processing architecture
CN115480825A (en) Data processing method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant