CN109921877B - LDPC code parallel decoding method based on CUDA architecture under AWGN channel - Google Patents

LDPC code parallel decoding method based on CUDA architecture under AWGN channel

Info

Publication number
CN109921877B
CN109921877B (application CN201811596316.6A)
Authority
CN
China
Prior art keywords
gpu
node
check
decoding
ldpc code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811596316.6A
Other languages
Chinese (zh)
Other versions
CN109921877A (en)
Inventor
鲁邹晨
程诚
王涛
刘进
郭鲁皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 20 Research Institute
Original Assignee
CETC 20 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 20 Research Institute filed Critical CETC 20 Research Institute
Priority to CN201811596316.6A priority Critical patent/CN109921877B/en
Publication of CN109921877A publication Critical patent/CN109921877A/en
Application granted granted Critical
Publication of CN109921877B publication Critical patent/CN109921877B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Error Detection And Correction (AREA)

Abstract

The invention provides an LDPC code parallel decoding method based on the CUDA architecture under an AWGN channel. The method fully exploits the high parallelism and the large number of processing cores of a Graphics Processing Unit (GPU): in the iterative decoding algorithm of an LDPC code, different codewords, and the nodes corresponding to different check equations, process their messages independently, so the algorithm is well suited to a fully parallel platform. First, the decoding algorithm is a direct conversion of the original serial CPU decoder and is strictly consistent with it in principle, so no error-rate performance is lost. Second, the invention compresses and stores the check matrix H in the GPU's constant memory, effectively reducing data-access latency. Finally, the designed parallel decoding scheme invokes a large number of parallel threads on the GPU simultaneously and cooperatively processes the computationally expensive CN-node and VN-node update steps of the decoder, effectively reducing the decoding latency of the LDPC code.

Description

LDPC code parallel decoding method based on CUDA architecture under AWGN channel
Technical Field
The invention relates to the field of channel coding, in particular to a parallel decoding algorithm for low-density parity-check (LDPC) codes.
Background
The LDPC code is a linear block code whose performance approaches the Shannon limit; thanks to its excellent error-correction capability and low hardware complexity, it is widely used in many communication systems. In low signal-to-noise-ratio scenarios the LDPC code achieves a low bit error rate, and it exhibits no error floor at high signal-to-noise ratio. Implementing efficient LDPC codecs has gradually become a hotspot of channel-coding research, and with the rapid growth of computing power, researchers place ever higher demands on codec data-processing speed. Because the encoding circuit of a common LDPC code is comparatively simple, research focuses mainly on reducing decoding complexity and raising throughput; a traditional CPU platform often suffers large decoding latency when decoding LDPC codes with long code length and many iterations.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a parallel decoding scheme for LDPC codes that fully exploits the high parallelism and the large number of processing cores of a Graphics Processing Unit (GPU): in the iterative decoding algorithm of an LDPC code, different codewords, and the nodes corresponding to different check equations, process their messages independently, so the scheme is well suited to a fully parallel platform.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1, for each non-zero element of the LDPC code check matrix H of size m × n, complete channel initialization on the CPU: Lq_nm = LLR_n = L(x_n|y_n) = log(P(x_n=0|y_n) / P(x_n=1|y_n)) = 2y_n/σ², where Lq_nm represents the variable-node message and LLR_n the channel initial value of bit n;
step 2, allocate memory on the GPU and store the check matrix H in the GPU's constant memory in compact form, so that only 4 bytes are needed per matrix element: the first two bytes are the row mark and column mark of the element, the third byte is the shift value of the sub-matrix relative to the identity matrix, and the last byte indicates whether the current element is 0; then transfer the channel-initialization LLR values to the GPU;
step 3, on the GPU, for each CN node m and each VN node v ∈ V(m), calculate the check-node extrinsic message

Lr_mv = k · ( ∏_{v'∈V(m)\v} a_v'm ) · min_{v'∈V(m)\v} |Lq_v'm|

where a_v'm = sign(Lq_v'm), V(m) represents the set of VN nodes participating in the m-th check equation, and V(m)\v represents that set with element v removed;
after the functional details of the CNP core are defined on the GPU, the processing of the CN nodes is mapped to the CNP core and a thread grid Grid1 is allocated to it; the GPU thread resources used at each level of Grid1 are declared from the codeword parameters of the LDPC code, and the thread block size is set to the row weight of the LDPC check matrix; the rows of the check matrix H are processed in parallel, and the several variable nodes associated with each row compute their messages independently in parallel;
step 4, on the GPU, for each variable node v and each m ∈ M(v), calculate

Lq_vm = LLR_v + Σ_{m'∈M(v)\m} Lr_m'v

and

LQ_v = LLR_v + Σ_{m∈M(v)} Lr_mv

where M(v) represents the set of CN nodes connected to the v-th VN node and M(v)\m represents that set with element m removed;
after the variable-node processing kernel function is defined on the GPU, the processing of the variable nodes is mapped to another kernel function, the VNP core, and a thread grid Grid2 is allocated to it for acceleration; the columns of the check matrix H are processed in parallel, and the several check nodes associated with each column operate independently in parallel;
step 5, after the iteration, if the message value LQ_n ≥ 0, decide that bit ĉ_n is 0, otherwise 1; each time step 4 has been executed, form the decision codeword ĉ = (ĉ_1, ĉ_2, …, ĉ_n); if the syndrome satisfies the check constraint

H · ĉ^T = 0

the iteration terminates and ĉ is output as the decoding result; otherwise go to step 2; if decoding has not succeeded within the preset maximum number of iterations, failure is declared and the decoding iteration ends.
The invention has the following beneficial effects. First, the decoding algorithm is a direct conversion of the original serial CPU decoding, and the two are strictly consistent in principle, so no error-rate performance is lost. Second, the invention compresses and stores the check matrix H in the GPU's constant memory, effectively reducing data-access latency. Finally, the designed parallel decoding scheme invokes a large number of parallel threads on the GPU simultaneously and cooperatively processes the computationally expensive CN-node and VN-node update steps of the decoder, effectively reducing the decoding latency of the LDPC code.
Drawings
FIG. 1 is a diagram illustrating a thread block structure of a GPU.
FIG. 2 is a schematic diagram of a simulation system model for LDPC encoding and decoding.
FIG. 3 is a block diagram of an implementation of a CUDA-based LDPC parallel decoder.
Detailed Description
The present invention will be further described with reference to the following drawings and examples, which include, but are not limited to, the following examples.
Aiming at the decoding latency caused by serial iterative LDPC decoding on a traditional CPU platform, the invention provides a CUDA-based parallel decoding scheme for LDPC codes and verifies that CUDA parallel decoding loses no decoding performance while accelerating the decoding process, effectively reducing decoding latency and improving system throughput.
Under the CUDA architecture of the graphics processor, and based on a hybrid CPU/GPU programming model, the invention distributes the tasks of the coding and decoding system between the CPU and the GPU according to the characteristics of each subsystem (noise addition, modulation, encoding and decoding), and gives a concrete implementation of the parallel decoder on the GPU.
The invention improves the processing latency of the decoding part of the original LDPC coding and decoding system. Because the decoding part has high computational complexity, it is accelerated using the parallel architecture and constant-memory features of the GPU; encoding and the other modules have lower complexity than the decoding module, need no parallel processing, and are completed serially on the CPU.
For the normalized min-sum decoding algorithm of the LDPC code, the steps of CN-node processing, VN-node processing and checking of the decoding result are mapped to several independent kernel functions so that they run on the GPU, each accelerated in parallel by its own thread grid. The parallel decoding algorithm of the LDPC code mainly comprises the following steps:
The check matrix H of the LDPC code is a known m × n matrix. The codeword c = (c_1, c_2, …, c_n) is the transmitted codeword sequence; BPSK modulation produces the symbol sequence x = (x_1, x_2, …, x_n), where x_i = 1 − 2c_i, 1 ≤ i ≤ n. After transmission over the AWGN channel, the symbol sequence received at the receiver is y = (y_1, y_2, …, y_n), where y_i = x_i + n_i and n_i is AWGN channel noise with mean 0 and variance σ². From y, the codeword sequence obtained by decoding is ĉ = (ĉ_1, ĉ_2, …, ĉ_n).
Step 1, for each non-zero element of the H matrix of the LDPC code, complete channel initialization on the CPU:

Lq_nm = LLR_n = L(x_n|y_n) = log(P(x_n=0|y_n) / P(x_n=1|y_n)) = 2y_n/σ²    (1)

Lq_nm represents the variable-node message and LLR_n the channel initial value (log-likelihood ratio) of bit n;
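As an illustrative sketch only (plain Python on the host, not the patent's implementation; the function name and the use of a seeded random generator are assumptions of this sketch), the channel model and the initialization of formula (1) look like this:

```python
import random

def channel_llr(codeword, sigma, rng=random.Random(0)):
    """BPSK-modulate a binary codeword (x_i = 1 - 2*c_i), add AWGN with
    variance sigma^2, and return the initial LLRs LLR_n = 2*y_n / sigma^2
    that seed every variable-node message Lq_nm."""
    x = [1 - 2 * c for c in codeword]             # {0,1} -> {+1,-1}
    y = [xi + rng.gauss(0.0, sigma) for xi in x]  # received symbols y_i = x_i + n_i
    return [2.0 * yi / (sigma ** 2) for yi in y]  # log-likelihood ratios
```

At low noise (small sigma) the sign of each LLR recovers the transmitted bit: positive for c_n = 0, negative for c_n = 1.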
Step 2, GPU initialization: allocate memory on the GPU and store the check matrix H of the QC-LDPC code in the GPU's constant memory in compact form, so that only 4 bytes are needed per matrix element: the first two bytes are the row mark and column mark of the element, the third byte is the shift value of the sub-matrix relative to the identity matrix, and the last byte indicates whether the current element is 0. Then transfer the channel-initialization LLR values to the GPU;
Step 3, complete the check-node update on the GPU: for each CN node m and each VN node v ∈ V(m), calculate the extrinsic message:

Lr_mv = k · ( ∏_{v'∈V(m)\v} a_v'm ) · min_{v'∈V(m)\v} |Lq_v'm|    (2)

where a_v'm = sign(Lq_v'm), Lr_mv represents the check-node extrinsic message, V(m) represents the set of VN nodes participating in the m-th check equation, and V(m)\v represents that set with element v removed.
After the functional details of the CNP core are defined on the GPU, the above CN-node processing is mapped to a CUDA kernel (the CNP core), a thread grid Grid1 is allocated to it, and the GPU thread resources used at each level of Grid1 are declared from the codeword parameters of the LDPC code, with the thread block size set to the row weight of the LDPC check matrix. The multithreading of the GPU provides the acceleration: the rows of the check matrix H are processed in parallel, and the several variable nodes associated with each row compute their messages independently in parallel.
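The check-node update of formula (2) can be sketched serially as follows (an illustrative sketch: the function name, the dictionary representation of one row of H, and the default k = 0.75 taken from the test setup later in the document are assumptions; on the GPU, each row of H is handled by one thread block of the CNP core, with one thread per outgoing message):

```python
def cn_update(Lq_row, k=0.75):
    """Normalized min-sum update for one check node m.
    Lq_row maps each variable node v in V(m) to its incoming message Lq_vm.
    Each outgoing message Lr_mv combines the sign product and the minimum
    magnitude over V(m) \\ {v}, scaled by the correction factor k."""
    Lr = {}
    for v in Lq_row:
        others = [Lq_row[u] for u in Lq_row if u != v]
        sign = 1.0
        for val in others:
            sign *= 1.0 if val >= 0 else -1.0
        Lr[v] = k * sign * min(abs(val) for val in others)
    return Lr
```

For example, Lq_row = {0: 2.0, 1: -1.0, 2: 4.0} yields Lr = {0: -0.75, 1: 1.5, 2: -0.75}.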
Step 4, complete the variable-node update on the GPU:
for each variable node v and each m ∈ M(v), calculate:

Lq_vm = LLR_v + Σ_{m'∈M(v)\m} Lr_m'v    (3)

LQ_v = LLR_v + Σ_{m∈M(v)} Lr_mv    (4)

M(v) represents the set of CN nodes connected to the v-th VN node, and M(v)\m represents that set with element m removed.
After the variable-node processing kernel function is defined on the GPU, the processing of the variable nodes is mapped to another kernel function, the VNP core, and a thread grid Grid2 is allocated to it for acceleration; that is, the columns of the check matrix H are processed in parallel, and the several check nodes associated with each column operate independently in parallel.
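Formulas (3) and (4) can be sketched per column in the same spirit (function name and data layout are again assumptions of this sketch); computing the full sum once and subtracting each incoming message implements the exclusion over M(v) \ {m}:

```python
def vn_update(llr_v, Lr_col):
    """Variable-node update for one variable node v.
    Lr_col maps each check node m in M(v) to its incoming message Lr_mv.
    Returns ({m: Lq_vm}, LQ_v): formula (4) gives the total LQ_v, and
    formula (3) follows by subtracting the one excluded term."""
    LQ_v = llr_v + sum(Lr_col.values())         # formula (4)
    Lq = {m: LQ_v - Lr_col[m] for m in Lr_col}  # formula (3)
    return Lq, LQ_v
```

For llr_v = 1.0 and Lr_col = {0: 0.5, 1: -2.0}, this gives LQ_v = -0.5 and Lq = {0: -1.0, 1: 1.5}.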
Step 5, attempt a decoding decision:
after the iteration, if the message value LQ_n ≥ 0, decide that bit ĉ_n is 0, otherwise 1. Each time step 4 has been executed, form the decision codeword ĉ = (ĉ_1, ĉ_2, …, ĉ_n). If the syndrome satisfies the check constraint:

H · ĉ^T = 0    (5)

the iteration terminates and ĉ is output as the decoding result; otherwise execution returns to the second step. If decoding has not succeeded within the preset maximum number of iterations, failure is declared and the decoding iteration ends.
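The decision and syndrome test of step 5 can be sketched as follows (illustrative only; representing H as a dense list of 0/1 rows is an assumption of the sketch, not the compact 4-byte storage the patent uses on the GPU):

```python
def decide_and_check(LQ, H):
    """Hard decision c_hat_n = 0 if LQ_n >= 0 else 1, then the parity
    check H * c_hat^T = 0 (mod 2) over all rows of H."""
    c_hat = [0 if lq >= 0 else 1 for lq in LQ]
    ok = all(sum(h * c for h, c in zip(row, c_hat)) % 2 == 0 for row in H)
    return c_hat, ok
```

When ok is True the iteration terminates and c_hat is the decoder output; otherwise another iteration of the CN and VN updates runs.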
The complete LDPC coding and decoding system is built with the CPU and GPU platforms working jointly; the GPU and CPU parts are compiled with the CUDA 6.0 toolkit and Microsoft Visual Studio 2010 respectively. The codeword is the rate-0.5 (1024, 512) LDPC code described in the CCSDS standard, the maximum number of decoding iterations is set to 64, and the correction factor of the normalized min-sum decoding algorithm is k = 0.75. Wherein:
On the CPU platform, an original information bit signal u is generated and encoded with the rate-0.5 (1024, 512) LDPC code described in the CCSDS standard, forming code bits c ∈ {0, 1}; BPSK modulation (x = 2c − 1) yields the channel transmit symbols x ∈ {−1, 1}, which are transmitted over the AWGN channel with added white Gaussian noise, so the noise-corrupted signal y = x + n is obtained at the channel receiver. The channel-initialization LLR values are transferred to the GPU with the cudaMemcpy() function, and iterative decoding begins in the decoder on the GPU platform.
On the GPU platform, after GPU initialization, CNP-core processing and VNP-core processing are completed in sequence, the decoded codeword ĉ is returned to the CPU:
First, GPU initialization: the cudaMalloc() function is called to allocate memory for the check matrix H and the matrix is stored on the GPU, so that all the information the threads access and update resides in GPU memory. The H matrix of a quasi-cyclic LDPC code is highly structured and can be divided into sub-matrices of size Z of three types: the all-zero matrix, the identity matrix, and cyclic shifts of the identity matrix, which is why 4 bytes suffice to store a matrix element. In the system implementation the elements of the LDPC check matrix are read-only constants; they are stored in the GPU's constant memory, whose fast data-broadcast property optimizes access latency when the data are read.
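The 4-byte element descriptor can be sketched as below. The patent fixes only the meaning of the bytes (row mark, column mark, shift value, zero flag); the exact field order and the one-byte-per-field layout here are assumptions of this sketch:

```python
def pack_entry(block_row, block_col, shift, nonzero):
    """Pack one Z x Z circulant sub-block descriptor of a QC-LDPC check
    matrix into 4 bytes; every field must fit in 0..255."""
    return bytes([block_row, block_col, shift, 1 if nonzero else 0])

def unpack_entry(b):
    """Inverse of pack_entry: recover (row, col, shift, nonzero)."""
    block_row, block_col, shift, flag = b
    return block_row, block_col, shift, flag == 1
```

An array of such 4-byte records is what would be copied into the GPU's constant memory, in place of the full binary matrix.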
Then, using a cudaMemcpy () function to transfer the LLR value of the channel initialization to the GPU, wherein the instruction for transferring parameters in the CUDA program is:
cudaMemcpy(dev_lratio, lratio, n * sizeof(double), cudaMemcpyHostToDevice);
The four parameters indicate, in order: the destination of the transfer, a device pointer (dev_, i.e. on the GPU); the source, a host pointer (i.e. on the CPU); the number of bytes copied (the LLR values are double-precision data); and the transfer direction, here from host to device, i.e. from CPU to GPU. When the decoding result is returned, the parameter transfer from the device (GPU) back to the host (CPU) is likewise accomplished with the cudaMemcpy() function.
The second step: the CNP core. In each iteration, the CN nodes are updated from the VN nodes associated with each row of the LDPC check matrix H; every GPU thread simultaneously selects its associated variable-node messages, computes independently, and passes the values to the check nodes. For the rate-0.5 (1024, 512) LDPC code, the row weight of the check matrix H is 6, so each check node connects to 6 variable nodes. The main task of the GPU's CNP core is to compute, for all check nodes simultaneously, the messages each returns to its six associated variable nodes.
Each CN node is handled by one thread block, and each thread block contains 6 threads. Before using the GPU's thread grid and thread blocks, the kernel's execution configuration must be set up as follows, i.e. the sizes of the grid and the thread block must be declared:
dim3 dimBlock(x1,y1,1);
The declared block size is x1 × y1. After the sizes of the thread block and the grid are declared, the functional details of the check-node processing kernel function CNP are defined, and the kernel is then executed. The instruction for the GPU to execute the kernel function is:
CNP<<<dimGrid1, dimBlock>>>( … );
The first parameter inside the triple angle brackets is the number of blocks in the grid, the second is the number of threads per block, and the parentheses hold the function arguments of the CNP core. The kernel function VNP that processes the variable nodes is launched in the same way, and the VN-node decision can likewise be performed in parallel on the GPU.
The third step: the VNP core. After the CN-node update, the VN nodes are updated from the CN nodes associated with each column of the LDPC check matrix H; every GPU thread simultaneously selects its associated check-node messages, computes independently, and passes the values to the variable nodes.
After the iteration finishes, the VN-node decision is made on the GPU, each thread deciding one bit. Message passing during decoding uses the GPU's coalesced global-memory access pattern, which achieves higher memory bandwidth.
The fourth step: bit return. The decoded, decided bits are copied from the GPU back to the CPU and the bit error rate is computed; all the space allocated on the GPU is then freed.
On the CPU platform, ĉ is compared with the transmitted bits; the number of erroneous bits divided by the total number of transmitted bits gives the bit error rate, i.e. the decoding performance of the parallel decoding algorithm.
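The bit-error-rate statistic computed on the CPU can be sketched as follows (the function name and the list-of-frames layout are assumptions of this sketch):

```python
def bit_error_rate(decoded_frames, sent_frames):
    """Ratio of erroneous bits to total transmitted bits over all frames,
    as counted on the CPU after the decisions are copied back from the GPU."""
    errors = sum(d != s
                 for dec, snt in zip(decoded_frames, sent_frames)
                 for d, s in zip(dec, snt))
    total = sum(len(f) for f in sent_frames)
    return errors / total
```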
When decoding time is analyzed and compared, the CPU platform times decoding with the clock() function, which reaches millisecond resolution, while the GPU platform times decoding with the CUDA profiler timing function in the CUDA toolkit.
And (3) testing results:
the method comprises the following steps of (I) correctness verification:
when the variation range of the signal-to-noise ratio is 1dB-2.6dB, the BER performance curves of the adopted CUDA parallel decoding algorithm and the CPU serial decoding algorithm are basically coincident and are obviously reduced along with the increase of the signal-to-noise ratio, and the error code rate reaches 6 multiplied by 10 when the signal-to-noise ratio is 2.6dB-6The parallel decoding algorithm provided by the invention is proved not to cause error code performance loss.
(II) validity verification:
1. Check-matrix storage mode: 10000 frames of the (1024, 512) LDPC code are tested, one codeword is input per decoding, the GPU allocates 256 threads in total, and the check matrix is stored either in CPU memory or in the GPU's constant memory. Averaging the decoding latency of the two schemes over many runs shows that the latency with H in the GPU's constant memory is 71% of the latency with H in CPU memory. The fast data-broadcast property of GPU constant memory saves unnecessary host-device data-read latency, so decoding throughput improves by a factor of 1.3 to 1.6, as shown in the following table:
[Table: decoding latency with H stored in CPU memory versus in GPU constant memory]
2. Parallelism of the algorithm: 10000 frames of the (1024, 512) LDPC code are tested with the SNR fixed at 3.0 dB. The GPU allocates 256 threads in total, and four decoding schemes of different parallelism are timed: scheme 1 is traditional CPU decoding, scheme 2 parallelizes only the CN-node processing on the GPU, scheme 3 parallelizes only the VN-node processing on the GPU, and scheme 4 puts both on the GPU in parallel. The results show that schemes 2, 3 and 4 all clearly accelerate the original CPU decoding: parallelizing only the CN-node or only the VN-node processing gives a speedup of about 2.5×, while the fully parallel scheme achieves a speedup of 6.1×, as shown in the following table:
[Table: decoding latency and speedup of schemes 1-4]
3. Number of GPU threads invoked: 10000 frames are tested at an SNR of 3.0 dB, one codeword is fed per decoding, and different numbers of threads are allocated for decoding it. The results show that with a small number of GPU threads (32), the GPU parallel decoding time is slightly longer than the original CPU serial decoding, because so few threads accelerate decoding very little and cannot offset the host-device data-transfer latency. As the number of GPU threads used for parallel decoding grows, the high-speed character of CUDA-based parallel decoding gradually emerges. Within a certain range of thread counts, the decoding speedup grows approximately linearly with the number of GPU threads used, as shown in the following table:
[Table: decoding latency versus number of GPU threads]
The test results show that the decoding algorithm of the invention increases the decoding speed of the LDPC code and effectively improves system throughput.

Claims (1)

1. An LDPC code parallel decoding method based on CUDA architecture under AWGN channel is characterized by comprising the following steps:
step 1, for each non-zero element of the LDPC code check matrix H of size m × n, complete channel initialization on the CPU: Lq_nm = LLR_n = L(x_n|y_n) = log(P(x_n=0|y_n) / P(x_n=1|y_n)) = 2y_n/σ², where Lq_nm represents the variable-node message, LLR_n the channel initial value of bit n, x_n the n-th symbol of the codeword transmitted after BPSK modulation, y_n the n-th bit of the codeword sequence received after channel transmission, and σ² the variance of the AWGN channel noise;
step 2, allocate memory on the GPU and store the check matrix H in the GPU's constant memory in compact form, so that only 4 bytes are needed per matrix element: the first two bytes are the row mark and column mark of the element, the third byte is the shift value of the sub-matrix relative to the identity matrix, and the last byte indicates whether the current element is 0; then transfer the channel-initialization LLR values to the GPU;
step 3, on the GPU, for each CN node m and each VN node v ∈ V(m), calculate the check-node extrinsic message

Lr_mv = k · ( ∏_{v'∈V(m)\v} a_v'm ) · min_{v'∈V(m)\v} |Lq_v'm|

where a_v'm = sign(Lq_v'm); V(m) represents the set of VN nodes participating in the m-th check equation, the m-th check equation being the one corresponding to the m-th row of the LDPC check matrix, and V(m)\v represents that set with element v removed; in the check-node update, for every check node m and every adjacent variable node v ∈ V(m), solving for the message Lr_mv passed from check node m to variable node v requires the likelihood-ratio messages Lq_v'm passed from all the variable nodes v' ∈ V(m)\v connected to check node m other than v, and Lr_mv is computed from the minimum magnitude and the sign bits of those Lq_v'm;
after the functional details of the CNP core are defined on the GPU, the CN-node processing is mapped to the CNP core, a thread grid Grid1 is allocated to it, and the GPU thread resources used at each level of Grid1 are declared from the codeword parameters of the LDPC code, with the thread block size set to the row weight of the LDPC check matrix; the rows of the check matrix H are processed in parallel, and the several variable nodes associated with each row compute their messages independently in parallel;
step 4, on the GPU, for each variable node v and each m ∈ M(v), calculate

Lq_vm = LLR_v + Σ_{m'∈M(v)\m} Lr_m'v

and

LQ_v = LLR_v + Σ_{m∈M(v)} Lr_mv

where M(v) represents the set of CN nodes connected to the v-th VN node, M(v)\m represents that set with element m removed, and LLR_v represents the log-likelihood ratio computed from the channel received value y_v; in the variable-node update, for every variable node v and every adjacent check node m ∈ M(v), solving for the message Lq_vm passed from variable node v to check node m requires the likelihood-ratio messages passed to variable node v by all the other check nodes m' ∈ M(v)\m connected to it, except check node m; for variable node v, solving for the total log-likelihood ratio LQ_v used for the decision requires the likelihood-ratio messages Lr_mv passed to variable node v by every adjacent check node m ∈ M(v);
after the variable-node processing kernel function is defined on the GPU, the processing of the variable nodes is mapped to another kernel function, the VNP core, and a thread grid Grid2 is allocated to it for acceleration; the columns of the check matrix H are processed in parallel, and the several check nodes associated with each column operate independently in parallel;
step 5, after the iteration, if the message value LQ_n ≥ 0, decide that bit ĉ_n is 0, otherwise 1; each time step 4 has been executed, form the decision codeword ĉ = (ĉ_1, ĉ_2, …, ĉ_n); if the syndrome satisfies the check constraint

H · ĉ^T = 0

the iteration terminates and ĉ is output as the decoding result; otherwise go to step 2; if decoding has not succeeded within the preset maximum number of iterations, failure is declared and the decoding iteration ends.
CN201811596316.6A 2018-12-26 2018-12-26 LDPC code parallel decoding method based on CUDA architecture under AWGN channel Active CN109921877B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811596316.6A CN109921877B (en) 2018-12-26 2018-12-26 LDPC code parallel decoding method based on CUDA architecture under AWGN channel

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811596316.6A CN109921877B (en) 2018-12-26 2018-12-26 LDPC code parallel decoding method based on CUDA architecture under AWGN channel

Publications (2)

Publication Number Publication Date
CN109921877A CN109921877A (en) 2019-06-21
CN109921877B true CN109921877B (en) 2021-11-12

Family

ID=66959907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811596316.6A Active CN109921877B (en) 2018-12-26 2018-12-26 LDPC code parallel decoding method based on CUDA architecture under AWGN channel

Country Status (1)

Country Link
CN (1) CN109921877B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111817728B (en) * 2020-08-03 2022-03-01 华中科技大学 Simulation system for realizing LDPC coding and decoding based on hardware and working method thereof

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102932003A (en) * 2012-09-07 2013-02-13 上海交通大学 Accelerated QC-LDPC (Quasi-Cyclic Low-Density Parity-Check Code) decoding method based on GPU (Graphics Processing Unit) framework
CN103684474A (en) * 2012-08-31 2014-03-26 中国科学院上海高等研究院 Realization method of high-speed low density parity code (LDPC) decoder
CN104124980A (en) * 2014-07-16 2014-10-29 上海交通大学 High-speed secret negotiation method suitable for continuous variable quantum key distribution
CN106992856A (en) * 2017-03-29 2017-07-28 山西大学 The data coordinating method of extensive continuous variable quantum key distribution based on GPU

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US9312929B2 (en) * 2004-04-02 2016-04-12 Rearden, Llc System and methods to compensate for Doppler effects in multi-user (MU) multiple antenna systems (MAS)
US10425134B2 (en) * 2004-04-02 2019-09-24 Rearden, Llc System and methods for planned evolution and obsolescence of multiuser spectrum

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN103684474A (en) * 2012-08-31 2014-03-26 中国科学院上海高等研究院 Realization method of high-speed low density parity code (LDPC) decoder
CN102932003A (en) * 2012-09-07 2013-02-13 上海交通大学 Accelerated QC-LDPC (Quasi-Cyclic Low-Density Parity-Check Code) decoding method based on GPU (Graphics Processing Unit) framework
CN104124980A (en) * 2014-07-16 2014-10-29 上海交通大学 High-speed secret negotiation method suitable for continuous variable quantum key distribution
CN106992856A (en) * 2017-03-29 2017-07-28 山西大学 The data coordinating method of extensive continuous variable quantum key distribution based on GPU

Non-Patent Citations (1)

Title
"规则LDPC码在GPU上的加速译码" [Accelerated decoding of regular LDPC codes on a GPU]; 任计林 (Ren Jilin); 《西安电子科技大学学报》 (Journal of Xidian University); 2016-12-04; full text *

Also Published As

Publication number Publication date
CN109921877A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
CN102412847B (en) Method and apparatus for decoding low density parity check code using united node processing
KR100846869B1 (en) Apparatus for Decoding LDPC with Low Computational Complexity Algorithms and Method Thereof
Ferraz et al. A survey on high-throughput non-binary LDPC decoders: ASIC, FPGA, and GPU architectures
US8429512B2 (en) Reduced complexity LDPC decoder
CN103208995B (en) A kind of premature termination method of low density parity check code decoding
US10848182B2 (en) Iterative decoding with early termination criterion that permits errors in redundancy part
CN106452455B (en) Dynamic decoding method based on OpenCL mobile device QC-LDPC
CN107968657B (en) Hybrid decoding method suitable for low-density parity check code
CN108462496B (en) LDPC decoder based on random bit stream updating
CN108092673B (en) BP iterative decoding method and system based on dynamic scheduling
CN112134570A (en) Multi-mode LDPC decoder applied to deep space communication
US10892783B2 (en) Apparatus and method for decoding polar codes
CN109586733A (en) A kind of LDPC-BCH interpretation method based on graphics processor
CN115037310B (en) 5G LDPC decoder performance optimization method and architecture based on random computation
CN109921877B (en) LDPC code parallel decoding method based on CUDA architecture under AWGN channel
CN101136639B (en) Systems and methods for reduced complexity ldpc decoding
WO2017045142A1 (en) Decoding method and decoding device for ldpc truncated code
US20160049962A1 (en) Method and apparatus of ldpc encoder in 10gbase-t system
CN108933604B (en) Variable node processing method and device
US10727869B1 (en) Efficient method for packing low-density parity-check (LDPC) decode operations
CN113381769B (en) Decoder based on FPGA
CN114584151B (en) Decoding method of analog decoding circuit stopping criterion based on probability calculation
CN111835363B (en) LDPC code decoding method based on alternate direction multiplier method
CN110708077B (en) LDPC code large number logic decoding method, device and decoder
CN113271109A (en) Iterative cycle data storage method and system in LDPC decoding process

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant