CN114006621A - Parallel decoding method and system - Google Patents
- Publication number
- CN114006621A CN114006621A CN202111306835.6A CN202111306835A CN114006621A CN 114006621 A CN114006621 A CN 114006621A CN 202111306835 A CN202111306835 A CN 202111306835A CN 114006621 A CN114006621 A CN 114006621A
- Authority
- CN
- China
- Prior art keywords
- node
- data
- check
- parallel
- decoded
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M13/00—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
- H03M13/03—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
- H03M13/05—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
- H03M13/11—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits using multiple parity bits
- H03M13/1102—Codes on graphs and decoding on graphs, e.g. low-density parity check [LDPC] codes
- H03M13/1105—Decoding
- H03M13/1131—Scheduling of bit node or check node processing
- H03M13/1134—Full parallel processing, i.e. all bit nodes or check nodes are processed in parallel
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M13/00—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
- H03M13/03—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
- H03M13/05—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
- H03M13/11—Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits using multiple parity bits
- H03M13/1102—Codes on graphs and decoding on graphs, e.g. low-density parity check [LDPC] codes
- H03M13/1105—Decoding
- H03M13/1108—Hard decision decoding, e.g. bit flipping, modified or weighted bit flipping
Abstract
The invention relates to a parallel decoding method and system. The method introduces a SIMD instruction set to merge the data to be decoded into 32-bit data, which improves the universality and parallel scalability of the parallel decoding method, and performs the decoding operations in parallel based on a flooding scheduling decoding algorithm, which improves the parallel scalability and further raises the throughput of the parallel decoding method.
Description
Technical Field
The present invention relates to the field of channel coding technology, and in particular, to a parallel decoding method and system.
Background
An LDPC code is a linear block code that approaches the Shannon limit. Because of this excellent performance, LDPC codes have been widely adopted in digital communication systems such as 5G, WiMAX (802.16e), WiFi (802.11n), and DVB-S2. However, as communication systems demand ever higher data rates, how to design high-throughput LDPC decoders efficiently has become an active research topic.
High-throughput LDPC decoders are usually built on field-programmable gate arrays (FPGA) or application-specific integrated circuits (ASIC), but FPGA- and ASIC-based LDPC decoders offer poor flexibility and carry a higher design cost than software-radio implementations on a central processing unit (CPU) or graphics processing unit (GPU). The parallel processing capability of GPU devices far exceeds that of CPU devices: a GPU provides a parallel processing platform that combines computational power with programmability, scaling parallel throughput by executing thousands of threads concurrently. GPUs are widely used in high-performance computing (HPC), and computationally intensive tasks such as LDPC decoding are well suited to GPU processing.
GPU-based realization of LDPC decoders is currently an active research field, but most existing work optimizes only some features of the decoder, so the throughput improvement is limited. For example, one patent (a QC-LDPC code accelerated decoding method based on a GPU architecture) optimizes the memory allocation strategy of a GPU-based QC-LDPC decoder, and another (an LDPC code parallel decoding method based on a CUDA architecture in an AWGN channel) optimizes the memory allocation strategy of a general GPU LDPC decoder for the AWGN channel. Both optimizations focus on memory access on the GPU side and improve neither the parallelism nor the data transfer of the decoder. The papers (J. Ling and P. Cautereels, "Fast LDPC GPU Decoder for Cloud RAN," IEEE Embedded Systems Letters, doi: 10.1109/LES.2021.3052714) and (B. Le Gal, C. Jégo and J. Crenne, "A High Throughput Efficient Approach for Decoding LDPC Codes onto GPU Devices," IEEE Embedded Systems Letters, vol. 6, no. 2, pp. 29-32, June 2014) combine 8-bit fixed-point quantization with the GPU's single-instruction-multiple-data (SIMD) instruction set to decode multiple groups of data simultaneously, thereby extending the decoding parallelism. However, both designs adopt a layered scheduling decoding algorithm, so the decoders cannot be fully expanded in parallel and their intra-code parallelism is limited.
After analyzing the GPU-based LDPC decoders proposed so far, it is found that decoders from current research work exhibit the following problems in actual communication systems:
1. Existing GPU-based decoder designs optimize only the GPU's memory allocation scheme or the decoding parallelism, so the decoder is insufficiently optimized, the GPU's performance cannot be fully exploited, and considerable performance is wasted.
2. Although some existing GPU-based designs parallelize the LDPC decoder, most adopt a layered scheduling decoding algorithm, so the decoding cannot be fully expanded in parallel, the intra-code parallelism cannot reach its maximum, and the throughput gain is limited.
3. Most existing GPU-based designs raise only the inter-code parallelism by increasing the number of code blocks decoded simultaneously; the throughput gain of this approach is bounded by the GPU's performance.
4. Current GPU-based decoder designs are all targeted at general GPU equipment and lack decoding universality.
Disclosure of Invention
The invention aims to provide a parallel decoding method and a parallel decoding system, which further improve the throughput rate of LDPC decoding and improve the decoding universality.
In order to achieve the purpose, the invention provides the following scheme:
a parallel decoding method, the parallel decoding method comprising the steps of:
merging the data to be decoded in the data set to be decoded into 32-bit data by using a SIMD instruction set, sorting them, and storing the sorting result in a 32-bit array; the subscript of the 32-bit array represents the position, within the codeword, of the data to be decoded corresponding to each 32-bit datum;
calculating in parallel, based on a flooding scheduling decoding algorithm and the SIMD instruction set, the transmission information transmitted to the different variable nodes by each check node;
calculating in parallel, based on the flooding scheduling decoding algorithm and the SIMD instruction set, the transmission information transmitted to the different check nodes by each variable node;
returning to the step of calculating in parallel the transmission information transmitted to the different variable nodes by each check node, based on the flooding scheduling decoding algorithm and the SIMD instruction set, until the iteration count reaches the iteration-count threshold, and outputting the transmission information transmitted to the different variable nodes by each check node and to the different check nodes by each variable node after the last iteration;
according to the transmission information transmitted to the different variable nodes by each check node after the last iteration and the transmission information transmitted to the different check nodes by each variable node, calculating in parallel the posterior probability of each variable node;
carrying out hard decision operation on the posterior probability of each variable node in parallel to obtain a decoding result;
and splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each data to be decoded.
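As a non-authoritative illustration, the iterative steps above can be sketched as a scalar flooding-schedule normalized min-sum (NMS) decoder in Python; the SIMD packing and kernel structure are omitted, and all names (`decode_flooding_nms`, `alpha`, `max_iter`) are assumptions for illustration, not the patent's implementation:

```python
import numpy as np

def decode_flooding_nms(H, llr, alpha=0.75, max_iter=20):
    """Flooding-schedule normalized min-sum (NMS) LDPC decoding, scalar form.

    H    : (M, N) binary parity-check matrix
    llr  : (N,) channel LLRs (positive = bit 0 more likely)
    """
    H = np.asarray(H)
    llr = np.asarray(llr, dtype=float)
    M, N = H.shape
    R = np.zeros((M, N))              # check -> variable messages
    Q = H * llr                       # variable -> check messages (init: channel)
    hard = (llr < 0).astype(int)
    for _ in range(max_iter):
        # check-node update: sign product, min1/min2 magnitudes, scaling alpha
        for m in range(M):
            idx = np.flatnonzero(H[m])
            q = Q[m, idx]
            signs = np.where(q < 0, -1.0, 1.0)
            total_sign = np.prod(signs)
            mags = np.abs(q)
            order = np.argsort(mags)
            min1, min2 = mags[order[0]], mags[order[1]]
            for k, n in enumerate(idx):
                mag = min2 if k == order[0] else min1   # exclude own edge
                R[m, n] = alpha * total_sign * signs[k] * mag
        # variable-node update: posterior P, extrinsic Q = P - own message
        P = llr + R.sum(axis=0)
        Q = H * (P - R)
        hard = (P < 0).astype(int)    # hard decision on the posteriors
        if not np.any((H @ hard) % 2):
            break                     # all parity checks satisfied
    return hard
```

For example, with `H = [[1,1,1,0],[0,1,1,1]]` and channel LLRs `[2, 2, -0.5, 2]`, the weakly received third bit is corrected and the all-zero codeword is returned. In a flooding schedule all check nodes (and then all variable nodes) update independently, which is what allows every edge to map to its own GPU thread.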
Optionally, the data to be decoded in the data set to be decoded is merged into 32-bit data by using the SIMD instruction set, and is sorted, and the sorting result is stored in the 32-bit array, which specifically includes:
grouping 8-bit data to be decoded in a data set to be decoded in a mode of grouping 4 code words; if the last group is less than 4 code words, the last group is expanded into a group in a zero filling mode;
adopting a plurality of threads to execute the following operations in parallel on the data to be decoded of each group:
extracting data at the same position in the data to be decoded in each group of 4 code words and merging the data into 32-bit data;
and storing the 32-bit data of each group of 4 code words in a 32-bit array, wherein the subscript of the 32-bit array corresponds to the position of the data to be decoded in the code words.
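A host-side sketch of this packing step follows; on the GPU the merged words would be processed with byte-wise SIMD intrinsics, while here the merge is emulated with NumPy and the function name is illustrative:

```python
import numpy as np

def pack_codewords(llr8, n):
    """Merge the 8-bit data of groups of 4 codewords into 32-bit words.

    llr8 : (code_number, n) int8 array of quantized data to be decoded.
    Returns a (ceil(code_number/4), n) uint32 array; the index along the
    second axis is the position of the packed data within the codeword.
    """
    code_number = llr8.shape[0]
    pad = (-code_number) % 4
    if pad:                            # zero-fill the last, short group
        llr8 = np.vstack([llr8, np.zeros((pad, n), dtype=np.int8)])
    g = llr8.reshape(-1, 4, n).astype(np.uint8).astype(np.uint32)
    # byte k of each 32-bit word holds codeword k of the group
    return g[:, 0] | (g[:, 1] << 8) | (g[:, 2] << 16) | (g[:, 3] << 24)
```

Five codewords are thus padded to two groups of four, and each 32-bit word carries the same codeword position from all four codewords of its group.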
Optionally, the parallel computing of the transfer information transferred from each check node to different variable nodes based on the Flooding scheduling decoding algorithm and the SIMD instruction set specifically includes:
using the formula $s_{mn}^{(i)}=\prod_{n'\in N(m)\backslash n}\operatorname{sign}\left(Q_{n'm}^{(i-1)}\right)$, calculating in parallel the sign bit value of the transfer information transferred from each check node to the different variable nodes;
according to the sign bit value of the transfer information transferred from each check node to the n-th variable node, using the formula $R_{mn}^{(i)}=s_{mn}^{(i)}\cdot\max\left(\mathrm{min}_{mn}-\beta,\,0\right)$, where $\mathrm{min}_{mn}$ equals min2 when variable node n itself supplied min1 and equals min1 otherwise, calculating in parallel the transfer information transferred from each check node to the different variable nodes;
wherein $s_{mn}^{(i)}$ denotes the sign bit value of the transfer information transferred from the m-th check node to the n-th variable node in the i-th iteration, $Q_{n'm}^{(i-1)}$ denotes the transfer information transferred from the n'-th variable node to the m-th check node in the (i-1)-th iteration, N denotes the number of variable nodes, $R_{mn}^{(i)}$ denotes the transfer information transferred from the m-th check node to the n-th variable node, min1 and min2 respectively denote the minimum and second-minimum values among the transfer information received by the check nodes before the current iteration, and β denotes the offset parameter of the OMS algorithm.
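A minimal scalar sketch of this check-node update, assuming the standard offset min-sum (OMS) form with min1/min2 bookkeeping; the function name and the β value are illustrative:

```python
import numpy as np

def check_node_update_oms(q_row, beta=0.15):
    """OMS update for one check node: q_row holds the incoming
    variable-to-check messages Q_{n'm}; returns the outgoing R_{mn}."""
    q = np.asarray(q_row, dtype=float)
    signs = np.where(q < 0, -1.0, 1.0)
    total_sign = np.prod(signs)
    mags = np.abs(q)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]
    out = np.empty_like(q)
    for k in range(len(q)):
        # exclude edge k: use min2 when k itself supplied min1
        mag = min2 if k == order[0] else min1
        out[k] = total_sign * signs[k] * max(mag - beta, 0.0)  # offset beta
    return out
```

Keeping only min1 and min2 (rather than a per-edge minimum) is what makes the update cheap enough to run one thread per check node.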
Optionally, the parallel computing of the transfer information transferred from each check node to different variable nodes based on the Flooding scheduling decoding algorithm and the SIMD instruction set specifically includes:
using the formula $s_{mn}^{(i)}=\prod_{n'\in N(m)\backslash n}\operatorname{sign}\left(Q_{n'm}^{(i-1)}\right)$, calculating in parallel the sign bit value of the transfer information transferred from each check node to the different variable nodes;
according to the sign bit value of the transfer information transferred from each check node to the different variable nodes, using the formula $R_{mn}^{(i)}=\alpha\cdot s_{mn}^{(i)}\cdot\mathrm{min}_{mn}$, where $\mathrm{min}_{mn}$ equals min2 when variable node n itself supplied min1 and equals min1 otherwise, calculating in parallel the transfer information transferred from each check node to the different variable nodes;
wherein $s_{mn}^{(i)}$ denotes the sign bit value of the transfer information transferred from the m-th check node to the n-th variable node in the i-th iteration, $Q_{n'm}^{(i-1)}$ denotes the transfer information transferred from the n'-th variable node to the m-th check node in the (i-1)-th iteration, N denotes the number of columns of the check matrix, min1 and min2 respectively denote the minimum and second-minimum values among the transfer information received by the check nodes before the current iteration, and α denotes the normalization parameter of the NMS algorithm.
Optionally, the parallel computing of the transfer information transferred from each variable node to different check nodes based on the Flooding scheduling decoding algorithm and the SIMD instruction set specifically includes:
using the formula $S_{n}^{(i)}=\sum_{m\in M(n)}R_{mn}^{(i)}+C_{n}$, calculating in parallel the sum of the transfer information transferred to each variable node by the different check nodes;
according to the sum of the transfer information transferred from the different check nodes to each variable node, using the formula $Q_{nm}^{(i)}=S_{n}^{(i)}-R_{mn}^{(i)}$, calculating in parallel the transfer information transferred from each variable node to the different check nodes;
wherein $S_{n}^{(i)}$ denotes the sum of the transfer information passed by the different check nodes to the n-th variable node in the i-th iteration, $R_{mn}^{(i)}$ denotes the transfer information of the m-th check node to the n-th variable node, M denotes the number of rows of the check matrix, $Q_{nm}^{(i)}$ denotes the transfer information passed from the n-th variable node to the m-th check node in the i-th iteration, and $C_{n}$ denotes the initial data information of the n-th variable node.
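A scalar sketch of this variable-node update, written under the assumption that the sum $S_n$ includes the channel value $C_n$ and that each outgoing message excludes its own edge's contribution; the names are illustrative:

```python
import numpy as np

def variable_node_update(R_col, channel_llr):
    """Update one variable node n.

    R_col       : incoming check-to-variable messages R_{mn} for the checks
                  connected to n
    channel_llr : the node's initial channel information C_n
    Returns (S_n, Q_row): the message sum and the outgoing Q_{nm} messages.
    """
    R = np.asarray(R_col, dtype=float)
    S = R.sum() + channel_llr          # S_n = sum_m R_{mn} + C_n
    Q = S - R                          # exclude each edge's own message
    return S, Q
```

Computing the full sum once and subtracting each edge's own message avoids a per-edge inner loop, which keeps the per-thread work uniform.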
Optionally, the parallel computing of the posterior probability of each variable node according to the transfer information transferred from each check node to the different variable nodes after the last iteration and the transfer information transferred from each variable node to the different check nodes specifically includes:
according to the transfer information transferred from each check node to the different variable nodes after the last iteration and the transfer information transferred from each variable node to the different check nodes, using the formula $P_{n}=Q_{nm}^{(\max)}+R_{mn}^{(\max)}$, calculating in parallel the posterior probability of each variable node;
wherein $P_{n}$ denotes the posterior probability of the n-th variable node, $Q_{nm}^{(\max)}$ denotes the transfer information of the n-th variable node to the m-th check node after the last iteration, $R_{mn}^{(\max)}$ denotes the transfer information transferred from the m-th check node to the n-th variable node after the last iteration, and max denotes the iteration-count threshold.
Optionally, the performing of the hard decision operation on the posterior probability of each variable node in parallel to obtain the decoding result specifically includes:
using the formula $E_{n}=\begin{cases}0, & P_{n}\geq 0\\ 1, & P_{n}<0\end{cases}$, performing the hard decision operation on the posterior probability of each variable node in parallel to obtain the decoding result;
wherein $P_{n}$ denotes the posterior probability of the n-th variable node, and $E_{n}$ denotes the decoding result of the n-th variable node.
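The posterior and hard-decision steps can be sketched as follows, under the assumption that the posterior combines a variable node's last outgoing message with the matching incoming check message; the function names are illustrative:

```python
import numpy as np

def posterior(q_nm, r_mn):
    """Posterior LLR of variable node n, combining its last outgoing message
    Q_nm with the matching incoming message R_mn (assumed reading)."""
    return q_nm + r_mn

def hard_decision(P):
    """E_n = 0 when P_n >= 0 (decide bit 0), else 1."""
    return (np.asarray(P) < 0).astype(np.uint8)
```

Both operations are independent per variable node, so each maps naturally onto one GPU thread.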
Optionally, the splitting and reordering are performed on the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array, so as to obtain the decoding result corresponding to each data to be decoded, and then the method further includes:
and performing bit compression on the decoding result corresponding to each piece of data to be decoded, and storing the compressed decoding result into Shared memory.
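A host-side sketch of this bit-compression step, packing eight one-bit decisions per byte; NumPy's `packbits` stands in for the bit operations a GPU kernel would perform in Shared memory:

```python
import numpy as np

def compress_bits(decisions):
    """Pack one-bit decoding results 8-per-byte (MSB first), as a GPU kernel
    would stage them in Shared memory before copying back to the host."""
    return np.packbits(np.asarray(decisions, dtype=np.uint8))
```

Compressing the decisions before the device-to-host copy cuts the transfer volume by a factor of eight relative to one byte per bit.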
A parallel decoding system, the parallel decoding system comprising a GPU, the GPU comprising:
an Ordering Kernel module, comprising code_number/4 × N threads, each thread of which executes in parallel the step of merging the data to be decoded in the data set to be decoded into 32-bit data by using the SIMD instruction set, sorting them, and storing the sorting result in a 32-bit array; the subscript of the 32-bit array represents the position, within the codeword, of the data to be decoded corresponding to the 32-bit data; code_number denotes the number of codewords in the data set to be decoded, and N denotes the number of columns of the check matrix;
a CN_compute Kernel module, comprising code_number/4 × M threads, each thread of which executes in parallel the step of calculating the transfer information transferred from each check node to the different variable nodes;
a VN_compute Kernel module, comprising code_number/4 × N threads, each thread of which executes in parallel the step of calculating the transfer information transferred from each variable node to the different check nodes;
each thread of the VN_compute Kernel module also executes in parallel, when the iteration count reaches the iteration-count threshold, the step of calculating the posterior probability of each variable node according to the transfer information transferred from each check node to the different variable nodes after the last iteration and the transfer information transferred from each variable node to the different check nodes;
a Hard_decision Kernel module, comprising code_number/4 × N threads, each thread of which executes in parallel the hard decision operation on the posterior probability of each variable node to obtain the decoding result;
a Reordering Kernel module, comprising code_number/8 × N threads, which executes in parallel the steps of splitting and reordering the decoding results corresponding to each 32-bit array according to the subscripts of the 32-bit array to obtain the decoding result corresponding to each piece of data to be decoded.
Optionally, the parallel decoding system includes a CPU;
the CPU is configured to perform the quantization operation on the data to be decoded in the data set to be decoded and the compression operation on the check matrix, and to store the quantized data to be decoded and the check matrix information in the page-locked memory region.
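The text does not spell out the compression format of the check matrix; a common choice, shown here as an assumption, is a compressed-sparse-row (CSR) layout that keeps only the nonzero column indices of each row:

```python
import numpy as np

def compress_check_matrix(H):
    """Compress the sparse check matrix to CSR-style index arrays:
    col_idx lists the nonzero columns row by row, row_ptr marks row starts."""
    col_idx, row_ptr = [], [0]
    for row in np.asarray(H):
        col_idx.extend(np.flatnonzero(row).tolist())
        row_ptr.append(len(col_idx))
    return np.array(row_ptr), np.array(col_idx)
```

Since an LDPC check matrix is mostly zeros, this shrinks the matrix data copied to the GPU and lets each check-node thread iterate only over its own edges.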
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention discloses a parallel decoding method, which comprises the following steps: merging the data to be decoded in the data set to be decoded into 32-bit data by adopting an SIMD instruction set, sequencing, and storing a sequencing result in a 32-bit array; based on a flood scheduling decoding algorithm and a SIMD instruction set, parallel computing transmission information transmitted to different variable nodes by each check node; based on a flood scheduling decoding algorithm and a SIMD instruction set, parallel computing transmission information transmitted from each variable node to different check nodes; according to the transmission information transmitted to different variable nodes by each check node after the last iteration and the transmission information transmitted to different check nodes by each variable node, the posterior probability of each check node is calculated in parallel; carrying out hard decision operation on the posterior probability of each variable node in parallel to obtain a decoding result; and splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each data to be decoded. The method introduces the SIMD instruction set to combine the data to be decoded into 32-bit data, improves the universality and the parallel expansibility of the parallel decoding method, performs parallel decoding operation based on the flood scheduling decoding algorithm, improves the parallel expansibility, and further improves the throughput rate of the parallel decoding method.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
Fig. 1 is a flowchart of a decoding operation performed based on a parallel decoding system according to embodiment 3 of the present invention;
fig. 2 is a schematic diagram illustrating application distribution to a GPU side memory according to embodiment 3 of the present invention;
fig. 3 is a flowchart illustrating that the GPU terminal decodes 4 codeword data according to embodiment 3 of the present invention;
fig. 4 is a flowchart of a compression operation performed on a check matrix by a CPU according to embodiment 3 of the present invention;
fig. 5 is a flowchart illustrating the conversion and sorting operations of the data to be decoded performed by the Ordering Kernel module at the GPU terminal according to embodiment 3 of the present invention;
fig. 6 is a flowchart for verifying the technical effect of the parallel decoding method and system according to embodiment 4 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a parallel decoding method and a parallel decoding system, which further improve the throughput rate of LDPC decoding and improve the decoding universality.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1
The invention provides a parallel decoding method, which comprises the following steps:
merging the data to be decoded in the data set to be decoded into 32-bit data by adopting an SIMD instruction set, sequencing, and storing a sequencing result in a 32-bit array; and the 32-bit array uses subscript to represent the position of the data to be decoded corresponding to the 32-bit data in the 32-bit array in the code word.
The method comprises the following steps of combining data to be decoded in a data set to be decoded into 32-bit data by adopting a SIMD instruction set, sequencing the data, and storing a sequencing result in a 32-bit array, and specifically comprises the following steps: grouping 8-bit data to be decoded in a data set to be decoded in a mode of grouping 4 code words; if the last group is less than 4 code words, the last group is expanded into a group in a zero filling mode; adopting a plurality of threads to execute the following operations in parallel on the data to be decoded of each group: extracting data at the same position in the data to be decoded in each group of 4 code words and merging the data into 32-bit data; and storing the 32-bit data of each group of 4 code words in a 32-bit array, wherein the subscript of the 32-bit array corresponds to the position of the data to be decoded in the code words.
And based on the Flooding scheduling decoding algorithm and the SIMD instruction set, the transmission information transmitted to different variable nodes by each check node is calculated in parallel.
The parallel computation, based on the flooding scheduling decoding algorithm and the SIMD instruction set, of the transfer information transferred from each check node to the different variable nodes specifically includes: using the formula $s_{mn}^{(i)}=\prod_{n'\in N(m)\backslash n}\operatorname{sign}\left(Q_{n'm}^{(i-1)}\right)$, calculating in parallel the sign bit value of the transfer information transferred from each check node to the different variable nodes; according to the sign bit value, using the formula $R_{mn}^{(i)}=s_{mn}^{(i)}\cdot\max\left(\mathrm{min}_{mn}-\beta,\,0\right)$, where $\mathrm{min}_{mn}$ equals min2 when variable node n itself supplied min1 and equals min1 otherwise, calculating in parallel the transfer information transferred from each check node to the different variable nodes; wherein $s_{mn}^{(i)}$ denotes the sign bit value of the transfer information transferred from the m-th check node to the n-th variable node in the i-th iteration, $Q_{n'm}^{(i-1)}$ denotes the transfer information transferred from the n'-th variable node to the m-th check node in the (i-1)-th iteration, N denotes the number of variable nodes, $R_{mn}^{(i)}$ denotes the transfer information transferred from the m-th check node to the n-th variable node, min1 and min2 respectively denote the minimum and second-minimum values among the transfer information received by the check nodes before the current iteration, and β denotes the offset parameter of the OMS algorithm.
Alternatively, the parallel computation, based on the flooding scheduling decoding algorithm and the SIMD instruction set, of the transfer information transferred from each check node to the different variable nodes specifically includes: using the formula $s_{mn}^{(i)}=\prod_{n'\in N(m)\backslash n}\operatorname{sign}\left(Q_{n'm}^{(i-1)}\right)$, calculating in parallel the sign bit value of the transfer information transferred from each check node to the different variable nodes; according to the sign bit value, using the formula $R_{mn}^{(i)}=\alpha\cdot s_{mn}^{(i)}\cdot\mathrm{min}_{mn}$, where $\mathrm{min}_{mn}$ equals min2 when variable node n itself supplied min1 and equals min1 otherwise, calculating in parallel the transfer information transferred from each check node to the different variable nodes; wherein $s_{mn}^{(i)}$ denotes the sign bit value of the transfer information transferred from the m-th check node to the n-th variable node in the i-th iteration, $Q_{n'm}^{(i-1)}$ denotes the transfer information transferred from the n'-th variable node to the m-th check node in the (i-1)-th iteration, N denotes the number of columns of the check matrix, min1 and min2 respectively denote the minimum and second-minimum values among the transfer information received by the check nodes before the current iteration, and α denotes the normalization parameter of the NMS algorithm.
And based on the Flooding scheduling decoding algorithm and the SIMD instruction set, the transmission information transmitted to different check nodes by each variable node is calculated in parallel.
The parallel computation, based on the flooding scheduling decoding algorithm and the SIMD instruction set, of the transfer information transferred from each variable node to the different check nodes specifically includes: using the formula $S_{n}^{(i)}=\sum_{m\in M(n)}R_{mn}^{(i)}+C_{n}$, calculating in parallel the sum of the transfer information transferred to each variable node by the different check nodes; according to this sum, using the formula $Q_{nm}^{(i)}=S_{n}^{(i)}-R_{mn}^{(i)}$, calculating in parallel the transfer information transferred from each variable node to the different check nodes; wherein $S_{n}^{(i)}$ denotes the sum of the transfer information passed by the different check nodes to the n-th variable node in the i-th iteration, $R_{mn}^{(i)}$ denotes the transfer information of the m-th check node to the n-th variable node, M denotes the number of rows of the check matrix, $Q_{nm}^{(i)}$ denotes the transfer information passed from the n-th variable node to the m-th check node in the i-th iteration, and $C_{n}$ denotes the initial data information of the n-th variable node.
The method then returns to the step of computing in parallel, based on the Flooding-scheduled decoding algorithm and the SIMD instruction set, the messages passed from each check node to the different variable nodes, until the number of iterations reaches the iteration-count threshold, and outputs the messages passed from each check node to the different variable nodes and from each variable node to the different check nodes after the last iteration.
According to the messages passed from each check node to the different variable nodes and from each variable node to the different check nodes after the last iteration, the a-posteriori probability of each variable node is computed in parallel.
The parallel computation of the a-posteriori probability of each variable node according to the messages of the last iteration specifically includes: computing in parallel, using the formula APP_n = LLR_n + Σ_{m=1}^{M} C_mn^max, the a-posteriori probability of each variable node; wherein APP_n represents the a-posteriori probability of the nth variable node, LLR_n represents the initial data information of the nth variable node, C_mn^max represents the message passed from the mth check node to the nth variable node after the last iteration, and max represents the iteration-count threshold.
And carrying out hard decision operation on the posterior probability of each variable node in parallel to obtain a decoding result.
The parallel hard-decision operation on the a-posteriori probability of each variable node to obtain the decoding result specifically includes: applying in parallel, for each variable node, the formula E_n = 1 if APP_n > 0 and E_n = 0 otherwise; wherein APP_n represents the a-posteriori probability of the nth variable node and E_n represents the decoding result of the nth variable node.
And splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each data to be decoded.
After the decoding result corresponding to each 32-bit array is split and reordered according to the subscript of each 32-bit array to obtain the decoding result corresponding to each piece of data to be decoded, the method further includes: performing Bit compression on the decoding result corresponding to each piece of data to be decoded, and storing the compressed decoding result in Shared Memory.
Example 2
The present invention also provides a parallel decoding system, including a GPU, the GPU including:
the Ordering Kernel module comprises code_number/4 × N threads, and each thread in the Ordering Kernel module executes in parallel: merging the data to be decoded in the data set to be decoded into 32-bit data using the SIMD instruction set, sorting them, and storing the sorting result in a 32-bit array; the subscript of the 32-bit array indicates the position, within the codeword, of the data to be decoded corresponding to each 32-bit datum; code_number represents the number of codewords in the data set to be decoded, and N represents the number of columns of the check matrix.
The CN_Computer Kernel module comprises code_number/4 × M threads, each of which executes in parallel the step of computing the messages passed from each check node to the different variable nodes;
and the VN_Computer Kernel module comprises code_number/4 × N threads, each of which executes in parallel the step of computing the messages passed from each variable node to the different check nodes.
Each thread in the VN_Computer Kernel module also executes in parallel, when the iteration count reaches the iteration-count threshold, the computation of the a-posteriori probability of each variable node according to the messages passed from each check node to the different variable nodes and from each variable node to the different check nodes after the last iteration.
The Hard_Decision Kernel module comprises code_number/4 × N threads, and each thread in the Hard_Decision Kernel module executes in parallel the step of performing the hard-decision operation on the a-posteriori probability of each variable node to obtain the decoding result.
The Reordering Kernel module comprises code_number/8 × N threads, which execute in parallel the steps of splitting and reordering the decoding results corresponding to the 32-bit arrays according to the subscripts of the 32-bit arrays to obtain the decoding result corresponding to each piece of data to be decoded.
When current GPU-based decoder designs decode large-scale data, data transmission between the GPU and the CPU introduces a large amount of transmission delay, yet most current research does not optimize this time when raising decoder throughput. The present system therefore processes the data to be decoded and the check matrix at the CPU end to reduce the transmission delay. Specifically: the parallel decoding system further comprises a CPU; the CPU performs the quantization operation on the data to be decoded in the data set to be decoded, performs the compression operation on the check matrix, and stores the quantized data to be decoded and the check matrix information in a page-locked memory area.
Example 3
Because the prevailing ways of raising the throughput of GPU-based LDPC decoders optimize only some characteristics of the decoder, the optimization is not comprehensive and causes several problems, for example: the GPU's performance cannot be fully exploited and is partly wasted; the scheduling structure of the chosen decoding algorithm cannot be fully unrolled in parallel, limiting intra-code parallelism; the means of raising throughput is limited by GPU performance; data transmission is not optimized, hurting throughput improvement; and decoding generality is lacking, with optimization targeting only general-purpose GPU devices.
The parallel decoding system can fully utilize memory resources in the GPU, can fully and parallelly expand LDPC decoding operation, further improves the parallelism of the decoder without being limited by the performance of the GPU, can optimize data transmission between the GPU and the CPU, and can be applied to low-power-consumption embedded GPU equipment.
The parallel decoding method and system provided by the invention constitute a general-purpose high-throughput LDPC decoder: LDPC decoding of any code length and code rate can be performed while providing high throughput. Both the normalized min-sum (NMS) and the offset min-sum (OMS) decoding algorithms are supported.
The parallel decoding method and system raise throughput mainly in four ways: first, the memory resources inside the GPU are fully utilized, improving memory-access efficiency during decoding; second, the decoding algorithm uses Flooding scheduling, so the LDPC decoding algorithm can be fully unrolled in parallel and intra-code parallelism is maximal; third, the GPU's SIMD instruction set further extends the decoder's inter-code parallelism without being limited by GPU performance; and fourth, data transmission between the GPU and the CPU is optimized, reducing transmission delay and further raising throughput.
The specific implementation steps of the parallel decoding method and system of the present invention are shown in fig. 1, the schematic diagram of the application distribution of the GPU side memory is shown in fig. 2, and the flow of decoding 4 codeword data at the GPU side is shown in fig. 3. M, N in fig. 2 and 3 are the number of rows and columns, respectively, of the check matrix H.
The parallel decoding method of the invention is completed in seven main steps: step one, the data to be decoded are transmitted from the CPU end to the GPU end. Step two, the data to be decoded are sorted at the GPU end. Step three, the decoding operation is executed at the GPU end to decode the transmitted data. Step four, the hard-decision operation is executed at the GPU end, and the decoding result is obtained by hard decision. Step five, the reordering operation is executed on the decoding result at the GPU end. Step six, the Bit compression operation is executed at the GPU end to compress the hard-decided decoding result. Step seven, the Bit-compressed decoding result is transmitted back from the GPU to the CPU. With the number of codewords in the data to be decoded denoted code_number and a check matrix H of M rows and N columns, the parallel decoding steps of the system are as follows:
the method comprises the following steps: transmitting data to be decoded from CPU to GPU
And transmitting the data to be decoded from the CPU end to the GPU end, wherein the specific operation of the step is as follows:
1. This step performs 8-bit quantization of the data to be decoded.
After quantization, the amount of data transmitted from the CPU end to the GPU end is only 1/4 of that of a 32-bit quantization scheme. The specific quantization steps are:
(1) initializing an expansion coefficient Expand _ factor, a data upper limit Pos _ LLR and a data lower limit Neg _ LLR according to user requirements.
(2) Multiplying the data to be decoded by the expansion coefficient Expand _ factor, updating the data to be decoded, and rounding the updated data to be decoded.
(3) After rounding is finished, carrying out amplitude limiting on data to be decoded according to the upper limit Pos _ LLR and the lower limit Neg _ LLR.
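The three quantization sub-steps above can be sketched as follows; this is a minimal pure-Python illustration, and the function name and argument order are assumptions for the sketch, not taken from the patent.

```python
def quantize_llr(llr, expand_factor, pos_llr, neg_llr):
    """8-bit quantization of the soft data to be decoded:
    (1) scale by the expansion coefficient Expand_factor,
    (2) round to the nearest integer,
    (3) clip to [Neg_LLR, Pos_LLR] so each value fits in one signed byte."""
    out = []
    for x in llr:
        v = round(x * expand_factor)        # expand and round
        v = max(neg_llr, min(pos_llr, v))   # amplitude limiting
        out.append(v)
    return out
```

For example, `quantize_llr([0.9, -3.2, 100.0], 16, 127, -127)` yields `[14, -51, 127]`: the last value is clipped to the upper limit.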
2. After the data to be decoded are quantized, the check matrix H is compressed to obtain the check matrix information H_c.
The check matrix H of an LDPC code contains a large number of '0' elements that do not participate in the decoding operation, wasting storage space. This step stores only the non-zero elements of the check matrix H, reducing that waste. Taking the 5G check matrix built from base graph BG1 with an expansion factor of 384 as an example, after the compression of this step the occupied space is only 0.03% of the original. The compression operation is shown in fig. 4; the specific steps are:
(1) Count the number of non-'0' elements in each row of the check matrix H, i.e. the row weight Row_degree of each row.
(2) Transform the check matrix according to the row weight Row_degree of each row of H, sorting the rows of the check matrix in descending order of Row_degree.
(3) Count the positions of the non-zero elements in each row after sorting, and then store only each row's weight and the position information of its non-zero elements, i.e. the check matrix information H_c.
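The compression sub-steps can be sketched in a few lines of Python; the `(weight, positions)` tuple layout used here is an assumed in-memory encoding of H_c, not the patent's exact storage format.

```python
def compress_check_matrix(H):
    """Compress a binary check matrix H: sort rows by row weight
    (number of non-zero entries) in descending order, then keep only
    each row's weight and the column positions of its non-zero entries."""
    rows = []
    for row in H:
        positions = [j for j, v in enumerate(row) if v != 0]
        rows.append((len(positions), positions))
    rows.sort(key=lambda rw: rw[0], reverse=True)  # heaviest rows first
    return rows  # H_c: list of (Row_degree, non-zero positions)
```

For a matrix with rows of weight 2, 3 and 1, the rows come back ordered 3, 2, 1 with only their non-zero column indices stored.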
3. At the CPU end, the quantized data to be decoded and the check matrix information H_c are stored in page-locked memory.
The physical address of page-locked memory does not change after allocation, so the addressing operation need not be executed again during data transmission, improving transmission efficiency. Tests show that after page-locked memory is adopted, the transmission rate between the CPU and the GPU improves by about 3 times.
Specifically: the cuda_chk_alloc library is called, and the data to be decoded and the check matrix information are stored in a page-locked memory area in the CPU.
4. The data to be decoded and the check matrix information H_c are transmitted from the page-locked memory area at the CPU end to the GPU end over the PCIe bus.
Step two: executing sorting operation on data to be decoded at GPU (graphics processing Unit)
Because the SIMD instruction set of the GPU is introduced, the data needs to be sequenced after the GPU end receives the data to be decoded.
This step performs the sorting operation in parallel through the designed Ordering Kernel module, reducing the sorting time. The thread count of the Ordering Kernel is code_number/4 × N. The sorting time of the Ordering Kernel module is independent of the number of decoding iterations; if the iteration count is large, the sorting time can be neglected. Temporary variables generated during sorting are stored in the Registers of the corresponding threads.
The specific operation of the Ordering Kernel module is shown in fig. 5, and the specific operation steps are as follows:
1. First, the data to be decoded are grouped, with the data of every 4 codewords forming one group. Remaining data of fewer than 4 codewords are expanded into a full group by zero padding.
2. And extracting the grouped data to be decoded, and extracting the data at the same position in each group of 4 code words.
3. Because the data to be decoded is subjected to 8-bit quantization, 4 pieces of 8-bit data at the same position in each group are extracted and combined into 1 piece of 32-bit data.
4. And finally storing the data to be decoded in a 32-bit array, wherein subscripts of the array correspond to the positions of the data to be decoded in the code words.
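The grouping-and-merging steps above can be sketched as follows; which codeword occupies which byte lane of the 32-bit word is an assumption of this sketch (codeword 0 in the most significant byte), not stated by the patent.

```python
def pack4(cw0, cw1, cw2, cw3):
    """Merge 4 codewords of 8-bit values position by position:
    words[n] holds the nth sample of all 4 codewords, one per byte
    lane, so the array subscript equals the position in the codeword."""
    words = []
    for a, b, c, d in zip(cw0, cw1, cw2, cw3):
        words.append(((a & 0xFF) << 24) | ((b & 0xFF) << 16)
                     | ((c & 0xFF) << 8) | (d & 0xFF))
    return words
```

For example, packing the first samples `0x11, 0x22, 0x33, 0x44` of four codewords yields the single word `0x11223344`.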
Step three: performing decoding processing operation at GPU terminal
After finishing sequencing the data to be decoded transmitted to the GPU, starting to execute decoding operation, wherein the decoding operation comprises the following specific steps:
1. an initialization operation is first performed on variables required for decoding.
To accelerate the initialization, an LLR_Init Kernel module is designed at the GPU end to complete the initialization in parallel. The thread count of the LLR_Init Kernel module is code_number/4 × N; each thread corresponds to a VN node and initializes the VN node information from the data to be decoded LLR_n. Since LLR_n does not change during operation it could be treated as a read-only variable, but during decoding adjacent threads read LLR data at adjacent positions, so Shared Memory storage is adopted to reduce the number of memory accesses and improve memory-access efficiency.
2. The iterative decoding is started with the number of decoding iterations i set to 0.
3. And carrying out correlation operation of the Check Node (CN).
The CN operation of this step is completed by the CN_Computer Kernel module designed at the GPU end. The thread count of the CN_Computer Kernel module is code_number/4 × M. The CN_Computer Kernel can take the normalized min-sum (NMS) form or the offset min-sum (OMS) form. The sign bit is first formed as sign_c2v = ∏_{n'=1, n'≠n}^{N} sign(V_n'm^(i-1)). If the LDPC decoding algorithm is set to the OMS algorithm, the calculation formula is:
C_mn^i = sign_c2v · max(min' − β, 0)   (2)
If the LDPC decoding algorithm is set to the NMS algorithm, formula (2) is changed to:
C_mn^i = α · sign_c2v · min'
where min' is min1 unless variable node n itself supplied min1, in which case min' is min2.
Here N is the number of columns of the check matrix H, i is the current decoding iteration number, parameter α is the NMS normalization parameter, and parameter β is the offset parameter of the OMS algorithm. C_mn is the message (C2V) from the mth CN to the nth VN node, V_nm is the message (V2C) from the nth VN node to the mth CN node, sign_c2v is the sign bit value of C_mn, and min1 and min2 are respectively the minimum and second-minimum magnitudes among all the V2C messages arriving at the check node.
Because the V2C messages V_nm and the C2V messages C_mn are frequently accessed by different threads during iterative decoding, storing them in Shared Memory further improves memory-access efficiency. Meanwhile, since sign_c2v, min1 and min2 are temporary variables generated during operation, they are placed in the Registers of the corresponding thread.
When the code _ number is greater than or equal to 4, the step can process a plurality of code blocks in parallel through the SIMD instruction set, and the specific instruction operation is as follows:
To compute the value of the C2V message C_mn, the V2C messages V_n'm of all VN nodes connected to the corresponding CN node are read; the sign bits of all V_n'm except n' = n are multiplied together using the xor instruction. During the same pass, the min instruction records the minimum min1 and second minimum min2 of the message magnitudes. Finally, the value of C_mn is computed by combining the xor, sub and abs instructions.
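The min1/min2 check-node computation can be sketched scalar-wise as follows; this is a hedged Python illustration of the NMS form with normalization parameter α, and the dictionary-based interface and default α value are assumptions, not the patent's implementation.

```python
def cn_update(v2c, alpha=0.8):
    """NMS check-node update for one check node.
    v2c maps variable-node index n' -> V2C message V_{n'm}; returns
    n -> C2V message C_{mn} using the sign product and the min1/min2
    trick (own sign and own magnitude are excluded per edge)."""
    sign_prod = 1
    min1 = min2 = float('inf')   # smallest and second-smallest |V|
    argmin = None
    for n, v in v2c.items():
        if v < 0:
            sign_prod = -sign_prod
        mag = abs(v)
        if mag < min1:
            min2, min1, argmin = min1, mag, n
        elif mag < min2:
            min2 = mag
    c2v = {}
    for n, v in v2c.items():
        sign_n = sign_prod * (1 if v >= 0 else -1)  # divide out n's own sign
        mag_n = min2 if n == argmin else min1       # exclude n's own magnitude
        c2v[n] = alpha * sign_n * mag_n
    return c2v
```

The OMS variant replaces `alpha * mag_n` with `max(mag_n - beta, 0.0)`; the rest of the routine is unchanged.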
4. Correlation operation of Variable Nodes (VN) is performed.
The VN-related operation of this step is performed in parallel by the VN_Computer Kernel module designed at the GPU end. The thread count of the VN_Computer Kernel is code_number/4 × N. Each thread in the VN_Computer Kernel computes:
sum = Σ_{m=1}^{M} C_mn,   V_nm = LLR_n + sum − C_mn,   APP_n = LLR_n + sum
where M is the number of rows of the check matrix H, V_nm is the V2C message from the nth VN to the mth CN, sum is the sum of the C2V messages C_mn of the CNs connected to the VN, APP_n is the a-posteriori probability (APP) data after each decoding iteration, and LLR_n is the original VN information data.
The sum variable is stored in the cache Registers of the corresponding thread.
When the code _ number is greater than or equal to 4, the step can process a plurality of code blocks in parallel through the SIMD instruction set, and the specific instruction operation is as follows:
To compute the value of the V2C message V_nm, the sum of the C2V messages of all CN nodes connected to the corresponding VN node is first obtained through the add instruction. Then the V2C value is completed from the sum through the sub instruction, and the data are clamped through the max and min instructions.
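A scalar Python sketch of the variable-node update (names assumed, not from the patent); the max/min clamping mentioned above is omitted here for brevity.

```python
def vn_update(llr_n, c2v):
    """Variable-node update for one VN: sum all incoming C2V messages
    (the add step), then form each outgoing V2C message by subtracting
    the message that arrived over the same edge (the sub step);
    also returns the APP value for the node."""
    total = sum(c2v.values())                             # sum of C_{mn}
    v2c = {m: llr_n + total - c for m, c in c2v.items()}  # V_nm = LLR_n + sum - C_mn
    app = llr_n + total                                   # a-posteriori value
    return v2c, app
```

With `llr_n = 1.0` and incoming messages `{0: 2.0, 1: -0.5}`, the sum is 1.5, so the edge to check node 0 carries 0.5, the edge to check node 1 carries 3.0, and the APP is 2.5.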
5. And adding 1 to the decoding iteration number and judging, if the decoding iteration number does not exceed the maximum decoding iteration number Max _ iter _ number, restarting to execute the operation from the step 3, and otherwise stopping decoding and entering the next step.
Step four: executing hard decision operation at GPU end to obtain decoding result
After the maximum number of iterations is reached, the APP value of each VN node must be hard-decided to obtain the decoding result E_n. In this step a Hard_Decision Kernel module is designed to process the hard decisions in parallel. The thread count of the Hard_Decision Kernel is code_number/4 × N. When the APP value is greater than 0 the decision result is 1, and when it is less than 0 the decision result is 0.
The hard decision is specifically formulated as: E_n = 1 if APP_n > 0, and E_n = 0 otherwise.
when the code _ number is greater than or equal to 4, the parallelization processing mode is performed in combination with the SIMD instruction set in the step, and the specific instruction operation is as follows:
The hard-decision operation is completed by calling the min instruction; when the value is greater than 0 the decision result is 1, and when it is less than 0 the decision result is 0.
Step five: reordering decoding results at GPU end
Since the data is sorted before being decoded, the data needs to be reordered and put back to the original position after the decoding is finished.
The reordering operation of this step is performed by the Reordering Kernel module designed at the GPU end, reducing the reordering time. The thread count of the Reordering Kernel is code_number/8 × N. The Reordering Kernel is executed only once after decoding ends, and its time is independent of the decoding iterations; if the iteration count is large, this time can be neglected. The specific operation steps are shown in fig. 5 and are as follows:
1. and splitting a decoding result. Each 32-bit data is split into 4 8-bit data.
2. After splitting, put the split data back into the corresponding positions of the 4 codewords according to the subscript of the 32-bit data array.
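The split-and-restore step is the inverse of the earlier 4-codeword packing; the sketch below assumes, as before, that codeword 0 occupies the most significant byte lane (an assumption of this illustration).

```python
def unpack4(words):
    """Split each 32-bit word back into 4 8-bit results and put them
    back into the 4 codewords; the word subscript is the position
    of the datum inside each codeword."""
    cw = [[], [], [], []]
    for w in words:
        cw[0].append((w >> 24) & 0xFF)
        cw[1].append((w >> 16) & 0xFF)
        cw[2].append((w >> 8) & 0xFF)
        cw[3].append(w & 0xFF)
    return cw
```

Applied to the word `0x11223344`, the four codewords receive `0x11`, `0x22`, `0x33` and `0x44` respectively, undoing the packing performed before decoding.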
Step six: performing Bit compression processing on the decoding result at the GPU end
After the reordering is finished, the decoding result is subjected to Bit compression processing in the step.
The Bit compression operation of this step compresses the data using the Bit_Packed Kernel module designed at the GPU end. The thread count of the Bit_Packed Kernel module is code_number/8 × N; this step compresses the final decoding result, reducing storage-space occupancy. Because the decoding results obtained by hard decision after the decoding iterations end are all '0' or '1', each datum needs only 1 bit in binary representation. This step merges every 8 data into 1 byte through shift operations and passes only the '0'/'1' information back to the CPU. After compression, the storage space of the decoding result becomes 1/8 of the original. Each thread in the Bit_Packed Kernel processes 8 decoding-result data. The APP_Bit data are the final decoding result obtained by Bit compression; their volume is small, and Shared Memory storage is adopted to improve access efficiency. The specific operation of each thread in the Bit_Packed Kernel is as follows:
1. firstly, 8 adjacent decoding result data are extracted.
2. The 8 data are merged into 1 byte of data by a shift operation.
3. Store the merged data into Shared Memory.
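The per-thread merge of 8 hard-decision bits into one byte can be sketched as follows; MSB-first bit order within the byte is an assumption of this sketch, not specified by the patent.

```python
def bit_pack(bits):
    """Merge every 8 hard-decision bits (0/1 values) into 1 byte via
    shift operations; the output occupies 1/8 of the input size."""
    assert len(bits) % 8 == 0, "pad to a multiple of 8 bits first"
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | (b & 1)   # MSB-first packing (assumed order)
        out.append(byte)
    return bytes(out)
```

For example, the bit pattern `1,0,1,0,1,0,1,0` packs into the single byte `0xAA`.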
Step seven: the Bit-compressed APP_Bit data are returned from the Shared Memory area at the GPU end to the memory at the CPU end.
Meanwhile, because the invention adopts the 8-bit quantization mode and compresses the decoding result by Bit compression, the amount of data transmitted from the GPU back to the CPU is further reduced, and the transmission time becomes 1/32 of that of a 32-bit quantization scheme.
Example 4
At present, the main ways of raising the throughput of GPU-based LDPC decoders optimize only some characteristics of the decoder; because the optimization is not comprehensive, several problems arise, for example: the GPU's performance cannot be fully exploited and is partly wasted; the scheduling structure of the chosen decoding algorithm cannot be fully unrolled in parallel, limiting parallelism; the means of raising throughput is limited by GPU performance; data transmission is not optimized, hurting throughput improvement; and decoding generality is lacking, with optimization targeting only general-purpose GPU devices.
The GPU general high-throughput LDPC decoder designed by the invention is tested on a general RTX2080Ti platform, and the throughput can reach 8.67Gb/s at most when 10 times of iterative decoding are carried out. Meanwhile, in order to verify the decoding universality of the LDPC decoder, the test is carried out on a low-power-consumption embedded GPU device Jetson Xavier NX, and the throughput rate can reach 1.14Gb/s under 10 iterations. The verification flow chart of the test is shown in fig. 6.
The parallel decoding method and system provided by the invention can perform LDPC decoding of any code length and code rate while providing high throughput, and support both the normalized min-sum (NMS) and offset min-sum (OMS) decoding algorithms. The specific improvements are as follows:
(1) The invention adopts the decoding scheduling algorithm best suited to parallel processing; the decoding algorithm can be fully unrolled in parallel at design time, maximizing the decoder's intra-code parallelism, fully exploiting GPU performance, and maximizing the LDPC decoder's throughput.
(2) The invention fully optimizes the decoder's memory-allocation strategy at the GPU end, improving memory-access efficiency during decoding. For example, the check matrix H is compressed: for the 5G check matrix built from base graph BG1 with an expansion factor of 384, the storage space occupied after the compression described herein is only 0.03% of the original.
(3) The design method provided by the invention introduces the SIMD instruction set of the GPU, improving inter-code parallelism without being limited by GPU performance. The invention processes data with 8-bit fixed-point quantization, making full use of the GPU's 32-bit-wide SIMD instruction set and extending inter-block parallelism to 4 times the original.
(4) The design scheme provided by the invention optimizes the data-transmission time between the GPU and the CPU during decoding, reducing the data-transmission delay. By adopting the 8-bit quantization mode, the transmission time from the CPU to the GPU becomes 1/4 of the original; by performing the Bit compression operation on the decoding result, the transmission time from the GPU to the CPU becomes 1/8 of the original.
(5) The design scheme provided by the invention can also achieve high throughput performance on low-power-consumption embedded GPU equipment, and has good decoding universality.
The high-throughput LDPC decoding method and system based on the GPU fully utilize the memory resources in the GPU and improve the memory access efficiency during decoding; the LDPC decoding algorithm is completely and parallelly expanded, so that the parallelism in the code is maximum; the SIMD instruction set of the GPU is combined, so that the inter-code parallelism of the decoder can be further expanded without being limited by the performance of the GPU; data transmission between the GPU and the CPU is optimized, and data transmission delay is reduced, so that the throughput rate is further improved; the method has good decoding universality and can be suitable for low-power-consumption embedded GPU equipment.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principles and embodiments of the present invention have been described herein using specific examples, which serve only to help understand the method and core concept of the invention; meanwhile, a person skilled in the art may, according to the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (10)
1. A parallel decoding method, characterized in that the parallel decoding method comprises the steps of:
merging the data to be decoded in the data set to be decoded into 32-bit data by adopting an SIMD instruction set, sequencing, and storing a sequencing result in a 32-bit array; the 32-bit array uses subscript to represent the position of data to be decoded corresponding to the 32-bit data in the 32-bit array in the code word;
based on a flood scheduling decoding algorithm and a SIMD instruction set, parallel computing transmission information transmitted to different variable nodes by each check node;
based on a flood scheduling decoding algorithm and a SIMD instruction set, parallel computing transmission information transmitted from each variable node to different check nodes;
returning to the step of calculating the transmission information transmitted to different variable nodes by each check node in parallel based on the flood scheduling decoding algorithm and the SIMD instruction set until the iteration number reaches the threshold of the iteration number, and outputting the transmission information transmitted to the different variable nodes by each check node and the transmission information transmitted to the different check nodes by each variable node after the last iteration;
according to the transmission information transmitted to different variable nodes by each check node after the last iteration and the transmission information transmitted to different check nodes by each variable node, the posterior probability of each check node is calculated in parallel;
carrying out hard decision operation on the posterior probability of each variable node in parallel to obtain a decoding result;
and splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each data to be decoded.
2. The parallel decoding method according to claim 1, wherein the merging the data to be decoded in the data set to be decoded into 32-bit data by using the SIMD instruction set, and performing the sorting, wherein the sorting result is stored in a 32-bit array, specifically comprising:
grouping 8-bit data to be decoded in a data set to be decoded in a mode of grouping 4 code words; if the last group is less than 4 code words, the last group is expanded into a group in a zero filling mode;
adopting a plurality of threads to execute the following operations in parallel on the data to be decoded of each group:
extracting data at the same position in the data to be decoded in each group of 4 code words and merging the data into 32-bit data;
and storing the 32-bit data of each group of 4 code words in a 32-bit array, wherein the subscript of the 32-bit array corresponds to the position of the data to be decoded in the code words.
3. The parallel decoding method according to claim 1, wherein the parallel computing of the transfer information transferred from each check node to different variable nodes based on the Flooding scheduling decoding algorithm and the SIMD instruction set specifically comprises:
using the formula $s_{m\to n}^{(i)}=\prod_{n'\in N(m)\setminus\{n\}}\operatorname{sgn}\big(L_{n'\to m}^{(i-1)}\big)$ to calculate in parallel the sign bit value of the transfer information transferred from each check node to different variable nodes;
according to the sign bit value of the transfer information transferred from each check node to different variable nodes, using the formula $L_{m\to n}^{(i)}=s_{m\to n}^{(i)}\cdot\max\big(\min_{n'\in N(m)\setminus\{n\}}\big|L_{n'\to m}^{(i-1)}\big|-\beta,\ 0\big)$ to calculate in parallel the transfer information transferred from each check node to different variable nodes, where the minimum over $n'\neq n$ equals min1 when variable node $n$ does not attain min1, and min2 otherwise;
wherein $s_{m\to n}^{(i)}$ represents the sign bit value of the transfer information transferred from the $m$-th check node to the $n$-th variable node in the $i$-th iteration, $L_{n'\to m}^{(i-1)}$ represents the transfer information transferred from the $n'$-th variable node to the $m$-th check node in the $(i-1)$-th iteration, $N(m)$ represents the set of variable nodes connected to the $m$-th check node, $N$ represents the number of variable nodes, $L_{m\to n}^{(i)}$ represents the transfer information transferred from the $m$-th check node to the $n$-th variable node in the $i$-th iteration, min1 and min2 respectively represent the minimum and the second minimum magnitude of the transfer information transferred from the variable nodes to the check node obtained before the current iteration, and $\beta$ represents the offset parameter of the OMS algorithm.
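The offset min-sum (OMS) check-node update of claim 3 can be sketched for one check node in scalar C; the patent applies the same arithmetic to 4 packed 8-bit lanes at once. Names are illustrative, not from the patent.

```c
#include <stdlib.h>
#include <limits.h>

/* OMS check-node update for one check node of degree deg.
 * v2c[j] : variable-to-check messages from the previous iteration
 * out[j] : new check-to-variable messages
 * beta   : OMS offset parameter */
void oms_check_update(const int v2c[], int out[], int deg, int beta)
{
    int min1 = INT_MAX, min2 = INT_MAX, min_idx = -1, sign_prod = 1;
    for (int j = 0; j < deg; j++) {
        int mag = abs(v2c[j]);
        if (v2c[j] < 0) sign_prod = -sign_prod;
        if (mag < min1) { min2 = min1; min1 = mag; min_idx = j; }
        else if (mag < min2) min2 = mag;
    }
    for (int j = 0; j < deg; j++) {
        /* exclude edge j itself: where j attains min1, use min2 */
        int mag = (j == min_idx) ? min2 : min1;
        mag -= beta; if (mag < 0) mag = 0;          /* offset, clamped at 0 */
        int s = sign_prod * (v2c[j] < 0 ? -1 : 1);  /* sign excluding edge j */
        out[j] = s * mag;
    }
}
```

Tracking only min1, min2, and a running sign product is what makes the per-edge exclusion cheap, which is why the claim defines those two quantities.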
4. The parallel decoding method according to claim 1, wherein the parallel computing of the transfer information transferred from each check node to different variable nodes based on the Flooding scheduling decoding algorithm and the SIMD instruction set specifically comprises:
using the formula $s_{m\to n}^{(i)}=\prod_{n'\in N(m)\setminus\{n\}}\operatorname{sgn}\big(L_{n'\to m}^{(i-1)}\big)$ to calculate in parallel the sign bit value of the transfer information transferred from each check node to different variable nodes;
according to the sign bit value of the transfer information transferred from each check node to different variable nodes, using the formula $L_{m\to n}^{(i)}=\alpha\cdot s_{m\to n}^{(i)}\cdot\min_{n'\in N(m)\setminus\{n\}}\big|L_{n'\to m}^{(i-1)}\big|$ to calculate in parallel the transfer information transferred from each check node to different variable nodes, where the minimum over $n'\neq n$ equals min1 when variable node $n$ does not attain min1, and min2 otherwise;
wherein $s_{m\to n}^{(i)}$ represents the sign bit value of the transfer information transferred from the $m$-th check node to the $n$-th variable node in the $i$-th iteration, $L_{n'\to m}^{(i-1)}$ represents the transfer information transferred from the $n'$-th variable node to the $m$-th check node in the $(i-1)$-th iteration, $N$ represents the number of columns of the check matrix, min1 and min2 respectively represent the minimum and the second minimum magnitude of the transfer information transferred from the variable nodes to the check node obtained before the current iteration, $\alpha$ represents the normalization parameter of the NMS algorithm, and $L_{m\to n}^{(i)}$ represents the transfer information transferred from the $m$-th check node to the $n$-th variable node in the $i$-th iteration.
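The normalized min-sum (NMS) variant of claim 4 differs from OMS only in scaling the outgoing magnitude by α instead of subtracting an offset. A scalar sketch with a fixed-point α = alpha_num/alpha_den keeps it integer-only; names and the fixed-point representation are assumptions, not from the patent.

```c
#include <stdlib.h>
#include <limits.h>

/* NMS check-node update for one check node of degree deg.
 * alpha = alpha_num / alpha_den, e.g. 3/4 for alpha = 0.75. */
void nms_check_update(const int v2c[], int out[], int deg,
                      int alpha_num, int alpha_den)
{
    int min1 = INT_MAX, min2 = INT_MAX, min_idx = -1, sign_prod = 1;
    for (int j = 0; j < deg; j++) {
        int mag = abs(v2c[j]);
        if (v2c[j] < 0) sign_prod = -sign_prod;
        if (mag < min1) { min2 = min1; min1 = mag; min_idx = j; }
        else if (mag < min2) min2 = mag;
    }
    for (int j = 0; j < deg; j++) {
        int mag = (j == min_idx) ? min2 : min1;    /* exclude edge j */
        int s = sign_prod * (v2c[j] < 0 ? -1 : 1); /* sign excluding edge j */
        out[j] = s * (mag * alpha_num / alpha_den); /* alpha scaling */
    }
}
```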
5. The parallel decoding method according to claim 1, wherein the parallel computing of the transfer information transferred from each variable node to different check nodes based on the Flooding scheduling decoding algorithm and the SIMD instruction set specifically comprises:
using the formula $S_n^{(i)}=L_n^{(0)}+\sum_{m\in M(n)}L_{m\to n}^{(i)}$ to calculate in parallel the sum of the transfer information transferred to each variable node by the different check nodes;
according to the sum of the transfer information transferred from the different check nodes to each variable node, using the formula $L_{n\to m}^{(i)}=S_n^{(i)}-L_{m\to n}^{(i)}$ to calculate in parallel the transfer information transferred from each variable node to different check nodes;
wherein $S_n^{(i)}$ represents the sum of the transfer information transferred by the different check nodes to the $n$-th variable node in the $i$-th iteration, $L_{m\to n}^{(i)}$ represents the transfer information transferred from the $m$-th check node to the $n$-th variable node, $M$ represents the number of rows of the check matrix, $M(n)$ represents the set of check nodes connected to the $n$-th variable node, $L_{n\to m}^{(i)}$ represents the transfer information transferred from the $n$-th variable node to the $m$-th check node in the $i$-th iteration, and $L_n^{(0)}$ represents the initial data information of the $n$-th variable node.
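The variable-node update of claim 5 can be sketched in scalar C: the total sum is formed once, and the message back to each check node subtracts that node's own contribution, yielding the extrinsic value. Names are illustrative.

```c
/* Variable-node update for one variable node of degree deg.
 * llr_ch     : initial (channel) information of the variable node
 * c2v[j]     : incoming check-to-variable messages
 * out_v2c[j] : outgoing variable-to-check messages */
void vn_update(int llr_ch, const int c2v[], int out_v2c[], int deg)
{
    int total = llr_ch;                       /* S_n: channel + all inputs */
    for (int j = 0; j < deg; j++)
        total += c2v[j];
    for (int j = 0; j < deg; j++)
        out_v2c[j] = total - c2v[j];          /* exclude edge m's own input */
}
```

Computing the full sum once and subtracting per edge costs O(deg) instead of the O(deg²) of re-summing for every outgoing message, which is why the claim splits the step into "sum" and "subtract" formulas.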
6. The parallel decoding method according to claim 1, wherein the parallel computing of the posterior probability of each variable node according to the transfer information transferred from each check node to different variable nodes and from each variable node to different check nodes after the last iteration specifically comprises:
according to the transfer information transferred from each check node to different variable nodes and from each variable node to different check nodes after the last iteration, using the formula $P_n=L_{n\to m}^{(\max)}+L_{m\to n}^{(\max)}$ to calculate the posterior probability of each variable node in parallel;
wherein $P_n$ represents the posterior probability of the $n$-th variable node, $L_{n\to m}^{(\max)}$ represents the transfer information transferred from the $n$-th variable node to the $m$-th check node after the last iteration, $L_{m\to n}^{(\max)}$ represents the transfer information transferred from the $m$-th check node to the $n$-th variable node after the last iteration, and max represents the threshold of the number of iterations.
7. The parallel decoding method according to claim 1, wherein the parallel hard decision operation is performed on the a posteriori probability of each variable node to obtain a decoding result, and specifically comprises:
using the formula $\hat{x}_n=0$ if $P_n\ge 0$ and $\hat{x}_n=1$ if $P_n<0$, carrying out the hard decision operation on the posterior probability of each variable node in parallel to obtain the decoding result, wherein $\hat{x}_n$ represents the decoded bit of the $n$-th variable node;
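Claims 6 and 7 reduce, per variable node, to a sum followed by a sign test. A scalar C sketch, under the common convention that a non-negative posterior LLR decodes to bit 0 (an assumption; the patent's exact sign convention is not reproduced here):

```c
/* Posterior probability and hard decision for one variable node.
 * llr_ch : initial (channel) information of the variable node
 * c2v[]  : check-to-variable messages after the last iteration
 * Returns the decoded bit (0 or 1). */
int posterior_hard_decision(int llr_ch, const int c2v[], int deg)
{
    int p = llr_ch;                 /* posterior = channel + all inputs */
    for (int j = 0; j < deg; j++)
        p += c2v[j];
    return p >= 0 ? 0 : 1;          /* hard decision on the sign */
}
```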
8. The parallel decoding method according to claim 1, wherein the splitting and reordering of the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each data to be decoded, and thereafter further comprising:
and carrying out bit compression on the decoding result corresponding to each data to be decoded, and storing the compressed decoding result into a Shared memory.
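The bit compression of claim 8 packs the one-byte-per-bit hard-decision output into 32-bit words, shrinking the result 8x before it is staged in shared memory and copied back to the host. A host-side C sketch with illustrative names:

```c
#include <stdint.h>
#include <stddef.h>

/* Pack one decoded bit per input byte into 32-bit words, LSB first.
 * bits[i] holds 0 or 1; words must have room for (n + 31) / 32 entries. */
void compress_bits(const uint8_t *bits, size_t n, uint32_t *words)
{
    for (size_t i = 0; i < n; i++) {
        if (i % 32 == 0)
            words[i / 32] = 0;        /* clear each word before filling */
        words[i / 32] |= (uint32_t)(bits[i] & 1) << (i % 32);
    }
}
```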
9. A parallel decoding system, comprising a GPU, the GPU comprising:
an Ordering Kernel module, wherein the Ordering Kernel module comprises code_number/4 × N threads, and each thread in the Ordering Kernel module executes in parallel the step of merging the data to be decoded in the data set to be decoded into 32-bit data by using the SIMD instruction set and sorting them, the sorting result being stored in a 32-bit array; the subscript of the 32-bit array represents the position, within the code word, of the data to be decoded corresponding to the 32-bit data; code_number represents the number of code words in the data set to be decoded, and N represents the number of columns of the check matrix;
a CN_computer Kernel module, wherein the CN_computer Kernel module comprises code_number/4 × M threads, and each thread in the CN_computer Kernel module executes in parallel the step of calculating the transfer information transferred from each check node to different variable nodes, wherein M represents the number of rows of the check matrix;
a VN_computer Kernel module, wherein the VN_computer Kernel module comprises code_number/4 × N threads, and each thread in the VN_computer Kernel module executes in parallel the step of calculating the transfer information transferred from each variable node to different check nodes;
each thread in the VN_computer Kernel module also executes in parallel, when the number of iterations reaches the iteration-number threshold, the step of calculating the posterior probability of each variable node according to the transfer information transferred from each check node to different variable nodes and from each variable node to different check nodes after the last iteration;
a Hard_decision Kernel module, wherein the Hard_decision Kernel module comprises code_number/4 × N threads, and each thread in the Hard_decision Kernel module executes in parallel the hard decision operation on the posterior probability of each variable node to obtain the decoding result;
a Reordering Kernel module, wherein the Reordering Kernel module comprises code_number/8 × N threads, and each thread in the Reordering Kernel module executes in parallel the step of splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each data to be decoded.
10. The parallel decoding system of claim 9, wherein the parallel decoding system comprises a CPU;
the CPU is used for carrying out a quantization operation on the data to be decoded in the data set to be decoded and a compression operation on the check matrix, and for storing the quantized data to be decoded and the compressed check matrix information into a page-locked memory area.
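The CPU-side quantization of claim 10 maps a soft channel value to an 8-bit integer before upload. A C sketch with a uniform quantizer; the scale factor and the int8 saturation bounds are assumptions, not taken from the patent:

```c
#include <stdint.h>

/* Uniformly quantize a soft value (e.g. a channel LLR) to int8.
 * scale sets the quantization step (1/scale); the result saturates
 * at the int8 range so large magnitudes do not wrap around. */
int8_t quantize_llr(float llr, float scale)
{
    float q = llr * scale;
    q += (q >= 0.0f) ? 0.5f : -0.5f;   /* round to nearest */
    if (q > 127.0f)  q = 127.0f;       /* saturate, do not wrap */
    if (q < -128.0f) q = -128.0f;
    return (int8_t)q;
}
```

Quantizing on the CPU keeps the device-side buffers at 1 byte per value, which is what makes the 4-codewords-per-32-bit-word packing of claim 2 possible.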
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111306835.6A CN114006621A (en) | 2021-11-05 | 2021-11-05 | Parallel decoding method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114006621A true CN114006621A (en) | 2022-02-01 |
Family
ID=79927913
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111306835.6A Pending CN114006621A (en) | 2021-11-05 | 2021-11-05 | Parallel decoding method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114006621A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013117076A1 (en) * | 2012-02-07 | 2013-08-15 | ZTE Corporation | Method and system for iterative decoding
US20170257388A1 (en) * | 2016-01-06 | 2017-09-07 | New York University | System, method and computer-accessible medium for network intrusion detection |
CN108183713A (en) * | 2017-12-15 | 2018-06-19 | 南京大学 | Ldpc decoder and its interpretation method based on modified minimum-sum algorithm |
Non-Patent Citations (3)
Title |
---|
LIU Youyao: "Design of an instruction-level parallel structure based on the SIMD architecture", Electronic Design Engineering, no. 2017, 8 December 2017 (2017-12-08), pages 152 - 156 *
ZHOU Jian; LYU Yibo; HONG Shaohua; WANG Lin: "FPGA design of a protograph LDPC code decoder for magnetic recording channels", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), no. 06, 15 December 2013 (2013-12-15) *
XIA Gao; LIU Bin: "A parallel TCP/IP protocol stack for high-speed network intrusion detection systems", Journal of Tsinghua University (Science and Technology), no. 07, 15 July 2011 (2011-07-15) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109379086B (en) | Low-complexity code rate compatible 5G LDPC coding method and encoder | |
JP4320418B2 (en) | Decoding device and receiving device | |
CN102545913B (en) | Iterative decoding method and iterative decoding system | |
Yuan et al. | Low-latency successive-cancellation list decoders for polar codes with multibit decision | |
KR100846869B1 (en) | Apparatus for Decoding LDPC with Low Computational Complexity Algorithms and Method Thereof | |
US7373581B2 (en) | Device, program, and method for decoding LDPC codes | |
KR101203340B1 (en) | Turbo ldpc decoding | |
Murugappa et al. | A flexible high throughput multi-ASIP architecture for LDPC and turbo decoding | |
CN1898874A (en) | Siso decoder with sub-block processing and sub-block based stopping criterion | |
CN110233628B (en) | Self-adaptive belief propagation list decoding method for polarization code | |
CN111105007A (en) | Compression acceleration method of deep convolutional neural network for target detection | |
US20110179337A1 (en) | Memory utilization method for low density parity check code, low density parity check code decoding method and decoding apparatus thereof | |
CN104092470A (en) | Turbo code coding device and method | |
CN101594152B (en) | LDPC code decoding method for realizing simultaneous operation of horizontal operation and vertical operation | |
CN107872231B (en) | LDPC decoding method and device | |
CN114006621A (en) | Parallel decoding method and system | |
CN116707546A (en) | Hardware implementation method and device for quasi-cyclic LDPC decoding | |
CN113381769B (en) | Decoder based on FPGA | |
CN116192157A (en) | Implementation method for reducing QC-LDPC code generation matrix density | |
CN115694513A (en) | Ultra-high throughput rate LDPC decoder based on shift-type base graph | |
CN108988873B (en) | Polar code processing method, decoder and terminal | |
CN1540871A (en) | LDPC iteration encoding method based on improved Tanner graph | |
CN112104379B (en) | Polarization code confidence propagation dynamic overturning decoding method based on key set | |
CN111431543B (en) | Variable code length and variable code rate QC-LDPC decoding method and device | |
KR20090065411A (en) | Apparatus and method for preprocessing used in decoding with group unit |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||