CN114006621A - Parallel decoding method and system - Google Patents

Parallel decoding method and system

Info

Publication number
CN114006621A
Authority
CN
China
Prior art keywords
node
data
check
parallel
decoded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111306835.6A
Other languages
Chinese (zh)
Inventor
尹航
戴景鑫
杨占昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China filed Critical Communication University of China
Priority to CN202111306835.6A priority Critical patent/CN114006621A/en
Publication of CN114006621A publication Critical patent/CN114006621A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/11Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits using multiple parity bits
    • H03M13/1102Codes on graphs and decoding on graphs, e.g. low-density parity check [LDPC] codes
    • H03M13/1105Decoding
    • H03M13/1131Scheduling of bit node or check node processing
    • H03M13/1134Full parallel processing, i.e. all bit nodes or check nodes are processed in parallel
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/11Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits using multiple parity bits
    • H03M13/1102Codes on graphs and decoding on graphs, e.g. low-density parity check [LDPC] codes
    • H03M13/1105Decoding
    • H03M13/1108Hard decision decoding, e.g. bit flipping, modified or weighted bit flipping

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Error Detection And Correction (AREA)

Abstract

The invention relates to a parallel decoding method and system. The parallel decoding method introduces a SIMD instruction set to merge the data to be decoded into 32-bit data, which improves the universality and parallel scalability of the method, and performs the parallel decoding operation on the basis of a flooding-schedule decoding algorithm, which further improves the parallel scalability and thereby the throughput of the parallel decoding method.

Description

Parallel decoding method and system
Technical Field
The present invention relates to the field of channel coding technology, and in particular, to a parallel decoding method and system.
Background
An LDPC code is a linear block code that approaches the Shannon limit. Because of this excellent performance, LDPC codes have been widely adopted in digital communication systems, for example 5G, WiMAX (802.16e), WiFi (802.11n) and DVB-S2. However, as communication systems demand ever higher data-rate transmission, how to design a high-throughput LDPC decoder efficiently has become a research hotspot.
High-throughput LDPC decoders are usually designed with field-programmable gate arrays (FPGA) and application-specific integrated circuits (ASIC), but FPGA- and ASIC-based LDPC decoders offer poor flexibility and high design cost compared with software-radio designs built on a central processing unit (CPU) or a graphics processing unit (GPU). The parallel processing capability of GPU devices is far superior to that of CPU devices, because a GPU provides a parallel processing platform that combines computational power with programmability and boosts parallel computing power by executing thousands of threads. GPUs are widely used in the field of high-performance computing (HPC), and a computationally intensive task such as LDPC decoding is well suited to processing on a GPU.
Implementing an LDPC decoder on a GPU is currently an active research field, but most existing work optimizes only some features of the LDPC decoder, so the throughput improvement is limited. For example, the patent "A QC-LDPC code accelerated decoding method based on a GPU architecture" optimizes the memory allocation strategy of a GPU-based QC-LDPC decoder, and the patent "An LDPC code parallel decoding method based on a CUDA architecture in an AWGN channel" optimizes the memory allocation strategy of a general GPU LDPC decoder for the AWGN channel. The optimization in both patents focuses on memory access at the GPU side and does not improve the parallelism or the data transmission of the decoder. The papers (J. Ling and P. Cautereels, "Fast LDPC GPU Decoder for Cloud RAN," IEEE Embedded Systems Letters, doi:10.1109/LES.2021.3052714) and (B. Le Gal, C. Jego and J. Crenne, "A High Throughput Efficient Approach for Decoding LDPC Codes onto GPU Devices," IEEE Embedded Systems Letters, vol. 6, no. 2, pp. 29-32, June 2014) combine 8-bit fixed-point quantization with the single-instruction multiple-data (SIMD) instruction set of the GPU to decode multiple sets of data simultaneously, thereby extending the decoding parallelism. However, both designs adopt a layered-schedule decoding algorithm, so the decoders cannot be fully expanded in parallel and their intra-code parallelism is limited.
After analyzing the GPU-based LDPC decoders proposed so far, it is found that decoders from current research work, when used in an actual communication system, have the following problems:
1. The existing GPU-based decoder designs are optimized only from the perspective of the GPU memory allocation scheme or of the decoding parallelism, so the optimization of the decoder is insufficient, the performance of the GPU cannot be fully exploited, and much of that performance is wasted.
2. Although some existing GPU-based decoder designs parallelize the LDPC decoder, most of them adopt a layered-schedule decoding algorithm, so the decoding algorithm cannot be fully expanded in parallel, the intra-code parallelism cannot reach its maximum, and the throughput improvement is limited.
3. Most existing GPU-based decoder designs improve only the inter-code parallelism by increasing the number of code blocks decoded simultaneously, and the throughput improvement obtained in this way is limited by the performance of the GPU.
4. The existing GPU-based decoder designs all target general-purpose GPU devices and lack decoding universality.
Disclosure of Invention
The invention aims to provide a parallel decoding method and a parallel decoding system, which further improve the throughput rate of LDPC decoding and improve the decoding universality.
In order to achieve the purpose, the invention provides the following scheme:
a parallel decoding method, the parallel decoding method comprising the steps of:
merging the data to be decoded in the data set to be decoded into 32-bit data by adopting a SIMD instruction set, sorting them, and storing the sorting result in a 32-bit array; the subscript of the 32-bit array indicates the position, within the codeword, of the data to be decoded corresponding to the 32-bit data at that subscript;
based on a flood scheduling decoding algorithm and a SIMD instruction set, parallel computing transmission information transmitted to different variable nodes by each check node;
based on a flood scheduling decoding algorithm and a SIMD instruction set, parallel computing transmission information transmitted from each variable node to different check nodes;
returning to the step of calculating the transmission information transmitted to different variable nodes by each check node in parallel based on the flood scheduling decoding algorithm and the SIMD instruction set until the iteration number reaches the threshold of the iteration number, and outputting the transmission information transmitted to the different variable nodes by each check node and the transmission information transmitted to the different check nodes by each variable node after the last iteration;
according to the transfer information passed from each check node to the different variable nodes and the transfer information passed from each variable node to the different check nodes after the last iteration, computing the posterior probability of each variable node in parallel;
carrying out hard decision operation on the posterior probability of each variable node in parallel to obtain a decoding result;
and splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each data to be decoded.
Optionally, the data to be decoded in the data set to be decoded is merged into 32-bit data by using the SIMD instruction set, and is sorted, and the sorting result is stored in the 32-bit array, which specifically includes:
grouping the 8-bit data to be decoded in the data set to be decoded into groups of 4 codewords; if the last group contains fewer than 4 codewords, padding it with zeros to form a complete group;
adopting a plurality of threads to execute the following operations in parallel on the data to be decoded of each group:
extracting data at the same position in the data to be decoded in each group of 4 code words and merging the data into 32-bit data;
and storing the 32-bit data of each group of 4 code words in a 32-bit array, wherein the subscript of the 32-bit array corresponds to the position of the data to be decoded in the code words.
Optionally, the parallel computation, based on the flooding-schedule decoding algorithm and the SIMD instruction set, of the transfer information passed from each check node to the different variable nodes specifically includes:
computing in parallel, using the formula
\mathrm{sign}_{c2v}^{(i)} = \prod_{n' \in N(m) \setminus n} \mathrm{sign}\big(V_{n'm}^{(i-1)}\big),
the sign bit value of the transfer information passed from each check node to the different variable nodes;
computing in parallel, according to the sign bit value of the transfer information passed from each check node to the nth variable node and using the formula
C_{mn}^{(i)} = \mathrm{sign}_{c2v}^{(i)} \cdot \max\Big( \min_{n' \in N(m) \setminus n} \big| V_{n'm}^{(i-1)} \big| - \beta,\; 0 \Big),
the transfer information passed from each check node to the different variable nodes, where the minimum over N(m) \setminus n equals min1 when the nth variable node does not itself provide the minimum and min2 otherwise;
wherein \mathrm{sign}_{c2v}^{(i)} denotes the sign bit value of the transfer information passed from the mth check node to the nth variable node in the ith iteration, V_{n'm}^{(i-1)} denotes the transfer information passed from the n'th variable node to the mth check node in the (i-1)th iteration, N(m) denotes the set of variable nodes connected to the mth check node, N denotes the number of variable nodes, C_{mn}^{(i)} denotes the transfer information passed from the mth check node to the nth variable node, min1 and min2 denote respectively the minimum and the second minimum of the magnitudes of the transfer information passed from the variable nodes to the check node obtained before the current update, and β denotes the offset parameter of the OMS algorithm.
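For illustration only, the two check-node formulas above can be sketched in scalar form as follows; the function and variable names are assumptions introduced here, the code can also be compiled as CUDA device code, and it is a minimal sketch of an OMS-style update rather than the claimed implementation.

```cuda
#include <cstdint>
#include <cstdlib>
#include <algorithm>

// OMS update for one check node m: v2c[k] holds V_{n'm}^{(i-1)} for the k-th
// connected variable node, c2v[k] receives C_{mn}^{(i)}, beta is the OMS offset.
void oms_check_node(const int8_t* v2c, int8_t* c2v, int degree, int8_t beta) {
    int sign_all = 1;                 // product of all incoming signs
    int min1 = 127, min2 = 127;       // smallest and second smallest magnitude
    int min1_pos = -1;
    for (int k = 0; k < degree; ++k) {
        int mag = std::abs((int)v2c[k]);
        if (v2c[k] < 0) sign_all = -sign_all;
        if (mag < min1) { min2 = min1; min1 = mag; min1_pos = k; }
        else if (mag < min2) { min2 = mag; }
    }
    for (int k = 0; k < degree; ++k) {
        int mag = (k == min1_pos) ? min2 : min1;          // exclude own magnitude
        mag = std::max(mag - (int)beta, 0);               // OMS offset
        int sign = (v2c[k] < 0) ? -sign_all : sign_all;   // divide out own sign
        c2v[k] = (int8_t)(sign * mag);
    }
}
```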
Optionally, when the NMS algorithm is adopted, the parallel computation, based on the flooding-schedule decoding algorithm and the SIMD instruction set, of the transfer information passed from each check node to the different variable nodes specifically includes:
computing in parallel, using the formula
\mathrm{sign}_{c2v}^{(i)} = \prod_{n' \in N(m) \setminus n} \mathrm{sign}\big(V_{n'm}^{(i-1)}\big),
the sign bit value of the transfer information passed from each check node to the different variable nodes;
computing in parallel, according to the sign bit value of the transfer information passed from each check node to the different variable nodes and using the formula
C_{mn}^{(i)} = \alpha \cdot \mathrm{sign}_{c2v}^{(i)} \cdot \min_{n' \in N(m) \setminus n} \big| V_{n'm}^{(i-1)} \big|,
the transfer information passed from each check node to the different variable nodes, where the minimum over N(m) \setminus n again equals min1 or min2 as above;
wherein \mathrm{sign}_{c2v}^{(i)} denotes the sign bit value of the transfer information passed from the mth check node to the nth variable node in the ith iteration, V_{n'm}^{(i-1)} denotes the transfer information passed from the n'th variable node to the mth check node in the (i-1)th iteration, N denotes the number of columns of the check matrix, min1 and min2 denote respectively the minimum and the second minimum of the magnitudes of the transfer information passed from the variable nodes to the check node obtained before the current update, and α denotes the normalization parameter of the NMS algorithm.
Optionally, the parallel computation, based on the flooding-schedule decoding algorithm and the SIMD instruction set, of the transfer information passed from each variable node to the different check nodes specifically includes:
computing in parallel, using the formula
\mathrm{sum}_n^{(i)} = \sum_{m'=1}^{M} C_{m'n}^{(i)},
the sum of the transfer information passed from the different check nodes to each variable node;
computing in parallel, according to the sum of the transfer information passed from the different check nodes to each variable node and using the formula
V_{nm}^{(i)} = L_n + \mathrm{sum}_n^{(i)} - C_{mn}^{(i)},
the transfer information passed from each variable node to the different check nodes;
wherein \mathrm{sum}_n^{(i)} denotes the sum of the transfer information passed by the different check nodes to the nth variable node in the ith iteration, C_{m'n}^{(i)} denotes the transfer information passed from the m'th check node to the nth variable node, M denotes the number of rows of the check matrix, V_{nm}^{(i)} denotes the transfer information passed from the nth variable node to the mth check node in the ith iteration, and L_n denotes the initial data information of the nth variable node.
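A corresponding minimal sketch of the variable-node formulas, again with names introduced here purely for illustration: the sum of the incoming C2V messages is accumulated once, and each outgoing V2C message is obtained by adding the channel value L_n and subtracting the message received on that edge.

```cuda
#include <cstdint>
#include <algorithm>

// VN update for one variable node n: c2v[k] holds C_{mn}^{(i)} from the k-th
// connected check node, llr is the initial channel value L_n, v2c[k] receives
// V_{nm}^{(i)}, and the return value is the a-posteriori value L_n + sum.
int vn_update(const int8_t* c2v, int8_t* v2c, int degree, int8_t llr) {
    int sum = 0;
    for (int k = 0; k < degree; ++k) sum += c2v[k];        // sum of C2V messages
    for (int k = 0; k < degree; ++k) {
        int v = (int)llr + sum - (int)c2v[k];              // exclude own C2V message
        v2c[k] = (int8_t)std::max(-128, std::min(127, v)); // clamp to 8 bits
    }
    return (int)llr + sum;                                 // a-posteriori value
}
```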
Optionally, the parallel computation of the posterior probability of each variable node according to the transfer information passed from each check node to the different variable nodes and the transfer information passed from each variable node to the different check nodes after the last iteration specifically includes:
computing in parallel, according to the transfer information passed from each check node to the different variable nodes and the transfer information passed from each variable node to the different check nodes after the last iteration and using the formula
L_n^{APP} = V_{nm}^{(\max)} + C_{mn}^{(\max)},
the posterior probability of each variable node;
wherein L_n^{APP} denotes the posterior probability of the nth variable node, V_{nm}^{(\max)} denotes the transfer information passed from the nth variable node to the mth check node after the last iteration, C_{mn}^{(\max)} denotes the transfer information passed from that same mth check node to the nth variable node after the last iteration, and max denotes the iteration-number threshold.
Optionally, the parallel hard-decision operation on the posterior probability of each variable node to obtain the decoding result specifically includes:
performing in parallel, using the formula
E_n = \begin{cases} 1, & L_n^{APP} > 0 \\ 0, & L_n^{APP} \le 0 \end{cases},
the hard-decision operation on the posterior probability of each variable node to obtain the decoding result;
wherein L_n^{APP} denotes the posterior probability of the nth variable node and E_n denotes the decoding result of the nth variable node.
Optionally, after splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each piece of data to be decoded, the method further includes:
performing bit compression on the decoding result corresponding to each piece of data to be decoded, and storing the compressed decoding result into Shared Memory.
A parallel decoding system, the parallel decoding system comprising a GPU, the GPU comprising:
an Ordering Kernel module, the Ordering Kernel module comprising code_number/4 × N threads, each thread in the Ordering Kernel module executing in parallel the step of merging the data to be decoded in the data set to be decoded into 32-bit data by adopting a SIMD instruction set, sorting them, and storing the sorting result in a 32-bit array; the subscript of the 32-bit array indicates the position, within the codeword, of the data to be decoded corresponding to the 32-bit data at that subscript; code_number denotes the number of codewords in the data set to be decoded, and N denotes the number of columns of the check matrix;
a CN_Computer Kernel module, the CN_Computer Kernel module comprising code_number/4 × M threads, each thread in the CN_Computer Kernel module executing in parallel the step of computing the transfer information passed from each check node to the different variable nodes;
a VN_Computer Kernel module, the VN_Computer Kernel module comprising code_number/4 × N threads, each thread in the VN_Computer Kernel module executing in parallel the step of computing the transfer information passed from each variable node to the different check nodes;
each thread in the VN_Computer Kernel module also executing in parallel, when the number of iterations reaches the iteration-number threshold, the computation of the posterior probability of each variable node according to the transfer information passed from each check node to the different variable nodes and the transfer information passed from each variable node to the different check nodes after the last iteration;
a Hard_decision Kernel module, the Hard_decision Kernel module comprising code_number/4 × N threads, each thread in the Hard_decision Kernel module executing in parallel the hard-decision operation on the posterior probability of each variable node to obtain the decoding result;
a Reordering Kernel module, the Reordering Kernel module comprising code_number/8 × N threads, the Reordering Kernel module executing in parallel the step of splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each piece of data to be decoded.
Optionally, the parallel decoding system includes a CPU;
the CPU is used for carrying out quantization operation on the data to be decoded in the data set to be decoded and carrying out compression operation on the check matrix, and storing the quantized data to be decoded and the check matrix information into the page-locking memory chip area.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention discloses a parallel decoding method, which comprises the following steps: merging the data to be decoded in the data set to be decoded into 32-bit data by adopting an SIMD instruction set, sequencing, and storing a sequencing result in a 32-bit array; based on a flood scheduling decoding algorithm and a SIMD instruction set, parallel computing transmission information transmitted to different variable nodes by each check node; based on a flood scheduling decoding algorithm and a SIMD instruction set, parallel computing transmission information transmitted from each variable node to different check nodes; according to the transmission information transmitted to different variable nodes by each check node after the last iteration and the transmission information transmitted to different check nodes by each variable node, the posterior probability of each check node is calculated in parallel; carrying out hard decision operation on the posterior probability of each variable node in parallel to obtain a decoding result; and splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each data to be decoded. The method introduces the SIMD instruction set to combine the data to be decoded into 32-bit data, improves the universality and the parallel expansibility of the parallel decoding method, performs parallel decoding operation based on the flood scheduling decoding algorithm, improves the parallel expansibility, and further improves the throughput rate of the parallel decoding method.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
Fig. 1 is a flowchart of a decoding operation performed based on a parallel decoding system according to embodiment 3 of the present invention;
fig. 2 is a schematic diagram illustrating application distribution to a GPU side memory according to embodiment 3 of the present invention;
fig. 3 is a flowchart illustrating that the GPU terminal decodes 4 codeword data according to embodiment 3 of the present invention;
fig. 4 is a flowchart of a compression operation performed on a check matrix by a CPU according to embodiment 3 of the present invention;
fig. 5 is a flowchart illustrating the conversion and sorting operations of the data to be decoded performed by the Ordering Kernel module at the GPU terminal according to embodiment 3 of the present invention;
fig. 6 is a flowchart for verifying the technical effect of the parallel decoding method and system according to embodiment 4 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a parallel decoding method and a parallel decoding system, which further improve the throughput rate of LDPC decoding and improve the decoding universality.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1
The invention provides a parallel decoding method, which comprises the following steps:
merging the data to be decoded in the data set to be decoded into 32-bit data by adopting an SIMD instruction set, sequencing, and storing a sequencing result in a 32-bit array; and the 32-bit array uses subscript to represent the position of the data to be decoded corresponding to the 32-bit data in the 32-bit array in the code word.
The method comprises the following steps of combining data to be decoded in a data set to be decoded into 32-bit data by adopting a SIMD instruction set, sequencing the data, and storing a sequencing result in a 32-bit array, and specifically comprises the following steps: grouping 8-bit data to be decoded in a data set to be decoded in a mode of grouping 4 code words; if the last group is less than 4 code words, the last group is expanded into a group in a zero filling mode; adopting a plurality of threads to execute the following operations in parallel on the data to be decoded of each group: extracting data at the same position in the data to be decoded in each group of 4 code words and merging the data into 32-bit data; and storing the 32-bit data of each group of 4 code words in a 32-bit array, wherein the subscript of the 32-bit array corresponds to the position of the data to be decoded in the code words.
And based on the Flooding scheduling decoding algorithm and the SIMD instruction set, the transmission information transmitted to different variable nodes by each check node is calculated in parallel.
The parallel computation, based on the flooding-schedule decoding algorithm and the SIMD instruction set, of the transfer information passed from each check node to the different variable nodes specifically includes: computing in parallel, using the formula
\mathrm{sign}_{c2v}^{(i)} = \prod_{n' \in N(m) \setminus n} \mathrm{sign}\big(V_{n'm}^{(i-1)}\big),
the sign bit value of the transfer information passed from each check node to the different variable nodes; then computing in parallel, according to this sign bit value and using the formula
C_{mn}^{(i)} = \mathrm{sign}_{c2v}^{(i)} \cdot \max\Big( \min_{n' \in N(m) \setminus n} \big| V_{n'm}^{(i-1)} \big| - \beta,\; 0 \Big),
the transfer information passed from each check node to the different variable nodes, where the minimum over N(m) \setminus n equals min1 when the nth variable node does not itself provide the minimum and min2 otherwise. Here \mathrm{sign}_{c2v}^{(i)} denotes the sign bit value of the transfer information passed from the mth check node to the nth variable node in the ith iteration, V_{n'm}^{(i-1)} denotes the transfer information passed from the n'th variable node to the mth check node in the (i-1)th iteration, N(m) denotes the set of variable nodes connected to the mth check node, N denotes the number of variable nodes, C_{mn}^{(i)} denotes the transfer information passed from the mth check node to the nth variable node, min1 and min2 denote respectively the minimum and the second minimum of the magnitudes of the transfer information passed from the variable nodes to the check node obtained before the current update, and β denotes the offset parameter of the OMS algorithm.
Alternatively, when the NMS algorithm is adopted, the parallel computation, based on the flooding-schedule decoding algorithm and the SIMD instruction set, of the transfer information passed from each check node to the different variable nodes specifically includes: computing in parallel, using the formula
\mathrm{sign}_{c2v}^{(i)} = \prod_{n' \in N(m) \setminus n} \mathrm{sign}\big(V_{n'm}^{(i-1)}\big),
the sign bit value of the transfer information passed from each check node to the different variable nodes; then computing in parallel, according to this sign bit value and using the formula
C_{mn}^{(i)} = \alpha \cdot \mathrm{sign}_{c2v}^{(i)} \cdot \min_{n' \in N(m) \setminus n} \big| V_{n'm}^{(i-1)} \big|,
the transfer information passed from each check node to the different variable nodes. Here \mathrm{sign}_{c2v}^{(i)} denotes the sign bit value of the transfer information passed from the mth check node to the nth variable node in the ith iteration, V_{n'm}^{(i-1)} denotes the transfer information passed from the n'th variable node to the mth check node in the (i-1)th iteration, N denotes the number of columns of the check matrix, min1 and min2 denote respectively the minimum and the second minimum of the magnitudes of the transfer information passed from the variable nodes to the check node obtained before the current update, and α denotes the normalization parameter of the NMS algorithm.
And based on the Flooding scheduling decoding algorithm and the SIMD instruction set, the transmission information transmitted to different check nodes by each variable node is calculated in parallel.
The parallel computation, based on the flooding-schedule decoding algorithm and the SIMD instruction set, of the transfer information passed from each variable node to the different check nodes specifically includes: computing in parallel, using the formula
\mathrm{sum}_n^{(i)} = \sum_{m'=1}^{M} C_{m'n}^{(i)},
the sum of the transfer information passed from the different check nodes to each variable node; then computing in parallel, according to this sum and using the formula
V_{nm}^{(i)} = L_n + \mathrm{sum}_n^{(i)} - C_{mn}^{(i)},
the transfer information passed from each variable node to the different check nodes. Here \mathrm{sum}_n^{(i)} denotes the sum of the transfer information passed by the different check nodes to the nth variable node in the ith iteration, C_{m'n}^{(i)} denotes the transfer information passed from the m'th check node to the nth variable node, M denotes the number of rows of the check matrix, V_{nm}^{(i)} denotes the transfer information passed from the nth variable node to the mth check node in the ith iteration, and L_n denotes the initial data information of the nth variable node.
And returning to the step of calculating the transmission information transmitted to different variable nodes by each check node in parallel based on the flood scheduling decoding algorithm and the SIMD instruction set until the iteration number reaches the iteration number threshold, and outputting the transmission information transmitted to the different variable nodes by each check node and the transmission information transmitted to the different check nodes by each variable node after the last iteration.
And according to the transfer information passed from each check node to the different variable nodes and the transfer information passed from each variable node to the different check nodes after the last iteration, the posterior probability of each variable node is computed in parallel.
The parallel computation of the posterior probability of each variable node according to the transfer information passed from each check node to the different variable nodes and the transfer information passed from each variable node to the different check nodes after the last iteration specifically includes: computing in parallel, using the formula
L_n^{APP} = V_{nm}^{(\max)} + C_{mn}^{(\max)},
the posterior probability of each variable node. Here L_n^{APP} denotes the posterior probability of the nth variable node, V_{nm}^{(\max)} denotes the transfer information passed from the nth variable node to the mth check node after the last iteration, C_{mn}^{(\max)} denotes the transfer information passed from that same mth check node to the nth variable node after the last iteration, and max denotes the iteration-number threshold.
And carrying out hard decision operation on the posterior probability of each variable node in parallel to obtain a decoding result.
The parallel hard-decision operation on the posterior probability of each variable node to obtain the decoding result specifically includes: performing in parallel, using the formula
E_n = \begin{cases} 1, & L_n^{APP} > 0 \\ 0, & L_n^{APP} \le 0 \end{cases},
the hard-decision operation on the posterior probability of each variable node to obtain the decoding result. Here L_n^{APP} denotes the posterior probability of the nth variable node and E_n denotes the decoding result of the nth variable node.
And splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each data to be decoded.
After splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each piece of data to be decoded, the method further comprises: performing bit compression on the decoding result corresponding to each piece of data to be decoded, and storing the compressed decoding result into Shared Memory.
Example 2
The present invention also provides a parallel decoding system, including a GPU, the GPU including:
The Ordering Kernel module comprises code_number/4 × N threads; each thread in the Ordering Kernel module executes in parallel the step of merging the data to be decoded in the data set to be decoded into 32-bit data by adopting a SIMD instruction set, sorting them, and storing the sorting result in a 32-bit array. The subscript of the 32-bit array indicates the position, within the codeword, of the data to be decoded corresponding to the 32-bit data at that subscript. code_number denotes the number of codewords in the data set to be decoded, and N denotes the number of columns of the check matrix.
The CN_Computer Kernel module comprises code_number/4 × M threads; each thread in the CN_Computer Kernel module executes in parallel the step of computing the transfer information passed from each check node to the different variable nodes.
The VN_Computer Kernel module comprises code_number/4 × N threads; each thread in the VN_Computer Kernel module executes in parallel the step of computing the transfer information passed from each variable node to the different check nodes.
Each thread in the VN_Computer Kernel module also executes in parallel, when the number of iterations reaches the iteration-number threshold, the computation of the posterior probability of each variable node according to the transfer information passed from each check node to the different variable nodes and the transfer information passed from each variable node to the different check nodes after the last iteration.
The Hard_decision Kernel module comprises code_number/4 × N threads; each thread in the Hard_decision Kernel module executes in parallel the hard-decision operation on the posterior probability of each variable node to obtain the decoding result.
The Reordering Kernel module comprises code_number/8 × N threads; the Reordering Kernel module executes in parallel the step of splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each piece of data to be decoded.
When the existing GPU-based decoder designs decode large-scale data, a large transmission delay arises in the data transfer between the GPU and the CPU, but most current research work does not optimize this time, which limits the throughput improvement of the decoder. Therefore the data to be decoded and the check matrix are processed at the CPU side so as to reduce the transmission delay. Specifically: the parallel decoding system comprises a CPU; the CPU is configured to perform a quantization operation on the data to be decoded in the data set to be decoded, perform a compression operation on the check matrix, and store the quantized data to be decoded and the check matrix information in a page-locked memory area.
Example 3
Embodiment 3 of the present invention provides a specific implementation manner of the parallel decoding system when the parallel decoding system includes a CPU and a GPU.
The current main approaches to improving the throughput of GPU-based LDPC decoders optimize only some characteristics of the decoder, so the optimization of the decoder is not comprehensive enough and causes various problems, such as: the GPU performance cannot be fully exploited, so performance is wasted; the scheduling structure of the adopted decoding algorithm cannot be fully expanded in parallel, so the intra-code parallelism is limited; the way the throughput is improved is limited by the performance of the GPU; the data transmission is not optimized, which restricts the throughput improvement; and decoding universality is lacking, since the optimization targets only general-purpose GPU devices.
The parallel decoding system can fully utilize memory resources in the GPU, can fully and parallelly expand LDPC decoding operation, further improves the parallelism of the decoder without being limited by the performance of the GPU, can optimize data transmission between the GPU and the CPU, and can be applied to low-power-consumption embedded GPU equipment.
The parallel decoding method and the system provided by the invention are a universal high-throughput LDPC decoder, namely, the LDPC decoding operation with any code length and code rate can be realized on the premise of providing high throughput. While supporting the use of normalized min-sum (NMS) decoding algorithms or offset min-sum (OMS) decoding algorithms.
The parallel decoding method and the parallel decoding system provided by the invention have the advantages that the throughput rate is improved mainly through four points, firstly, the memory resources in the GPU are fully utilized, and the memory access efficiency during decoding is improved; secondly, the decoding algorithm adopts the Flooding scheduling, the LDPC decoding algorithm can be completely and parallelly expanded, and the parallelism in the code is maximum; thirdly, by utilizing the SIMD instruction set of the GPU, the inter-code parallelism of the decoder can be further expanded without being limited by the performance of the GPU; and fourthly, data transmission between the GPU and the CPU is optimized, and data transmission delay is reduced, so that the throughput rate is further improved.
The specific implementation steps of the parallel decoding method and system of the present invention are shown in fig. 1, the schematic diagram of the application distribution of the GPU side memory is shown in fig. 2, and the flow of decoding 4 codeword data at the GPU side is shown in fig. 3. M, N in fig. 2 and 3 are the number of rows and columns, respectively, of the check matrix H.
The parallel decoding system of the invention operates in seven main steps. Step one: transmit the data to be decoded from the CPU side to the GPU side. Step two: sort the data to be decoded at the GPU side. Step three: execute the decoding operation at the GPU side to decode the transmitted data to be decoded. Step four: execute the hard-decision operation at the GPU side and obtain the decoding result through the hard decision. Step five: execute the reordering operation on the decoding result at the GPU side. Step six: execute the Bit compression operation at the GPU side to compress the decoding result after the hard decision. Step seven: transmit the Bit-compressed decoding result from the GPU back to the CPU. Let the number of codewords in the data to be decoded be code_number; when a check matrix H with M rows and N columns is adopted, the parallel decoding system realizes parallel decoding with the following steps:
the method comprises the following steps: transmitting data to be decoded from CPU to GPU
And transmitting the data to be decoded from the CPU end to the GPU end, wherein the specific operation of the step is as follows:
1. This step requires 8-bit quantization of the data to be decoded.
After quantization, the amount of data transmitted from the CPU side to the GPU side is only 1/4 of that of a 32-bit quantization scheme. The specific quantization steps are as follows:
(1) initializing an expansion coefficient Expand _ factor, a data upper limit Pos _ LLR and a data lower limit Neg _ LLR according to user requirements.
(2) Multiplying the data to be decoded by the expansion coefficient Expand _ factor, updating the data to be decoded, and rounding the updated data to be decoded.
(3) After rounding is finished, carrying out amplitude limiting on data to be decoded according to the upper limit Pos _ LLR and the lower limit Neg _ LLR.
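A minimal host-side sketch of the 8-bit quantization just described (expand, round, clip); the function name is an illustrative assumption, and the parameters Expand_factor, Pos_LLR and Neg_LLR are chosen by the user as stated above.

```cuda
#include <cmath>
#include <cstdint>
#include <vector>
#include <algorithm>

// Quantize floating-point LLRs to 8 bits: scale by Expand_factor, round,
// then clip to [Neg_LLR, Pos_LLR].
std::vector<int8_t> quantize_llr(const std::vector<float>& llr,
                                 float expand_factor,   // Expand_factor
                                 int pos_llr,           // Pos_LLR, e.g. 127
                                 int neg_llr) {         // Neg_LLR, e.g. -127
    std::vector<int8_t> q(llr.size());
    for (size_t i = 0; i < llr.size(); ++i) {
        int v = (int)std::lround(llr[i] * expand_factor);   // expand and round
        v = std::max(neg_llr, std::min(pos_llr, v));         // clip
        q[i] = (int8_t)v;
    }
    return q;
}
```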
2. After the data to be decoded are quantized, the check matrix H is compressed to obtain the check matrix information H_c.
The check matrix H of an LDPC code contains a large number of "0" elements; these elements do not participate in the decoding operation, so storing them wastes storage space. In this step only the non-zero elements of the check matrix H are stored, which reduces the waste of storage space. Taking the 5G check matrix that uses base graph BG1 with an expansion factor of 384 as an example, after the compression of this step the occupied space is only 0.03% of the original. The specific steps of the compression operation are shown in fig. 4 and are as follows:
(1) Count the number of non-zero elements in each row of the check matrix H, i.e. the row weight Row_degree of each row.
(2) Rearrange the check matrix according to the row weight Row_degree of each row, sorting the rows of the check matrix from large to small row weight.
(3) Record the positions of the non-zero elements in each row after sorting, and then store only the row weight and the position information of the non-zero elements of each row of the check matrix H; this constitutes the check matrix information H_c.
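The compression of H into H_c described above can be sketched as follows; the data layout and names are illustrative assumptions. The row weights are counted, the rows are sorted by descending weight, and only each row's weight and the column positions of its non-zero elements are kept.

```cuda
#include <cstdint>
#include <vector>
#include <numeric>
#include <algorithm>

// Compressed check-matrix information H_c: per (reordered) row, its weight and
// the column indices of the non-zero entries. Layout is illustrative only.
struct CompressedH {
    std::vector<int> row_order;                 // original row index after sorting
    std::vector<int> row_degree;                // Row_degree of each sorted row
    std::vector<std::vector<int>> positions;    // non-zero column positions per row
};

CompressedH compress_check_matrix(const std::vector<std::vector<uint8_t>>& H) {
    int M = (int)H.size();
    CompressedH hc;
    std::vector<int> degree(M);
    for (int m = 0; m < M; ++m)
        degree[m] = (int)std::count(H[m].begin(), H[m].end(), (uint8_t)1);

    hc.row_order.resize(M);
    std::iota(hc.row_order.begin(), hc.row_order.end(), 0);
    std::stable_sort(hc.row_order.begin(), hc.row_order.end(),
                     [&](int a, int b) { return degree[a] > degree[b]; });

    for (int m : hc.row_order) {
        hc.row_degree.push_back(degree[m]);
        std::vector<int> pos;
        for (int n = 0; n < (int)H[m].size(); ++n)
            if (H[m][n]) pos.push_back(n);
        hc.positions.push_back(std::move(pos));
    }
    return hc;
}
```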
3. The quantized data to be decoded and the check matrix information H_c are stored at the CPU side in page-locked memory.
The physical address of page-locked memory does not change after allocation, so no new addressing operation has to be performed during data transmission, which improves the transmission efficiency. Tests show that with page-locked memory the transmission rate between the CPU and the GPU is improved by about 3 times.
Specifically, cuda_chk_alloc is called to store the data to be decoded and the check matrix information in a page-locked memory area on the CPU side.
4. The data to be decoded and the check matrix information H_c are transmitted from the page-locked memory area at the CPU side to the GPU side over the PCIe bus.
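A hedged sketch of steps 1.3 and 1.4 using the standard CUDA runtime API (cudaHostAlloc, cudaMemcpy); cuda_chk_alloc in the text appears to be the patent's own wrapper, so a plain runtime-API equivalent with illustrative names and sizes is shown here instead, with error checking omitted.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>

// Allocate page-locked (pinned) host buffers, fill them with the quantized LLRs
// and the compressed check-matrix information, then copy both to the GPU.
void upload_to_gpu(const int8_t* llr_src, size_t llr_bytes,
                   const int32_t* hc_src, size_t hc_bytes,
                   int8_t** d_llr, int32_t** d_hc) {
    int8_t*  h_llr = nullptr;
    int32_t* h_hc  = nullptr;
    // Pinned memory keeps a fixed physical address, so no re-addressing is
    // needed during the PCIe transfer.
    cudaHostAlloc((void**)&h_llr, llr_bytes, cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_hc,  hc_bytes,  cudaHostAllocDefault);
    memcpy(h_llr, llr_src, llr_bytes);
    memcpy(h_hc,  hc_src,  hc_bytes);

    cudaMalloc((void**)d_llr, llr_bytes);
    cudaMalloc((void**)d_hc,  hc_bytes);
    cudaMemcpy(*d_llr, h_llr, llr_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(*d_hc,  h_hc,  hc_bytes,  cudaMemcpyHostToDevice);

    cudaFreeHost(h_llr);
    cudaFreeHost(h_hc);
}
```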
Step two: executing sorting operation on data to be decoded at GPU (graphics processing Unit)
Because the SIMD instruction set of the GPU is introduced, the data needs to be sequenced after the GPU end receives the data to be decoded.
This step performs the sorting operation in parallel through the designed Ordering Kernel module, which reduces the sorting time. The number of threads of the Ordering Kernel is code_number/4 × N. The sorting time of the Ordering Kernel module is independent of the number of decoding iterations; if the number of iterations is large, the sorting time can be neglected. Temporary variables generated during sorting are stored in the Registers of the corresponding threads.
The specific operation of the Ordering Kernel module is shown in fig. 5, and the specific operation steps are as follows:
1. firstly, grouping data to be decoded, and grouping the data to be decoded of every 4 code words. And the redundant data of less than 4 code words is expanded into a group of data by zero padding.
2. And extracting the grouped data to be decoded, and extracting the data at the same position in each group of 4 code words.
3. Because the data to be decoded is subjected to 8-bit quantization, 4 pieces of 8-bit data at the same position in each group are extracted and combined into 1 piece of 32-bit data.
4. And finally storing the data to be decoded in a 32-bit array, wherein subscripts of the array correspond to the positions of the data to be decoded in the code words.
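A minimal CUDA sketch of the Ordering step described above: each thread gathers the 8-bit value at one codeword position from 4 consecutive codewords and packs the 4 bytes into one 32-bit word whose array index encodes the position within the codeword. The memory layout and names are illustrative assumptions.

```cuda
#include <cstdint>

// llr_in : code_number codewords of N int8 values each, codeword-major
//          (zero-padded so that code_number is a multiple of 4)
// llr_out: (code_number/4) * N packed 32-bit words; for group g, index g*N + n
//          holds the 4 bytes taken from position n of codewords 4g .. 4g+3.
__global__ void ordering_kernel(const int8_t* llr_in, uint32_t* llr_out,
                                int code_number, int N) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int groups = code_number / 4;
    if (tid >= groups * N) return;

    int g = tid / N;          // which group of 4 codewords
    int n = tid % N;          // position inside the codeword

    uint32_t packed = 0;
    for (int k = 0; k < 4; ++k) {
        uint8_t b = (uint8_t)llr_in[(size_t)(4 * g + k) * N + n];
        packed |= (uint32_t)b << (8 * k);   // byte k of the 32-bit word
    }
    llr_out[(size_t)g * N + n] = packed;
}
```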
Step three: performing decoding processing operation at GPU terminal
After finishing sequencing the data to be decoded transmitted to the GPU, starting to execute decoding operation, wherein the decoding operation comprises the following specific steps:
1. an initialization operation is first performed on variables required for decoding.
In order to speed up the initialization, an LLR_Init Kernel module is designed at the GPU side to complete the initialization in parallel. The number of threads of the LLR_Init Kernel module is code_number/4 × N; each thread in the module corresponds to one VN node and initializes the VN node information from the data to be decoded L_n. Since the data to be decoded L_n do not change during the operation, they can be treated as read-only variables; however, adjacent threads read L_n data at adjacent positions during decoding, so storing them in Shared Memory reduces the number of memory accesses and improves the memory-access efficiency.
2. The iterative decoding is started with the number of decoding iterations i set to 0.
3. Perform the check node (CN) related operations.
The CN operations of this step are completed by the CN_Computer Kernel module designed at the GPU side. The number of threads of the CN_Computer Kernel module is code_number/4 × M. The CN_Computer Kernel can take the form of the offset min-sum (OMS) or the normalized min-sum (NMS) algorithm. If the LDPC decoding algorithm is set to the OMS algorithm, the calculation formulas are:
\mathrm{sign}_{c2v}^{(i)} = \prod_{n' \in N(m) \setminus n} \mathrm{sign}\big(V_{n'm}^{(i-1)}\big)    (1)
C_{mn}^{(i)} = \mathrm{sign}_{c2v}^{(i)} \cdot \max\Big( \min_{n' \in N(m) \setminus n} \big| V_{n'm}^{(i-1)} \big| - \beta,\; 0 \Big)    (2)
If the LDPC decoding algorithm is set to the NMS algorithm, formula (2) is changed into:
C_{mn}^{(i)} = \alpha \cdot \mathrm{sign}_{c2v}^{(i)} \cdot \min_{n' \in N(m) \setminus n} \big| V_{n'm}^{(i-1)} \big|    (3)
where N is the number of columns of the check matrix H, N(m) is the set of VN nodes connected to the mth CN node, i is the current decoding iteration number, the parameter α is the NMS normalization parameter, and the parameter β is the OMS offset parameter. C_{mn} is the transfer information (C2V) from the mth CN to the nth VN node, V_{nm} is the transfer information (V2C) from the nth VN node to the mth CN node, \mathrm{sign}_{c2v} is the sign bit value of C_{mn}, and min1 and min2 are respectively the minimum and the second minimum of the magnitudes of all the V2C transfer information arriving at the CN.
Since the V2C transfer information V_{nm} and the C2V transfer information C_{mn} are used frequently by different threads during iterative decoding, storing them in Shared Memory further improves the memory-access efficiency. The \mathrm{sign}_{c2v}, min1 and min2 variables are temporary variables generated during the operation and are therefore placed in the cache Registers of the corresponding thread.
When code_number is greater than or equal to 4, this step can process several code blocks in parallel through the SIMD instruction set. The specific instruction operations are as follows:
When computing the value of the C2V transfer information C_{mn}^{(i)}, the V2C transfer information V_{n'm}^{(i-1)} of all VN nodes connected to the corresponding CN node is read, and the sign bits of all V_{n'm}^{(i-1)} except the one with n' = n are multiplied together by means of the xor instruction. During this pass the min instruction is used to record the minimum min1 and the second minimum min2 of |V_{n'm}^{(i-1)}|. Finally, the value of C_{mn}^{(i)} is computed by combining the xor, sub and abs instructions.
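The xor/min/sub/abs instruction sequence described above can be sketched with CUDA's per-byte SIMD intrinsics as follows, where each 32-bit word carries the 8-bit messages of 4 codewords. The helper names are illustrative assumptions, min1 and min2 are assumed to be initialised to 0x7F7F7F7F before the accumulation loop, and the snippet is a sketch of the OMS variant rather than the claimed kernel.

```cuda
#include <cstdint>

// Accumulate one incoming V2C word into the running per-byte sign product and
// the per-byte minimum / second-minimum magnitudes.
__device__ void cn_accumulate(uint32_t v2c, uint32_t& sign_acc,
                              uint32_t& min1, uint32_t& min2) {
    sign_acc ^= (v2c & 0x80808080u);              // xor of the 4 sign bits
    uint32_t mag = __vabs4(v2c);                  // per-byte |V|
    min2 = __vminu4(min2, __vmaxu4(min1, mag));   // keep the two smallest magnitudes
    min1 = __vminu4(min1, mag);
}

// Form the outgoing C2V word for one edge (OMS): switch to min2 where this edge
// itself holds min1, subtract the offset beta with per-byte saturation, and
// re-apply the product sign with this edge's own sign removed.
__device__ uint32_t cn_output(uint32_t v2c, uint32_t sign_acc,
                              uint32_t min1, uint32_t min2, uint32_t beta4) {
    uint32_t mag      = __vabs4(v2c);
    uint32_t is_min1  = __vcmpeq4(mag, min1);               // 0xFF where this edge holds min1
    uint32_t excl     = (is_min1 & min2) | (~is_min1 & min1);
    uint32_t oms      = __vsubus4(excl, beta4);              // max(excl - beta, 0) per byte
    uint32_t sign     = sign_acc ^ (v2c & 0x80808080u);      // remove own sign from product
    uint32_t neg_mask = __vcmpgts4(0u, sign);                // 0xFF where result is negative
    uint32_t negated  = __vadd4(oms ^ 0xFFFFFFFFu, 0x01010101u); // per-byte two's complement
    return (neg_mask & negated) | (~neg_mask & oms);
}
```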
4. Correlation operation of Variable Nodes (VN) is performed.
The VN related operations of this step are performed in parallel by the VN_Computer Kernel module designed at the GPU side. The number of threads of the VN_Computer Kernel is code_number/4 × N. The specific calculation formulas of each thread in the VN_Computer Kernel are:
\mathrm{sum}_n^{(i)} = \sum_{m'=1}^{M} C_{m'n}^{(i)}    (4)
V_{nm}^{(i)} = L_n + \mathrm{sum}_n^{(i)} - C_{mn}^{(i)}    (5)
L_n^{APP(i)} = L_n + \mathrm{sum}_n^{(i)}    (6)
where M is the number of rows of the check matrix H, V_{nm} is the V2C transfer information from the nth VN to the mth CN, sum is the sum of the C2V transfer information C_{mn} of the CNs connected to the VN, L_n^{APP} is the a-posteriori probability (APP) data after each decoding iteration, and L_n is the original VN information data.
The sum variable is stored in the cache Registers of the corresponding thread.
When code_number is greater than or equal to 4, this step can process several code blocks in parallel through the SIMD instruction set. The specific instruction operations are as follows:
When computing the value of the V2C transfer information V_{nm}^{(i)}, the sum of the C2V transfer information C_{mn}^{(i)} of all CN nodes connected to the corresponding VN node is first computed with the add instruction. Then the value of V_{nm}^{(i)} is obtained from the sum value with the sub instruction, and the data are clamped with the max and min instructions.
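A corresponding sketch of the SIMD variable-node update for one VN position of a 4-codeword group; the per-byte saturating add/sub intrinsics play the role of the add/sub and max/min clamping instructions mentioned above, and the names are illustrative assumptions.

```cuda
#include <cstdint>

// c2v[] holds the packed C2V words of the connected check nodes, llr is the
// packed channel word L_n, v2c[] receives the packed V2C words and *app the
// packed a-posteriori word.
__device__ void vn_update_simd(const uint32_t* c2v, int degree,
                               uint32_t llr, uint32_t* v2c, uint32_t* app) {
    uint32_t sum = llr;                       // start from the channel LLR L_n
    for (int k = 0; k < degree; ++k)
        sum = __vaddss4(sum, c2v[k]);         // per-byte signed saturated add
    for (int k = 0; k < degree; ++k)
        v2c[k] = __vsubss4(sum, c2v[k]);      // exclude own C2V message per byte
    *app = sum;                               // a-posteriori value L_n + sum
}
```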
5. Increase the decoding iteration number by 1 and test it: if it does not exceed the maximum number of decoding iterations Max_iter_number, execution restarts from step 3; otherwise decoding stops and the next step is entered.
Step four: executing hard decision operation at GPU end to obtain decoding result
After the maximum number of iterations has been reached, a hard decision must be made on the value L_n^{APP} of each VN node to obtain the decoding result E_n. In this step a Hard_decision Kernel module is designed to process the hard decisions in parallel. The number of threads of the Hard_decision Kernel is code_number/4 × N. The decision result is 1 when L_n^{APP} is greater than 0 and 0 when it is less than or equal to 0.
The hard decision formula is:
E_n = \begin{cases} 1, & L_n^{APP} > 0 \\ 0, & L_n^{APP} \le 0 \end{cases}    (7)
when the code _ number is greater than or equal to 4, the parallelization processing mode is performed in combination with the SIMD instruction set in the step, and the specific instruction operation is as follows:
the hard decision operation is completed by calling a min instruction, and when the hard decision operation is greater than 0, the decision result is 1, and when the hard decision operation is less than 0, the decision result is 0.
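The per-byte hard decision can also be sketched with a per-byte signed comparison against zero, an equivalent of the instruction-level operation described above; the function name is an illustrative assumption.

```cuda
#include <cstdint>

// Compare each signed byte of the packed APP word with zero and keep one 0/1
// decision per byte (1 where APP > 0, as described above).
__device__ uint32_t hard_decision_simd(uint32_t app) {
    uint32_t gt_zero = __vcmpgts4(app, 0u);   // 0xFF per byte where APP > 0
    return gt_zero & 0x01010101u;             // keep one decision bit per byte
}
```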
Step five: reordering decoding results at GPU end
Since the data is sorted before being decoded, the data needs to be reordered and put back to the original position after the decoding is finished.
The reordering operation of this step is performed by the Reordering Kernel module designed at the GPU side, which reduces the reordering time. The number of threads of the Reordering Kernel is code_number/8 × N. The Reordering Kernel is executed only once after decoding ends, and its running time is independent of the number of decoding iterations; if the number of iterations is large, this time can be neglected. The specific operation steps are shown in fig. 5 and are as follows:
1. and splitting a decoding result. Each 32-bit data is split into 4 8-bit data.
2. After splitting, the split data are put back to the corresponding positions of the 4 codewords according to the subscript of the 32-bit data array.
Step six: performing Bit compression processing on the decoding result at the GPU end
After the reordering is finished, the decoding result is subjected to Bit compression processing in the step.
The Bit compression operation of this step compresses the data with the Bit_Packed Kernel module designed at the GPU side. The number of threads of the Bit_Packed Kernel module is code_number/8 × N. This step compresses the final decoding result so as to reduce the occupied storage space. Because the decoding results obtained by the hard decision after the decoding iterations are all "0" or "1", each value needs only 1 bit when represented in binary. This step merges every 8 results into 1 byte of data by shift operations, and only the "0"/"1" information is passed back to the CPU. After compression, the storage space of the decoding result becomes 1/8 of what it was. The Bit compression is completed by the Bit_Packed Kernel at the GPU side, and each thread in the Bit_Packed Kernel processes 8 decoding result values. The APP_Bit data are the final decoding result obtained by the Bit compression; their volume is small, and Shared Memory storage is adopted to improve the access efficiency. The specific operations of each thread in the Bit_Packed Kernel are as follows:
1. firstly, 8 adjacent decoding result data are extracted.
2. The 8 data are merged into 1 byte of data by a shift operation.
3. The merged data are stored into Shared Memory.
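A minimal CUDA sketch of the Bit compression step described above: every thread merges 8 adjacent 0/1 decoding results into one byte by shift operations and stages it in Shared Memory. Buffer names and the bit order are illustrative assumptions.

```cuda
#include <cstdint>

// hard_bits: one 0/1 value per decoded bit; packed: one byte per 8 results.
__global__ void bit_packed_kernel(const uint8_t* hard_bits, uint8_t* packed,
                                  int total_bits) {
    extern __shared__ uint8_t app_bit[];      // compressed results (APP_Bit)
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid * 8 >= total_bits) return;

    uint8_t byte = 0;
    for (int k = 0; k < 8; ++k)               // merge 8 results into 1 byte
        byte |= (hard_bits[tid * 8 + k] & 0x1u) << k;

    app_bit[threadIdx.x] = byte;              // stage in Shared Memory
    packed[tid] = app_bit[threadIdx.x];       // copy out for the transfer to the CPU
}
```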
Step seven: the APP_Bit data after Bit compression are returned from the Shared Memory area at the GPU side to the memory at the CPU side.
Meanwhile, the invention adopts an 8-Bit quantization mode and compresses the decoding result in a Bit compression mode, so that the data volume transmitted from the GPU back to the CPU is further compressed, and the transmission time is changed into 1/32 of a 32-Bit quantization mode scheme.
Example 4
At present, the main approaches to improving the throughput of GPU-based LDPC decoders optimize only some characteristics of the decoder, and this incomplete optimization causes various problems, such as: the GPU performance cannot be fully exploited, so performance is wasted; the scheduling structure of the adopted decoding algorithm cannot be fully expanded in parallel, so parallelism is limited; the way throughput is improved is constrained by the performance of the GPU; data transmission is not optimized, which hinders the improvement of throughput; and decoding universality is lacking, because optimization targets only general-purpose GPU devices.
The general-purpose high-throughput GPU LDPC decoder designed by the invention was tested on a general RTX2080Ti platform, where the throughput reaches up to 8.67 Gb/s with 10 decoding iterations. Meanwhile, to verify the decoding universality of the LDPC decoder, it was also tested on the low-power embedded GPU device Jetson Xavier NX, where the throughput reaches 1.14 Gb/s with 10 iterations. The verification flow of the test is shown in fig. 6.
The parallel decoding method and system provided by the invention can realize LDPC decoding for any code length and code rate while providing a high throughput rate, and support both the normalized min-sum (NMS) and offset min-sum (OMS) decoding algorithms. The specific improvements are as follows:
(1) The invention adopts the decoding scheduling algorithm best suited to merged processing, so that the decoding algorithm can be fully expanded in parallel at design time, maximizing the intra-code parallelism of the decoder, fully exploiting the performance of the GPU, and maximizing the throughput of the LDPC decoder.
(2) The invention fully optimizes the memory allocation strategy of the decoder at the GPU end and improves memory access efficiency during decoding. For example, the check matrix H is compressed: taking the 5G check matrix that uses BG1 as the base matrix with an expansion factor of 384 as an example, the storage space occupied after the compression described herein is only 0.03% of the original.
(3) The design method provided by the invention introduces the SIMD instruction set of the GPU, improving inter-code parallelism without being limited by the performance of the GPU. The invention adopts 8-bit fixed-point quantization of the data, making full use of the 32-bit-wide SIMD instruction set in the GPU and expanding the inter-code parallelism to 4 times the original.
(4) The design scheme provided by the invention optimizes the data transmission time between the GPU and the CPU during decoding and reduces the data transmission delay. With the 8-bit quantization, the CPU-to-GPU transmission time becomes 1/4 of the original; with the Bit compression of the decoding result, the GPU-to-CPU transmission time becomes 1/8 of the original.
(5) The design scheme provided by the invention can also achieve high throughput performance on low-power-consumption embedded GPU equipment, and has good decoding universality.
The high-throughput GPU-based LDPC decoding method and system make full use of the memory resources in the GPU and improve memory access efficiency during decoding; the LDPC decoding algorithm is fully expanded in parallel, maximizing the intra-code parallelism; combined with the SIMD instruction set of the GPU, the inter-code parallelism of the decoder can be further expanded without being limited by GPU performance; the data transmission between the GPU and the CPU is optimized and its delay reduced, further improving the throughput; and the method has good decoding universality and is applicable to low-power embedded GPU devices.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A parallel decoding method, characterized in that the parallel decoding method comprises the steps of:
merging the data to be decoded in the data set to be decoded into 32-bit data by adopting the SIMD instruction set, sorting them, and storing the sorting result in a 32-bit array, wherein the subscript of the 32-bit array represents the position, within the code word, of the data to be decoded corresponding to the 32-bit data in the 32-bit array;
calculating in parallel, based on a Flooding scheduling decoding algorithm and the SIMD instruction set, the transfer information transferred from each check node to different variable nodes;
calculating in parallel, based on the Flooding scheduling decoding algorithm and the SIMD instruction set, the transfer information transferred from each variable node to different check nodes;
returning to the step of calculating in parallel, based on the Flooding scheduling decoding algorithm and the SIMD instruction set, the transfer information transferred from each check node to different variable nodes, until the iteration number reaches the iteration number threshold, and outputting the transfer information transferred from each check node to different variable nodes and the transfer information transferred from each variable node to different check nodes after the last iteration;
calculating in parallel the a posteriori probability of each variable node according to the transfer information transferred from each check node to different variable nodes and the transfer information transferred from each variable node to different check nodes after the last iteration;
carrying out, in parallel, a hard decision operation on the a posteriori probability of each variable node to obtain a decoding result; and
splitting and reordering the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array, to obtain the decoding result corresponding to each piece of data to be decoded.
2. The parallel decoding method according to claim 1, wherein the merging of the data to be decoded in the data set to be decoded into 32-bit data by using the SIMD instruction set, the sorting, and the storing of the sorting result in a 32-bit array specifically comprise:
grouping the 8-bit data to be decoded in the data set to be decoded into groups of 4 code words, wherein if the last group has fewer than 4 code words, it is expanded into a full group by zero padding;
adopting a plurality of threads to execute the following operations in parallel on the data to be decoded of each group:
extracting data at the same position in the data to be decoded in each group of 4 code words and merging the data into 32-bit data;
and storing the 32-bit data of each group of 4 code words in a 32-bit array, wherein the subscript of the 32-bit array corresponds to the position of the data to be decoded in the code words.
3. The parallel decoding method according to claim 1, wherein the parallel computing of the transfer information transferred from each check node to different variable nodes based on the Flooding scheduling decoding algorithm and the SIMD instruction set specifically comprises:
calculating in parallel, using the formula

$$S_{mn}^{(i)}=\prod_{n'\in N(m)\setminus n}\operatorname{sign}\!\left(L_{n'm}^{(i-1)}\right),$$

the sign bit value of the transfer information transferred from each check node to different variable nodes;

calculating in parallel, according to the sign bit value of the transfer information transferred from each check node to different variable nodes, using the formula

$$L_{mn}^{(i)}=S_{mn}^{(i)}\cdot\max\!\left(\mathrm{mag}_{mn}-\beta,\;0\right),\qquad \mathrm{mag}_{mn}=\begin{cases}\mathrm{min1}, & \left|L_{nm}^{(i-1)}\right|\neq \mathrm{min1}\\[2pt] \mathrm{min2}, & \left|L_{nm}^{(i-1)}\right|= \mathrm{min1},\end{cases}$$

the transfer information transferred from each check node to different variable nodes;

wherein $S_{mn}^{(i)}$ denotes the sign bit value of the transfer information transferred from the m-th check node to the n-th variable node in the i-th iteration, $L_{n'm}^{(i-1)}$ denotes the transfer information transferred from the n'-th variable node to the m-th check node in the (i-1)-th iteration, $N(m)$ denotes the set of variable nodes connected to the m-th check node, N denotes the number of variable nodes, $L_{mn}^{(i)}$ denotes the transfer information transferred from the m-th check node to the n-th variable node in the i-th iteration, min1 and min2 respectively denote the minimum value and the second minimum value among the magnitudes of the transfer information received by the check node from its variable nodes before the current iteration, and β denotes the offset parameter of the OMS algorithm.
4. The parallel decoding method according to claim 1, wherein the parallel computing of the transfer information transferred from each check node to different variable nodes based on the Flooding scheduling decoding algorithm and the SIMD instruction set specifically comprises:
calculating in parallel, using the formula

$$S_{mn}^{(i)}=\prod_{n'\in N(m)\setminus n}\operatorname{sign}\!\left(L_{n'm}^{(i-1)}\right),$$

the sign bit value of the transfer information transferred from each check node to different variable nodes;

calculating in parallel, according to the sign bit value of the transfer information transferred from each check node to different variable nodes, using the formula

$$L_{mn}^{(i)}=\alpha\cdot S_{mn}^{(i)}\cdot \mathrm{mag}_{mn},\qquad \mathrm{mag}_{mn}=\begin{cases}\mathrm{min1}, & \left|L_{nm}^{(i-1)}\right|\neq \mathrm{min1}\\[2pt] \mathrm{min2}, & \left|L_{nm}^{(i-1)}\right|= \mathrm{min1},\end{cases}$$

the transfer information transferred from each check node to different variable nodes;

wherein $S_{mn}^{(i)}$ denotes the sign bit value of the transfer information transferred from the m-th check node to the n-th variable node in the i-th iteration, $L_{n'm}^{(i-1)}$ denotes the transfer information transferred from the n'-th variable node to the m-th check node in the (i-1)-th iteration, N denotes the number of columns of the check matrix, min1 and min2 respectively denote the minimum value and the second minimum value among the magnitudes of the transfer information received by the check node from its variable nodes before the current iteration, α denotes the normalization parameter of the NMS algorithm, and $L_{mn}^{(i)}$ denotes the transfer information transferred from the m-th check node to the n-th variable node in the i-th iteration.
5. The parallel decoding method according to claim 1, wherein the parallel computing of the transfer information transferred from each variable node to different check nodes based on the Flooding scheduling decoding algorithm and the SIMD instruction set specifically comprises:
calculating in parallel, using the formula

$$T_{n}^{(i)}=\sum_{m\in M(n)} L_{mn}^{(i)},$$

the sum of the transfer information transferred to each variable node by the different check nodes;

calculating in parallel, according to the sum of the transfer information transferred to each variable node by the different check nodes, using the formula

$$L_{nm}^{(i)}=C_{n}+T_{n}^{(i)}-L_{mn}^{(i)},$$

the transfer information transferred from each variable node to different check nodes;

wherein $T_{n}^{(i)}$ denotes the sum of the transfer information transferred by the different check nodes to the n-th variable node in the i-th iteration, $L_{mn}^{(i)}$ denotes the transfer information transferred from the m-th check node to the n-th variable node, $M(n)$ denotes the set of check nodes connected to the n-th variable node, M denotes the number of rows of the check matrix, $L_{nm}^{(i)}$ denotes the transfer information transferred from the n-th variable node to the m-th check node in the i-th iteration, and $C_{n}$ denotes the initial data information of the n-th variable node.
6. The parallel decoding method according to claim 1, wherein the parallel computing of the a posteriori probability of each variable node according to the transfer information transferred from each check node to different variable nodes and the transfer information transferred from each variable node to different check nodes after the last iteration specifically comprises:

calculating in parallel, according to the transfer information transferred from each check node to different variable nodes and the transfer information transferred from each variable node to different check nodes after the last iteration, using the formula

$$APP_{n}=L_{nm}^{(\max)}+L_{mn}^{(\max)},$$

the a posteriori probability of each variable node;

wherein $APP_{n}$ denotes the a posteriori probability of the n-th variable node, $L_{nm}^{(\max)}$ denotes the transfer information transferred from the n-th variable node to the m-th check node after the last iteration, $L_{mn}^{(\max)}$ denotes the transfer information transferred from the m-th check node to the n-th variable node (with m equal to n) after the last iteration, and max denotes the iteration number threshold.
7. The parallel decoding method according to claim 1, wherein the parallel hard decision operation is performed on the a posteriori probability of each variable node to obtain a decoding result, and specifically comprises:
carrying out in parallel, respectively, the hard decision operation on the a posteriori probability of each variable node using the formula

$$E_{n}=\begin{cases}1, & APP_{n}>0\\[2pt] 0, & APP_{n}<0,\end{cases}$$

to obtain the decoding result;

wherein $APP_{n}$ denotes the a posteriori probability of the n-th variable node, and $E_{n}$ denotes the decoding result of the n-th variable node.
8. The parallel decoding method according to claim 1, wherein, after the splitting and reordering of the decoding result corresponding to each 32-bit array according to the subscript of each 32-bit array to obtain the decoding result corresponding to each piece of data to be decoded, the method further comprises:
carrying out Bit compression on the decoding result corresponding to each piece of data to be decoded, and storing the compressed decoding results into the Shared Memory.
9. A parallel decoding system, comprising a GPU, the GPU comprising:
an Ordering Kernel module, wherein the Ordering Kernel module comprises code_number/4 × N threads, and each thread in the Ordering Kernel module executes in parallel the step of merging the data to be decoded in the data set to be decoded into 32-bit data by adopting the SIMD instruction set, sorting them, and storing the sorting result in a 32-bit array, the subscript of the 32-bit array representing the position, within the code word, of the data to be decoded corresponding to the 32-bit data in the 32-bit array; code_number denotes the number of code words in the data set to be decoded, and N denotes the number of columns of the check matrix;
a CN_computer Kernel module, wherein the CN_computer Kernel module comprises code_number/4 × M threads, and each thread in the CN_computer Kernel module executes in parallel the step of calculating the transfer information transferred from each check node to different variable nodes, M denoting the number of rows of the check matrix;
a VN_computer Kernel module, wherein the VN_computer Kernel module comprises code_number/4 × N threads, and each thread in the VN_computer Kernel module executes in parallel the step of calculating the transfer information transferred from each variable node to different check nodes;
each thread in the VN_computer Kernel module also executes in parallel, when the iteration number reaches the iteration number threshold, the step of calculating the a posteriori probability of each variable node according to the transfer information transferred from each check node to different variable nodes and the transfer information transferred from each variable node to different check nodes after the last iteration;
a Hard_decision Kernel module, wherein the Hard_decision Kernel module comprises code_number/4 × N threads, and each thread in the Hard_decision Kernel module executes in parallel a hard decision operation on the a posteriori probability of each variable node to obtain a decoding result; and
a Reordering Kernel module, wherein the Reordering Kernel module comprises code_number/8 × N threads, and the threads of the Reordering Kernel module execute in parallel the steps of splitting and reordering the decoding results corresponding to the 32-bit arrays according to the subscripts of the 32-bit arrays, to obtain the decoding result corresponding to each piece of data to be decoded.
10. The parallel decoding system of claim 9, wherein the parallel decoding system comprises a CPU;
the CPU is configured to perform a quantization operation on the data to be decoded in the data set to be decoded, perform a compression operation on the check matrix, and store the quantized data to be decoded and the check matrix information into a page-locked memory area.