CN115118289B - Coding method of 5G LDPC encoder based on GPU - Google Patents

Coding method of 5G LDPC encoder based on GPU

Info

Publication number
CN115118289B
CN115118289B CN202211037856.7A
Authority
CN
China
Prior art keywords
bits
information
gpu
matrix
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211037856.7A
Other languages
Chinese (zh)
Other versions
CN115118289A (en)
Inventor
刘荣科
李岩松
田铠瑞
王若诗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202211037856.7A
Publication of CN115118289A
Application granted
Publication of CN115118289B
Legal status: Active


Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/11 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits using multiple parity bits
    • H03M13/1102 Codes on graphs and decoding on graphs, e.g. low-density parity check [LDPC] codes
    • H03M13/1148 Structural properties of the code parity-check or generator matrix
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781 On-chip cache; Off-chip memory
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/65 Purpose and implementation aspects
    • H03M13/6569 Implementation on processors, e.g. DSPs, or software implementations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00 Arrangements for detecting or preventing errors in the information received
    • H04L1/004 Arrangements for detecting or preventing errors in the information received by using forward error control
    • H04L1/0056 Systems characterized by the type of code used
    • H04L1/0061 Error detection codes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026 PCI express

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Error Detection And Correction (AREA)

Abstract

The invention provides a coding method of a GPU-based 5G LDPC encoder, which comprises the following steps: 1: initializing the storage space of the host end; 2: initializing the storage space of the GPU device end; 3: initializing the LDPC check matrix information at the GPU device end; 4: copying the coding indication information and data information from the host to the GPU; 5: the host end calls a GPU function to perform LDPC coding; 6: the GPU preprocesses the received information bits; 7: the GPU transfers the processed data information to the high-speed on-chip memory; 8: the GPU encodes the LDPC code in parallel; 9: the GPU compresses the calculated parity check bits; 10: the compressed information is transmitted back to the host end; 11: the host end splices the received compressed check information and the information bits to form the complete LDPC coded information. The invention improves the throughput of the encoder and gives the encoder the characteristics of flexibility and high speed.

Description

Coding method of 5G LDPC encoder based on GPU
Technical Field
The invention belongs to the technical field of communication and relates to a coding method of a 5G low-density parity-check (LDPC) encoder based on a graphics processing unit (GPU).
Background
Low-density parity-check (LDPC) codes play an important role in 5G communications and have been selected as the coding scheme of the enhanced mobile broadband (eMBB) data channels, which are encoded with quasi-cyclic LDPC (QC-LDPC) codes. Because 5G QC-LDPC codes have many check matrices and lifting values, LDPC codewords of different forms can be combined, which makes the design of the encoder extremely difficult. Considering the air-interface resource allocation of a 5G mobile communication system, the uplink and downlink rates generally have a corresponding relationship, so the LDPC encoder of a base station must complete a large number of codeword encodings within a specified time slot, and the encoding process faces a severe latency challenge. Designing a 5G LDPC encoder with low delay, high throughput and low complexity is therefore a difficult problem to be solved.
Nguyen et al. propose a new efficient coding method and a high-throughput, low-complexity encoder architecture with significantly reduced chip area and memory consumption. By storing the quantized permutation information of each sub-matrix instead of the entire parity check matrix, the required memory is reduced while high throughput is maintained (see reference [1]: Nguyen T T B, Nguyen Tan T, Lee H. Efficient QC-LDPC Encoder for 5G New Radio [J]. Electronics, 2019, 8(6): 668). Tian et al. studied the parallel design and implementation of QC-LDPC encoders, using a multi-channel parallel structure to obtain multiple parity check bits and thereby significantly reduce the coding delay; the high-parallelism coding algorithm is mapped onto a configurable circuit structure that can support all 5G NR code lengths and code rates (see reference [2]: Tian Y, Bai Y, Liu D. Low-Latency QC-LDPC Encoder Design for 5G NR [J]. Sensors (Basel, Switzerland), 2021, 21(18): 6266). Liao et al. propose an LDPC coding method based on a GPU. LDPC coding with code rates from 1/2 to 8/9 achieves high throughput on a single GPU, and experiments show that GPU-based parallel simulation tasks can achieve a good balance between performance and cost (see reference [3]: S. Liao, Y. Zhan, Z. Shi and L. Yang, A High Throughput and Flexible Rate 5G NR LDPC Encoder on a Single GPU [C]//2021 23rd International Conference on Advanced Communication Technology (ICACT), 2021, pp. -34).
In recent years, experts and researchers have made many attempts at 5G LDPC high-speed encoders based on dedicated hardware platforms (ASIC, FPGA) and on general-purpose processing platforms. An LDPC encoder based on dedicated hardware can achieve lower delay and higher energy efficiency, but considering the encoding requirement of a base station handling a large number of code blocks, a large number of dedicated chips would have to be deployed in batches to meet the high-throughput performance requirement. The development cycle is therefore long, operation and maintenance are difficult, and it is hard to meet the task-diversification requirements of future communication systems. An LDPC encoder designed on a GPU general-purpose processing platform is realized in software, and deployment and parameter configuration can be performed flexibly through upgrades at the program level; however, current GPU-based LDPC coding algorithms do not fully utilize the GPU computing resources, and the parallel processing design of 5G LDPC encoding still needs to be improved.
Disclosure of Invention
The invention provides a coding method of a GPU-based 5G LDPC encoder, which fully utilizes the GPU computing resources to process LDPC codes with large-scale parallelism and thereby improves the throughput of the encoder; it supports parallel processing of LDPC codes of different code types and gives the encoder the characteristics of flexibility and high speed.
The invention first provides a GPU-based 5G LDPC high-speed encoder whose structure mainly comprises a host end and a GPU device end. The host end is provided with a host memory and a CPU chip; the CPU chip is used for preprocessing the coding information, combining codewords, controlling and scheduling the whole coding process, and sending the information stream to be coded to the GPU device end. The GPU device end is provided with a GPU chip composed of a number of streaming multiprocessors (SM); each SM is responsible for encoding a group of LDPC code blocks of different code types, and the large number of threads started in the processors encode the information streams in parallel at high speed. The host end and the GPU device end transfer data through the high-speed serial computer expansion bus (PCI-E).
The invention also provides a coding method of the GPU-based 5G LDPC encoder. The whole coding process can be divided into 11 steps; based on the above encoder, the specific coding steps are as follows:
Step 1: initialize the storage space of the host end.
Allocate enough storage space at the host end for the information bits and the coded bits according to the maximum number of code blocks num_C processed simultaneously;
Step 2: initialize the storage space of the GPU device end.
The GPU device end configures its memory space, allocates enough global memory space for the information bits and the coded bits, and allocates high-speed on-chip memory according to the maximum resources occupied by the current coding;
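As an illustration only, steps 1 and 2 could be realized with the CUDA runtime API roughly as sketched below; the buffer names, the use of pinned host memory and the worst-case bounds (k_b ≤ 22, m_b ≤ 46, Zc ≤ 384, taken from the 5G NR base graphs) are assumptions of this sketch, not requirements of the method.

```cuda
// Hypothetical sketch of steps 1-2: host-end and GPU-device-end buffer allocation.
#include <cuda_runtime.h>
#include <cstdint>

struct EncoderBuffers {
    uint32_t *h_info, *h_parity;   // host-end packed information / check bits
    uint32_t *d_info, *d_parity;   // GPU global-memory copies
};

bool init_buffers(EncoderBuffers &buf, int num_C) {
    // Worst case per code block, packed into 32-bit words (assumed bounds).
    const size_t info_words   = (size_t)num_C * 22 * (384 / 32);
    const size_t parity_words = (size_t)num_C * 46 * (384 / 32);
    // Step 1: pinned host memory (a design choice that speeds up PCI-E transfers).
    if (cudaMallocHost(&buf.h_info,   info_words   * sizeof(uint32_t)) != cudaSuccess) return false;
    if (cudaMallocHost(&buf.h_parity, parity_words * sizeof(uint32_t)) != cudaSuccess) return false;
    // Step 2: global memory on the GPU device end.
    if (cudaMalloc(&buf.d_info,   info_words   * sizeof(uint32_t)) != cudaSuccess) return false;
    if (cudaMalloc(&buf.d_parity, parity_words * sizeof(uint32_t)) != cudaSuccess) return false;
    return true;
}
```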
Step 3: initialize the LDPC check matrix information at the GPU device end.
At the GPU device end, all LDPC base matrices H_b specified by the 5G protocol are pre-stored, and the offset information H_shift of the base matrices and the column positions H_offset of the cyclic blocks in each row are written into the global memory of the GPU. The base matrix H_b comprises the sub-matrices A, B, C, D and I.
Step 4: the host copies the coding indication information and the data information to the GPU.
The host end writes the coding information of each code block into a structure comprising the base graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the number of coding layers m_b and the correction value remainder. Information data and coded bits are transferred between the host end and the device end in a compressed arrangement, so the structure further contains the starting position src_offset of the information bits of each code block and the starting position dst_offset at which the bits are stored after coding is completed. The coding indication structures of multiple code blocks form an array of coding indication structures, and the host end transfers the whole array of indication structures for the code blocks waiting for parallel processing to the GPU global memory. The host end packs the information bits of each code block into 32-bit words, arranges the packed bits of all code blocks in compressed form, and copies the information bit data to be coded to the GPU global memory.
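A possible layout of the coding indication structure described in this step is sketched below; the field names mirror the quantities listed above, while the concrete types, the field order and the copy call are assumptions of the sketch.

```cuda
// Hypothetical coding-indication structure, one entry per code block (step 4).
#include <cstdint>

struct CodeBlockInfo {
    uint8_t  BG;          // base graph: 1 or 2
    uint16_t Zc;          // lifting value
    uint8_t  k_b;         // number of information bit groups
    uint8_t  B_type;      // check matrix (sub-matrix B) type
    uint8_t  m_b;         // number of coding layers
    int8_t   remainder;   // correction value: Zc % 32 if Zc >= 32, -(Zc % 32) if Zc < 32
    uint32_t src_offset;  // start of this block's packed information bits
    uint32_t dst_offset;  // start for this block's packed check bits after coding
};
// The host fills CodeBlockInfo info[num_C] and copies the whole array to the GPU
// global memory, e.g. cudaMemcpy(d_info, info, num_C * sizeof(CodeBlockInfo),
//                                cudaMemcpyHostToDevice);
```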
Step 5: the host end calls a GPU function to perform LDPC coding.
The number of blocks into which the GPU virtual processors are grouped equals the maximum code block number num_C, the block dimension is one-dimensional, and the multiple blocks process the LDPC codeword coding of different code types in parallel. The host end sets the number of threads started per block to 256, and the thread dimension is one-dimensional.
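The launch configuration of step 5 might look as follows; ldpc_encode_kernel and its parameter list are placeholders for the encoding kernel whose behaviour is described in steps 6 to 9, and the CodeBlockInfo type is the hypothetical structure sketched under step 4.

```cuda
// Hypothetical launch of the encoding kernel (step 5): one block per code block,
// 256 threads per block, both dimensions one-dimensional.
__global__ void ldpc_encode_kernel(const CodeBlockInfo *info,
                                   const uint32_t *info_bits,
                                   uint32_t *parity_bits);

void launch_encoder(const CodeBlockInfo *d_info, const uint32_t *d_info_bits,
                    uint32_t *d_parity_bits, int num_C) {
    dim3 grid(num_C);   // block count equals the maximum code block number num_C
    dim3 block(256);    // 256 threads, i.e. 16 cooperative groups of 16 threads (step 8.1)
    ldpc_encode_kernel<<<grid, block>>>(d_info, d_info_bits, d_parity_bits);
}
```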
Step 6: the GPU preprocesses the received information bits according to the coding indication information.
The threads of each block of the GPU obtain the coding indication information of the corresponding code block from the corresponding entry of the structure array, comprising the base graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the correction value remainder, the starting position src_offset of the information bits and the starting position dst_offset for storing the bits after coding is completed. Then, according to the starting position src_offset of the information bits, the number of information bit groups k_b and the lifting value Zc, the information bits s are obtained: starting from the position indicated by src_offset, a space of ceil(Zc*k_b/32)*4 bytes contains the information bits of the code block, where ceil(*) denotes rounding up. s corresponds to the systematic sub-matrix A, i.e. the information bits received by the encoder; it is divided into k_b information bit subgroups s_i (i = 1, 2, …, k_b), each subgroup corresponding to Zc bits. The GPU cyclically fills each information bit subgroup s_i of the information bits s up to a multiple of 32 according to the correction value remainder, and each subgroup corresponds to one cyclic block of the base matrix.
Step 7: the GPU transfers the processed data information to the high-speed on-chip memory.
Step 8: the GPU encodes the LDPC code in parallel. The encoding of the LDPC code is performed using the check matrix H by solving the equation H·c^T = 0 for the codeword c, where c^T denotes the transpose of the codeword c. The encoding stage comprises the following 3 steps:
Step 8.1: group the threads, establish the mapping relationship between the threads and the information processing, and compute with efficient shift operations the result of multiplying the bits of the k_b information bit subgroups s_i by the check matrix; for each layer j of the check matrix the result is

λ_j = s_1^(h_{j,1}) ⊕ s_2^(h_{j,2}) ⊕ … ⊕ s_{k_b}^(h_{j,k_b}),  j = 1, 2, …, m_b,

where ⊕ denotes modulo-two addition, h_{j,i} is the offset of the cyclic block in row j and column i of the base matrix, and s_i^(h_{j,i}) denotes s_i cyclically left-shifted by h_{j,i} bits (taken as the all-zero group when that cyclic block is zero). The m_b computed group elements λ_1, …, λ_{m_b} are cached in the high-speed on-chip memory in preparation for the calculation of the parity bits in the following steps.
Step 8.2: calculate the first part of the parity check bits p_a according to the coding indication information.
The first part of the parity check bits p_a comprises 4 groups of check bits p_a,z (z = 1, 2, 3, 4), so the first 4 layers of equations in the check matrix H need to be solved. The corresponding equation set is listed according to the type B_type of the check matrix H: the information bit subgroups s_i of step 8.1 are multiplied by the check matrix, the equation set is solved from the multiplication results, the first 4 groups of cached results λ_1, …, λ_4 are read from the high-speed on-chip memory, and the equations are combined to obtain the first part of the check bits p_a.
Step 8.3: calculate the second part of the parity check bits p_c from the p_a of step 8.2.
In the manner of step 8.1, the result of multiplying the first part of the parity bits p_a by the check matrix H is computed and added modulo two, group by corresponding group, to the cached results λ_5, …, λ_{m_b}. Since the part of the check matrix corresponding to the second part of the parity bits p_c is the sub-matrix I, the addition result is exactly the calculation result of the second part of the parity check bits p_c.
Step 9: the GPU compresses the encoded parity check bits. The calculated parity check bits p comprise m_b groups in total; from each group the redundant padding bits are deleted according to the correction value remainder so that each group keeps its Zc valid bits, the valid bits are compressed, and the compressed check bit result is written into the global memory at the bit storage starting position dst_offset.
Step 10: transmit the compressed information in the global memory back to the host end.
Step 11: the host end splices the received compressed check information and the information bits to form the complete LDPC coded information.
The LDPC base matrix H_b information of step 3 includes the following. The 5G NR standard specifies two base matrices, BG1 and BG2, whose structure is shown in formula (1) and whose size is m_b × n_b, where m_b denotes the number of coding layers, i.e. the number of rows of the base matrix, n_b denotes the number of columns of the base matrix, and n_b = m_b + k_b.

H_b = [ A  B  0 ]
      [ C  D  I ]          (1)

The 0 region at the upper right of the base matrix H_b indicates that this part is all zeros; A has dimension 4 × k_b, B has dimension 4 × 4, C has dimension (m_b − 4) × k_b, D has dimension (m_b − 4) × 4, and I has dimension (m_b − 4) × (m_b − 4). The base matrix H_b contains the offsets H_shift corresponding to all cyclic blocks, the column positions H_offset of the cyclic blocks in each row are obtained by calculation, and all this basic information is transferred into the GPU memory before encoding.
The check matrix type B_type of step 4 includes: the 4 × 4 sub-matrix B has 4 different types, and the 5G protocol specifies a fixed B_type for each base graph BG.
The correction value remainder of step 4 includes: different correction values remainder are set according to the magnitude of the lifting value Zc, and the value of remainder is determined from the remainder of Zc divided by 32. When Zc ≥ 32, remainder = Zc % 32; when Zc < 32, remainder = −(Zc % 32). remainder = 0 means that the information bits can be packed exactly into a set of unsigned 32-bit integer variables.
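For reference, this rule can be written as a one-line helper (the function name is illustrative):

```cuda
// Correction value as defined above: Zc % 32 for Zc >= 32, -(Zc % 32) for Zc < 32.
inline int correction_remainder(int Zc) {
    return (Zc >= 32) ? (Zc % 32) : -(Zc % 32);
}
```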
The compressed arrangement of step 4 includes the following. The PCI-E transmission speed is limited; if a fixed step length were used when coding a group of different LDPC code blocks, LDPC codes with smaller lifting values Zc would have to be padded with a large amount of redundant 0 data during encoding, resulting in a long transmission time. Data is therefore transmitted in a form that compresses the data amount: the host end generates a set of offset information giving the starting position of each code block relative to the starting position of the first code block, which indicates the position of each code block within a group of data; the offset of each code block relative to the starting position of the first code block is the sum of the effective lengths of all preceding code blocks. Compared with a fixed step length, for a lifting value Zc = 2 the transmission can save up to 192 times the amount of transmitted data.
The bit packing of step 4 includes: because the information data are bits of value 0 or 1, the information bits are packed into groups of 32 bits, so that GPU resources can be fully utilized; each thread in the GPU operates on 32 bits in parallel, which greatly improves the parallel computation efficiency.
The cyclic filling of step 6 includes: according to the size of the correction value remainder, the Zc bits of data in each information bit subgroup s_i are padded. Because a register in the GPU is 32 bits (4 bytes), the data must be padded to an integer multiple of 32 so that the information data are stored with 32-bit words as basic symbols; each s_i therefore occupies M registers, M = ceil(Zc/32), denoted R_t, t = 1, 2, …, M, and the actually useful information data only cover the first Zc bits of the M registers. Specifically, when remainder = 0, Zc is an integer multiple of 32, the valid information of each group fills all M registers, and no data needs to be padded. When remainder > 0, Zc is not an integer multiple of 32 and Zc > 32; s_i does not fill the M registers, and after the Zc bits actually occupied by each s_i, the remaining data of the last register are completed by filling with the header data bits of the same subgroup s_i. When remainder < 0, Zc < 32, each s_i occupies one register, the actually valid information data only cover the first Zc bits, and the Zc bits are cyclically repeated until the 32 bits are full. The purpose of the cyclic filling is to enable efficient operation on the data and the reverse shifts within the information subgroups in the subsequent steps.
Steps 8.1 to 8.3 form the encoding stage. The encoding stage is combined into one kernel function for execution, which reduces the synchronization overhead between thread blocks, the kernel launch overhead and the number of global memory accesses; the intermediate encoding information is placed in the GPU high-speed on-chip memory, which further reduces global memory accesses and improves the encoding speed.
The check matrix H of step 8 includes the following. The check matrix H is obtained by expanding the base matrix H_b by a factor of Zc; the offset indicated by a row element of the base matrix H_b corresponds to the cyclic offset of one Zc × Zc identity block of the H matrix. The check matrix H can be represented by formula (2), where the subscript Zc denotes the Zc-fold expansion of the elements of the base matrix H_b:

H = (H_b)_Zc = [ A_Zc  B_Zc  0    ]
               [ C_Zc  D_Zc  I_Zc ]          (2)
The LDPC encoding of step 8 includes: the LDPC codeword can be divided into three parts, c = [s p_a p_c], where p_a and p_c correspond respectively to the sub-matrix B and the sub-matrix I; the encoding of the LDPC code is performed using the equation H·c^T = 0.
The thread grouping in step 8.1 specifically comprises the following. The GPU divides the 256 threads of each block into 16 groups; since Zc is at most 384 and each subgroup therefore contains at most 384 bits of data, every 16 threads form a cooperative group, and each thread is responsible for the operation and calculation of 32 bits of data. When processing the multi-code-type LDPC coding, the number of threads actually participating in the operation within a cooperative group is ceil(Zc/32). Each cooperative group is responsible for the multiplication of the bits corresponding to all cyclic blocks of one layer of the check matrix H by the check matrix; the solution of each layer finally yields one group of results. The 256 threads support simultaneously solving the multiplication of at most 16 layers of the check matrix H, computing 16 groups of results each time, after which all threads solve the following 16 layers, until all m_b groups of results have been solved.
The multiplication of the bits of the information bit subgroups s_i by the check matrix in step 8.1 specifically includes the following. The multiplication result of layer j is denoted λ_j, and s^T denotes the transpose of s. The matrices A_Zc and C_Zc are composed of cyclic shift blocks, so the multiplication of A_Zc by s^T and of C_Zc by s^T can be expressed as

λ_j = s_1^(h_{j,1}) ⊕ s_2^(h_{j,2}) ⊕ … ⊕ s_{k_b}^(h_{j,k_b}),

where the multiplication result of A_Zc by s^T is [λ_1, …, λ_4] and the multiplication result of C_Zc by s^T is [λ_5, …, λ_{m_b}]. Here i indicates the position of the information subgroup and also the column index of the cyclic shift block in the base matrix (i = 1, 2, …, k_b), j indicates the row index of the cyclic shift block in the base matrix (j = 1, 2, …, m_b), ⊕ denotes modulo-two addition, and h_{j,i} is the entry in row j and column i of the base matrix, which is used to judge whether the cyclic shift block is 0: if it is 0, then s_i^(h_{j,i}) = 0; if it is not 0, then s_i^(h_{j,i}) is the information bit subgroup s_i corresponding to the non-zero sub-block. When the cyclic shift block is not 0, it has a cyclic shift amount h_{j,i}, which represents the magnitude of the cyclic shift that the information bit subgroup s_i should undergo; s_i^(h_{j,i}) therefore denotes the result of cyclically shifting s_i by h_{j,i}.
The efficient shift of step 8.1 comprises the following. Each layer is traversed in turn, and the actively working threads in the cooperative group read the base matrix offset H_shift; the index of each thread is tid. When remainder ≥ 0, each actively working thread in the cooperative group reads two 32-bit words of information data from the high-speed on-chip cache region, namely the two information elements E_lid and E_hid at positions lid and hid, where lid is determined from the offset H_shift, lid = (tid + floor(H_shift/32)) % M, and hid is determined from lid, hid = (lid + 1) % M. If remainder = 0, the threads of the cooperative group directly merge the two information elements into the 64-bit word [E_lid | E_hid], shift it left by H_shift % 32 bits by instruction, and keep the left 32 bits as the output. If remainder > 0, the threads of the cooperative group split into two operations: for the first part of the threads, hid % M = 0, meaning that the register of lid is the last of the M registers while hid is the first register, so the two are not contiguous in space; the element at position hid must first be shifted left by 32 − remainder bits to form a new hid position element, and then the two information elements are merged into the 64-bit word [E_lid | E_hid]; for the second part of the threads, hid % M > 0, meaning that lid and hid are contiguous, and the calculation process is the same as in the case remainder = 0, forming [E_lid | E_hid]; finally, [E_lid | E_hid] is shifted left by H_shift % 32 bits by instruction and the left 32 bits are kept as the output. When remainder < 0, each actively working thread in the cooperative group reads one 32-bit word of information data from the high-speed on-chip cache region, namely the information element E_id at position id; when Zc ≤ 16 it is directly shifted left by H_shift % 32 bits by instruction, and when Zc > 16 the information element E_id must first be cyclically shifted right by |remainder| bits to obtain E_id', the two information elements are merged into the 64-bit word [E_id | E_id'], which is shifted left by H_shift % 32 bits by instruction, keeping the left 32 bits as the output result.
In step 8.2, listing the corresponding equation set according to the type B_type of the check matrix H comprises the following. Denote by p_a,z (z = 1, 2, 3, 4) the check subgroups of the first part of the parity bits p_a, and by p_a,1^(k) the result of cyclically shifting p_a,1 (to the left) by k bits, where the shift amount k is determined by B_type. Because of the structure of the sub-matrix B, adding the first 4 layer equations modulo two cancels p_a,2, p_a,3 and p_a,4 and yields

p_a,1^(k) = λ_1 ⊕ λ_2 ⊕ λ_3 ⊕ λ_4,

i.e. the modulo-two addition of the first 4 groups of results cached in step 8.1. Once the check subgroup p_a,1 is known, the remaining check subgroups p_a,2, p_a,3 and p_a,4 are obtained from the equation system by back-substituting p_a,1 (shifted as prescribed by B_type) and the cached λ_z into the individual layer equations.
The calculation of the multiplication of the parity bits p_a of step 8.3 by H includes: the second part of the parity check bits p_c is calculated from the p_a obtained in step 8.2; using the efficient shift operation of step 8.1, p_a is multiplied by the corresponding positions of the check matrix, giving the groups of D_Zc·p_a^T. These groups and the cached results λ_5, …, λ_{m_b} of rows 5 to m_b are added at the corresponding positions, and the sum is the calculation result of the second part of the parity check bits p_c.
The compression of the valid bits in step 9 includes the following. The encoded 5G LDPC code is a systematic code and the codeword obtained after encoding the information bits is long, so the GPU device would need a large amount of time to transmit the full codeword back to the host; therefore, with the encoded-information compression method, only the encoded check bit part of the high-speed on-chip cache is transferred to the global memory, which saves a large amount of transfer time. When remainder is 0, Zc is an integer multiple of 32 and the check bits cached on chip are copied contiguously to the global memory. When remainder is not 0, Zc is not an integer multiple of 32; the bits padded after the Zc bits of each check subgroup must be deleted so that each subgroup keeps only its first Zc check bits, and all bits are then shifted and merged to restore a tight bit arrangement. The tightly arranged check information bits are then packed into groups of 32 bits, the tail that does not fill a full 32 bits is padded with 0, and the packed data is written to the position indicated by dst_offset.
Compared with the prior art, the invention has the advantages and positive effects that:
the invention flexibly divides corresponding computing resources for each code block at a software logic level, supports the simultaneous encoding of the code blocks with different code length and code rate, can effectively reduce the data transmission overhead between a host and equipment in the process of encoding large-scale LDPC code blocks, and improves the flexibility and the practical value of the encoder.
The method fully combines the characteristics of the LDPC coding algorithm and the architecture characteristics of the GPU, fully utilizes GPU resources, improves the access efficiency and the utilization rate of a data computing unit, increases the instruction throughput of operation in a bit packing mode, reduces the resource consumption of a single code block, improves the coding parallelism of the single code block and improves the overall information throughput.
Drawings
FIG. 1 is a schematic diagram of the structure of a GPU-based LDPC high-speed encoder of the present invention.
FIG. 2 is a flow chart of the LDPC high-speed encoding method based on the GPU of the present invention.
Fig. 3 is a schematic diagram of an information compression arrangement according to the present invention.
Fig. 4 is a schematic diagram of bit packing according to the present invention.
FIG. 5 is a schematic diagram of bit-cycling padding according to the present invention.
FIG. 6 is a diagram illustrating bit efficient shifting according to the present invention.
FIG. 7 is a diagram illustrating the compression of the valid code bits according to the present invention.
Fig. 8 is a schematic diagram of splicing compressed parity bits and information bits according to the present invention.
Detailed Description
The invention is explained in detail below with reference to fig. 1 to 8 and the exemplary embodiments.
The invention provides a 5G LDPC high-speed encoder based on a GPU. FIG. 1 shows a schematic diagram of the high-speed encoder, in which N_sm denotes the number of streaming multiprocessors on the GPU chip. The encoder structure comprises a host end and a GPU device end. The host end is provided with a host memory and a CPU chip; the CPU chip is used for preprocessing the coding information, combining codewords, controlling and scheduling the whole coding process, and sending the information stream to be coded to the GPU device end. The GPU device end is provided with a GPU chip composed of a number of streaming multiprocessors (SM); each SM is responsible for encoding a group of LDPC code blocks of different code types. The logic units of each SM include a global memory, a constant memory, a high-speed on-chip memory, a register memory and the like; the high-speed on-chip memory stores the intermediate calculation information of the encoding process, which reduces the latency of the layer-by-layer updating, access and storage of information, while the register memory is used for storing temporary variables such as intermediate quantities generated during calculation and memory access. The large number of threads started in the processors encode the information streams in parallel at high speed. The host end and the GPU device end transfer data through the high-speed serial computer expansion bus (PCI-E).
The invention also provides a coding method of the GPU-based 5G LDPC encoder. The whole coding process can be divided into 11 steps; a flow chart of the method is shown in figure 2, and the specific coding steps are as follows:
Step 1: initialize the storage space of the host end.
Allocate enough storage space at the host end for the information bits and the coded bits according to the maximum number of code blocks num_C processed simultaneously;
Step 2: initialize the storage space of the GPU device end.
The GPU device end configures its memory space, allocates enough global memory space for the information bits and the coded bits, and allocates high-speed on-chip memory according to the maximum resources occupied by the current coding;
Step 3: initialize the LDPC check matrix information at the GPU device end.
At the GPU device end, all LDPC base matrices H_b specified by the 5G protocol are pre-stored, and the offset information H_shift of the base matrices and the column positions H_offset of the cyclic blocks in each row are written into the GPU global memory. The base matrix H_b comprises the sub-matrices A, B, C, D and I.
Step 4: the host copies the coding indication information and the data information to the GPU.
The host end writes the coding information of each code block into a structure comprising the base graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the number of coding layers m_b and the correction value remainder. Information data and coded bits are transferred between the host end and the device end in a compressed arrangement, so the structure further contains the starting position src_offset of the information bits of each code block and the starting position dst_offset at which the bits are stored after coding is completed. The coding indication structures of multiple code blocks form an array of coding indication structures, and the host end transfers the whole array of indication structures for the code blocks waiting for parallel processing to the GPU global memory. The host end packs the information bits of each code block into 32-bit words, arranges the packed bits of all code blocks in compressed form, and copies the information bit data to be coded to the GPU global memory.
Step 5: the host end calls a GPU function to perform LDPC coding.
The number of blocks into which the GPU virtual processors are grouped equals the maximum code block number num_C, the block dimension is one-dimensional, and the multiple blocks process the LDPC codeword coding of different code types in parallel. The host end sets the number of threads started per block to 256, and the thread dimension is one-dimensional.
Step 6: the GPU preprocesses the received information bits according to the coding indication information.
The threads of each block of the GPU obtain the coding indication information of the corresponding code block from the corresponding entry of the structure array, comprising the base graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the correction value remainder, the starting position src_offset of the information bits and the starting position dst_offset for storing the bits after coding is completed. Then, according to the starting position src_offset of the information bits, the number of information bit groups k_b and the lifting value Zc, the information bits s are obtained: starting from the position indicated by src_offset, a space of ceil(Zc*k_b/32)*4 bytes contains the information bits of the code block, where ceil(*) denotes rounding up. s corresponds to the sub-matrix A, i.e. the information bit part received by the encoder; it is divided into k_b information bit subgroups s_i (i = 1, 2, …, k_b), each subgroup corresponding to Zc bits. The GPU cyclically fills each information bit subgroup s_i of the information bits s up to a multiple of 32 according to the correction value remainder, and each information bit subgroup corresponds to one cyclic block of the base matrix.
Step 7: the GPU transfers the processed data information to the high-speed on-chip memory.
Step 8: the GPU encodes the LDPC code in parallel. The encoding of the LDPC code is performed using the check matrix H, which is solved via the equation H·c^T = 0 to obtain the codeword c. The encoding stage comprises the following 3 steps:
Step 8.1: group the threads, establish the mapping relationship between the threads and the information processing, and compute with efficient shift operations the result of multiplying the bits of the k_b information bit subgroups s_i by the check matrix; for each layer j of the check matrix the result is

λ_j = s_1^(h_{j,1}) ⊕ s_2^(h_{j,2}) ⊕ … ⊕ s_{k_b}^(h_{j,k_b}),  j = 1, 2, …, m_b,

where ⊕ denotes modulo-two addition, h_{j,i} is the offset of the cyclic block in row j and column i of the base matrix, and s_i^(h_{j,i}) denotes s_i cyclically left-shifted by h_{j,i} bits (taken as the all-zero group when that cyclic block is zero). The m_b computed group elements λ_1, …, λ_{m_b} are cached in the high-speed on-chip memory in preparation for the calculation of the parity bits in the later steps.
Step 8.2: calculate the first part of the parity check bits p_a according to the coding indication information.
The first part of the parity check bits p_a comprises 4 groups of check bits p_a,z (z = 1, 2, 3, 4), so the first 4 layers of equations in the check matrix H need to be solved. The corresponding equation set is listed according to the type B_type of the check matrix H: the information bit subgroups s_i of step 8.1 are multiplied by the check matrix, the equation set is solved from the multiplication results, the first 4 groups of the vector results λ_1, …, λ_4 cached in the high-speed on-chip memory are read, and the equations are combined to obtain the first part of the parity check bits p_a.
Step 8.3: calculate the second part of the parity check bits p_c from the p_a of step 8.2.
In the manner of step 8.1, the result of multiplying the first part of the parity bits p_a by the check matrix H is computed and added modulo two, group by corresponding group, to the cached results λ_5, …, λ_{m_b}. Since the part of the check matrix corresponding to the second part of the parity bits p_c is the sub-matrix I, the addition result is exactly the calculation result of the second part of the parity check bits p_c.
Step 9: the GPU compresses the encoded parity check bits. The calculated parity check bits p comprise m_b groups in total; from each group the redundant padding bits are deleted according to the correction value remainder to obtain the Zc valid bits of each group, the valid bits are compressed, and the compressed check bit result is written into the global memory at the bit storage starting position dst_offset after coding is completed.
Step 10: transmit the compressed information in the global memory back to the host end.
Step 11: the host end splices the received compressed check information and the information bits to form the complete LDPC coded information. As shown in fig. 8, the host end sends the information bits to the GPU end through the PCI-E bus, the GPU end performs high-speed encoding and returns only the check bits to the host end through the PCI-E bus, and the host end then splices the information bits and the check bits into the complete codeword.
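A host-side illustration of the splicing of step 11, assuming both parts are kept as packed, word-aligned 32-bit data (the helper name and layout are assumptions of this sketch):

```cuda
// Hypothetical host-end splice (step 11): codeword = [information bits | check bits].
#include <cstring>
#include <cstdint>

void splice_codeword(const uint32_t *info_bits, size_t info_words,
                     const uint32_t *check_bits, size_t check_words,
                     uint32_t *codeword) {
    std::memcpy(codeword, info_bits, info_words * sizeof(uint32_t));
    std::memcpy(codeword + info_words, check_bits, check_words * sizeof(uint32_t));
}
```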
The calculation of the correction value remainder in step 4 specifically comprises the following operations: different correction values remainder are set according to the magnitude of the lifting value Zc, and the value of remainder is determined from the remainder of Zc divided by 32. When Zc ≥ 32, remainder = Zc % 32; when Zc < 32, remainder = −(Zc % 32). remainder = 0 means that the information bits can be packed exactly into a set of unsigned 32-bit integer variables. For example, for Zc = 80 ≥ 32, remainder = Zc % 32 = 16.
The specific operation of the compressed arrangement in step 4 is as follows. The PCI-E transmission speed is limited; if a fixed step length were used when coding a group of different LDPC code blocks, LDPC codes with smaller lifting values Zc would have to be padded with a large amount of redundant 0 data during encoding, resulting in a long transmission time. Data is therefore transmitted in a form that compresses the data amount: the host end generates a set of offset information giving the starting position of each code block relative to the starting position of the first code block, which indicates the position of each code block within a group of data; the offset of each code block relative to the starting position of the first code block is the sum of the effective lengths of all preceding code blocks. Compared with a fixed step length, for a lifting value Zc = 2 the transmission can save up to 192 times the amount of transmitted data. As shown in fig. 3, dst_offset_n, P_n and Z_n denote respectively the starting position, the byte size and the lifting value of the n-th code block, with P_n = ceil(m_b*Zc_n/32)*4 bytes. Each code block stores only its own valid bytes P_n; the 0s of the other redundant positions are not stored, and the starting position dst_offset_n of each code block is obtained by adding the byte size P_{n-1} of the previous code block to the starting position dst_offset_{n-1} of the previous code block, i.e. it equals the sum of the effective lengths of all preceding code blocks. This compression mode produces almost no redundant bits and reduces the amount of transmitted data.
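The accumulation of the per-block offsets of fig. 3 can be sketched as follows (a host-side illustration; the container types are assumptions):

```cuda
// Hypothetical computation of the compressed layout of fig. 3:
// dst_offset_n = dst_offset_(n-1) + P_(n-1), with P_n = ceil(m_b * Zc_n / 32) * 4 bytes.
#include <vector>
#include <cstdint>

std::vector<uint32_t> compressed_offsets(const std::vector<int> &m_b,
                                         const std::vector<int> &Zc) {
    std::vector<uint32_t> dst_offset(Zc.size());
    uint32_t pos = 0;                                      // running byte position
    for (size_t n = 0; n < Zc.size(); ++n) {
        dst_offset[n] = pos;                               // start of code block n
        uint32_t P_n = ((m_b[n] * Zc[n] + 31) / 32) * 4;   // valid bytes of block n
        pos += P_n;                                        // no redundant 0s are stored
    }
    return dst_offset;
}
```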
The specific operation of the bit packing in step 4 is as follows: because the information data are bits of value 0 or 1, the information bits are packed into groups of 32 bits. As shown in fig. 4, for a string of information bit data [0 1 0 … 1 1 0 … 0 1 0 … 1], every 32 bits form one group and the bits of each group are placed into exactly one register, so that GPU resources can be fully utilized; each thread in the GPU operates on 32 bits in parallel, which greatly improves the parallel computation efficiency.
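A packing routine along these lines could look as follows (host-side illustration; the least-significant-bit-first ordering within a word is an assumption of the sketch):

```cuda
// Hypothetical bit packing (step 4): 32 information bits (values 0/1) per 32-bit word.
#include <cstdint>
#include <cstddef>

void pack_bits(const uint8_t *bits, size_t n_bits, uint32_t *words) {
    for (size_t i = 0; i < n_bits; ++i) {
        if (i % 32 == 0) words[i / 32] = 0;               // start a fresh word
        words[i / 32] |= (uint32_t)(bits[i] & 1u) << (i % 32);
    }
}
```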
The specific operation of the cyclic filling in step 6 is as follows: according to the size of the correction value remainder, the Zc bits of data in each information bit subgroup s_i are padded. Because a register in the GPU is 32 bits (4 bytes), the data must be padded to an integer multiple of 32 so that the information data are stored with 32-bit words as basic symbols; each s_i therefore occupies M registers, M = ceil(Zc/32), denoted R_t, t = 1, 2, …, M, and the actually valid information data only cover the first Zc bits of the M registers. Specifically, when remainder = 0, Zc is an integer multiple of 32, the valid information of each group fills all M registers, and no data needs to be padded. When remainder > 0, Zc is not an integer multiple of 32 and Zc > 32; s_i does not fill the M registers, and after the Zc bits actually occupied by each s_i, the remaining data of the last register are completed by filling with the header data bits of the same subgroup s_i. When remainder < 0, Zc < 32, each s_i occupies one register, the actually valid information data only cover the first Zc bits, and the Zc bits are cyclically repeated until the 32 bits are full. The purpose of the cyclic filling is to enable efficient operation on the data and the reverse shifts within the information subgroups in the subsequent steps. As shown in fig. 5, for the first information bit subgroup s_1 with Zc = 80 and remainder = 16, the bit indices are [0, 1, 2, 3, …, 79]. Since floor(80/32) = 2 and 80 % 32 = 16, the first two 32-bit symbols are filled normally with the bits of index [0, 1, 2, …, 63]; the third symbol is split into two parts of 16 bits each, its first half is filled with the bits of index [64, 65, …, 79], and its second half is filled with the 16 bits of index [0, 1, …, 15].
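The filling rule can be illustrated with the following device-side sketch, which reproduces the Zc = 80 example above bit by bit (the bit ordering and buffer layout are assumptions):

```cuda
// Hypothetical cyclic fill (step 6): extend the Zc valid bits of a subgroup s_i to
// M*32 bits by wrapping around to the subgroup's first bits, as in the Zc=80 example.
#include <cstdint>

__device__ void cyclic_fill(const uint8_t *s_i /* Zc bits, values 0/1 */,
                            int Zc, uint32_t *R /* M = ceil(Zc/32) words */) {
    const int M = (Zc + 31) / 32;
    for (int b = 0; b < M * 32; ++b) {
        uint32_t bit = s_i[b % Zc] & 1u;          // wrap around after Zc bits
        if (b % 32 == 0) R[b / 32] = 0;
        R[b / 32] |= bit << (b % 32);
    }
}
```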
The LDPC encoding in step 8 specifically operates as follows. The LDPC codeword can be divided into three parts, c = [s p_a p_c], where s corresponds to the sub-matrix A, i.e. the information bits received by the encoder, and p_a and p_c correspond respectively to the sub-matrices B and I. The encoding of the LDPC code is performed using equation (3):

H·c^T = [ A_Zc  B_Zc  0    ]   [ s^T   ]
        [ C_Zc  D_Zc  I_Zc ] · [ p_a^T ] = 0          (3)
                               [ p_c^T ]

Equation (3) then naturally splits into equation (4) and equation (5):

A_Zc·s^T ⊕ B_Zc·p_a^T = 0          (4)

C_Zc·s^T ⊕ D_Zc·p_a^T ⊕ p_c^T = 0          (5)

In equations (4) and (5), T denotes the transpose of a matrix.
In step 8.1, the specific operation of the thread grouping is as follows. The GPU divides the 256 threads of each block into 16 groups; since Zc is at most 384 and each subgroup therefore contains at most 384 bits of data, every 16 threads form a cooperative group, and each thread is responsible for the operation and calculation of 32 bits of data. When processing the multi-code-type LDPC coding, the number of threads actually participating in the operation within a cooperative group is ceil(Zc/32). Each cooperative group is responsible for the multiplication of the bits corresponding to all cyclic shift blocks of one layer of the check matrix H by the check matrix; the solution of each layer finally yields one group of results. The 256 threads support simultaneously solving the multiplication of at most 16 layers of the check matrix H, computing 16 groups of results each time, after which all threads solve the following 16 layers, until all m_b groups of results have been solved.
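The mapping of threads onto cooperative groups described here can be expressed as in the following device-side sketch (variable names are illustrative):

```cuda
// Hypothetical mapping of the 256 threads of a block onto 16 cooperative groups of
// 16 threads (step 8.1); only ceil(Zc/32) threads of each group are active.
#include <cstdint>

__device__ void group_mapping(int Zc, int &group_id, int &tid_in_group, bool &active) {
    group_id     = threadIdx.x / 16;    // which layer of the current batch of 16 layers
    tid_in_group = threadIdx.x % 16;    // which 32-bit word of the Zc-bit subgroup
    int M        = (Zc + 31) / 32;      // words actually needed, at most 384/32 = 12
    active       = (tid_in_group < M);  // the remaining threads of the group stay idle
}
// Layers are then processed in batches of 16: layer = batch * 16 + group_id,
// looping over the batches until all m_b layers have been solved.
```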
In step 8.1, the specific operation of multiplying the information bit subgroups s_i by the check matrix is as follows. The multiplication result of layer j is denoted λ_j, and s^T denotes the transpose of s. The matrices A_Zc and C_Zc are composed of cyclic shift blocks, so the multiplication of A_Zc by s^T and of C_Zc by s^T can be expressed as

λ_j = s_1^(h_{j,1}) ⊕ s_2^(h_{j,2}) ⊕ … ⊕ s_{k_b}^(h_{j,k_b}),

where the multiplication result of A_Zc by s^T is [λ_1, …, λ_4] and the multiplication result of C_Zc by s^T is [λ_5, …, λ_{m_b}]. Here i indicates the position of the information subgroup and also the column index of the cyclic shift block in the base matrix (i = 1, 2, …, k_b), j indicates the row index of the cyclic shift block in the base matrix (j = 1, 2, …, m_b), ⊕ denotes modulo-two addition, and h_{j,i} is the entry in row j and column i of the base matrix, which is used to judge whether the cyclic shift block is 0: if it is 0, then s_i^(h_{j,i}) = 0; if it is not 0, then s_i^(h_{j,i}) is the information bit subgroup s_i corresponding to the non-zero sub-block. When the cyclic shift block is not 0, it has a cyclic shift amount h_{j,i}, which represents the magnitude of the cyclic shift that the information bit subgroup s_i should undergo; s_i^(h_{j,i}) therefore denotes the result of cyclically shifting s_i by h_{j,i}.
The specific operation of the efficient shift of step 8.1 is as follows. Each layer is traversed in turn, and the actively working threads in the cooperative group read the base matrix offset H_shift; the index of each thread is tid. When remainder ≥ 0, each actively working thread in the cooperative group reads two 32-bit words of information data from the high-speed on-chip cache region, namely the two information elements E_lid and E_hid at positions lid and hid, where lid is determined from the offset H_shift, lid = (tid + floor(H_shift/32)) % M, and hid is determined from lid, hid = (lid + 1) % M. If remainder = 0, the threads of the cooperative group directly merge the two information elements into the 64-bit word [E_lid | E_hid], shift it left by H_shift % 32 bits by instruction, and keep the left 32 bits as the output. If remainder > 0, the threads of the cooperative group split into two operations: for the first part of the threads, hid % M = 0, meaning that the register of lid is the last of the M registers while hid is the first register, so the two are not contiguous in space; the element at position hid must first be shifted left by 32 − remainder bits to form a new hid position element, and then the two information elements are merged into the 64-bit word [E_lid | E_hid]; for the second part of the threads, hid % M > 0, meaning that lid and hid are contiguous, and the calculation process is the same as in the case remainder = 0, forming [E_lid | E_hid]; finally, [E_lid | E_hid] is shifted left by H_shift % 32 bits by instruction and the left 32 bits are kept as the output. When remainder < 0, each actively working thread in the cooperative group reads one 32-bit word of information data from the high-speed on-chip cache region, namely the information element E_id at position id; when Zc ≤ 16 it is directly shifted left by H_shift % 32 bits by instruction, and when Zc > 16 the information element E_id must first be cyclically shifted right by |remainder| bits to obtain E_id', the two information elements are merged into the 64-bit word [E_id | E_id'], which is shifted left by H_shift % 32 bits by instruction, keeping the left 32 bits as the output result.
Fig. 6 illustrates the first subgroup s_1 for remainder = 16 and Zc = 80. The first subgroup s_1 has the index sequence [0, 1, 2, …, 79, 0, 1, …, 15] and occupies M = 3 registers, the data being loaded in turn into the registers m_1, m_2 and m_3. The 3 actively working threads tid = 0, 1, 2 of each cooperative group are responsible for reading two 32-bit words of information data from the high-speed on-chip cache region, the two information elements E_lid and E_hid at positions lid and hid, where lid is determined by the offset H_shift (H_shift = 10 in fig. 6), lid = (tid + floor(H_shift/32)) % M, and hid = (lid + 1) % M. Thread 0 therefore loads the data of m_1 and m_2, thread 1 loads m_2 and m_3, and thread 2 loads m_3 and m_1. Thread 2 satisfies hid % M = 0, which means that the register of its lid is the last of the M registers and its hid is the first register; the two are not contiguous in space, so the element at position hid is first shifted left by 16 bits to form a new hid position element, and then the two information elements are merged into the 64-bit word [E_lid | E_hid]. Thread 0 and thread 1 satisfy hid % M > 0, which means that lid and hid are contiguous, and the calculation process is the same as in the case remainder = 0, forming [E_lid | E_hid]. Finally, [E_lid | E_hid] is shifted left by H_shift % 32 = 10 bits by instruction and the left 32 bits are kept as output, giving the output result shown in fig. 6.
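Put together, the remainder ≥ 0 case of this shift can be sketched as below; the helper operates on the cached register words E of one subgroup, and the bit ordering inside a word (most significant bit first, matching the "left" shifts of the text) is an assumption of the sketch.

```cuda
// Hypothetical efficient shift (step 8.1) for remainder >= 0: each active thread
// merges two cached 32-bit words and extracts one 32-bit word of the cyclically
// shifted subgroup s_i^(H_shift).
#include <cstdint>

__device__ uint32_t shifted_word(const uint32_t *E /* M cached words of s_i */,
                                 int M, int remainder, int tid, int H_shift) {
    int lid = (tid + H_shift / 32) % M;   // word holding the wanted bits
    int hid = (lid + 1) % M;              // the following word
    uint32_t e_lid = E[lid];
    uint32_t e_hid = E[hid];
    if (remainder > 0 && hid == 0) {
        // lid is the last and hid the first register: not contiguous in space,
        // so realign the wrapped word first (shift left by 32 - remainder bits).
        e_hid <<= (32 - remainder);
    }
    uint64_t merged = ((uint64_t)e_lid << 32) | e_hid;   // [E_lid | E_hid]
    return (uint32_t)((merged << (H_shift % 32)) >> 32); // keep the left 32 bits
}
```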
In step 8.2, listing the corresponding equation set according to the type B_type of the H matrix comprises the following. Writing equation (4) in block form expands it into the following equation set (6):

λ_z ⊕ B_Zc(z,1)·p_a,1^T ⊕ B_Zc(z,2)·p_a,2^T ⊕ B_Zc(z,3)·p_a,3^T ⊕ B_Zc(z,4)·p_a,4^T = 0,  z = 1, 2, 3, 4,          (6)

where p_a,z denotes the z-th check subgroup of p_a, B_Zc(z,x) denotes the cyclic shift block in row z and column x of B_Zc, and p_a,1^(k) denotes the result of cyclically shifting p_a,1 (to the left) by k bits. Because of the structure of the sub-matrix B, adding all the equations of (6) modulo two cancels p_a,2, p_a,3 and p_a,4 and yields the check subgroup p_a,1 up to a cyclic shift:

p_a,1^(k) = λ_1 ⊕ λ_2 ⊕ λ_3 ⊕ λ_4,

i.e. the modulo-two addition of the first 4 groups of results cached in step 8.1, where the shift amount k is determined by B_type. Once the check subgroup p_a,1 is known, the remaining check subgroups p_a,2, p_a,3 and p_a,4 are obtained from the equation system (6) by back-substituting p_a,1 (shifted as prescribed by B_type) and the cached λ_z into the individual layer equations.
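The first stage of this solution, the accumulation of the shifted p_a,1, could be written as below; the per-group word loop and the abstract cyclic-shift helper mentioned in the comment (built from the efficient shift of step 8.1) are illustrative assumptions.

```cuda
// Hypothetical first stage of step 8.2: the modulo-two sum of the cached results
// lambda_1..lambda_4 gives the check subgroup p_a,1 up to a cyclic shift whose
// amount depends on B_type.
#include <cstdint>

__device__ void accumulate_pa1(const uint32_t *lambda /* 4 layers x M words */,
                               int M, uint32_t *pa1_shifted /* M words */) {
    for (int w = threadIdx.x % 16; w < M; w += 16) {      // words handled by the group
        pa1_shifted[w] = lambda[0 * M + w] ^ lambda[1 * M + w]
                       ^ lambda[2 * M + w] ^ lambda[3 * M + w];
    }
    // p_a,1 is then recovered by undoing the B_type-dependent cyclic shift
    // (e.g. a cyclic_shift_zc helper), and p_a,2..p_a,4 follow by back-substitution
    // into the individual layer equations of set (6).
}
```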
Wherein, the specific operation of the multiplication of p_a by H in step 8.3 is: the second part of parity bits p_c is calculated using equation (5); with the p_a calculated in step 8.2, p_a is multiplied at the corresponding positions by the check matrix through the efficient shift operation of step 8.1, and the result is then added, at the corresponding positions, to the 5-th through m_b-th groups of the multiplication results cached in step 8.1, which gives the calculation result of the second part of parity check bits p_c.
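The step just described can be sketched per block row as follows. This is an illustrative CUDA fragment, not the patent's code: it reuses cyclic_shift_word and MAX_WORDS from the sketches above, assumes lambda_j holds the step-8.1 words of one block row of C·s^T, pa holds the four solved check subgroups as packed words, and d_shift gives the offsets of the four D-entries of this row with a negative value marking a zero block; these names and the data layout are assumptions.

```cuda
#ifndef MAX_WORDS
#define MAX_WORDS 12
#endif

// Sketch of step 8.3 for one block row j (5 <= j <= m_b): pc^(j) equals lambda_j
// XOR the D-row contribution, where the D contribution is formed with the same
// efficient shift as in step 8.1 and the identity sub-matrix I makes the sum
// directly equal to the parity words of pc^(j).
__device__ void compute_pc_row(const unsigned int *lambda_j,        // cached words for this row
                               const unsigned int pa[4][MAX_WORDS], // solved check subgroups of p_a
                               const int d_shift[4],                // offsets of row j of D, <0 for a zero block
                               unsigned int *pc_j,                  // output: packed words of pc^(j)
                               int M, int remainder)
{
    for (int w = threadIdx.x; w < M; w += blockDim.x) {
        unsigned int acc = lambda_j[w];                 // C * s^T contribution from step 8.1
        for (int z = 0; z < 4; ++z)                     // add the D * p_a^T contribution
            if (d_shift[z] >= 0)
                acc ^= cyclic_shift_word(pa[z], w, M, d_shift[z], remainder);
        pc_j[w] = acc;
    }
}
```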
The specific operation of compressing the effective coded bits in step 9 is as follows. The encoded 5GLDPC code is a systematic code, the codeword obtained after encoding the information bits is long, and the GPU device would need to spend a large amount of time transferring the whole codeword back to the host; therefore only the encoded check-bit part in the high-speed on-chip cache is carried to the global memory, using the coded-information compression method, which saves a large amount of transfer time. When remainder is 0, Zc is an integer multiple of 32, and the bits cached on chip are copied contiguously to the global memory; when remainder is not 0, Zc is not an integer multiple of 32, the bits filled after the first Zc bits of each sub-group need to be deleted so that each sub-group contains only its first Zc bits, and all bits are then shifted and merged to restore a tight bit arrangement.
As shown in Fig. 7, Zc = 80 and remainder = 16. Taking the first group as an example, its index is [0, 1, 2, 3, …, 79]. Since floor(80/32) = 2 and 80 % 32 = 16, the first 2 32-bit symbols are carried over normally, covering the indices [0, 1, 2, …, 63]; the 3rd symbol is split into two parts of 16 bits each: its first half is filled with the 16 information bits with indices [64, 65, …, 79], and its second half is filled with the first 16 bits, indices [0, 1, …, 15], of the next group. The tightly arranged check information bits are then packed into groups of 32 bits, the tail that does not fill a complete 32 bits is padded with 0, and the packed data are written to the position indicated by dst_offset.
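A host-style rendering of this repacking is sketched below for clarity; it is not the patent's GPU kernel. It assumes MSB-first packing within each 32-bit word (bit index 0 at the most significant bit, matching the left-shift convention used above) and that every group carries exactly Zc valid bits followed by fill bits; function and variable names are illustrative.

```cuda
#include <cstdint>
#include <vector>

// Strip the fill bits after the first Zc bits of every group and repack the
// remaining bits tightly into 32-bit words, zero-padding the final word.
std::vector<uint32_t> compress_parity(const std::vector<std::vector<uint32_t>> &groups,
                                      int Zc)
{
    std::vector<uint32_t> out;
    uint64_t buf = 0;   // bit accumulator, valid bits kept left-aligned
    int filled = 0;     // number of valid bits currently in 'buf'

    auto push_bits = [&](uint32_t word, int nbits) {    // append the top 'nbits' of 'word'
        buf |= (uint64_t)(word >> (32 - nbits)) << (64 - filled - nbits);
        filled += nbits;
        while (filled >= 32) {                          // flush complete 32-bit words
            out.push_back((uint32_t)(buf >> 32));
            buf <<= 32;
            filled -= 32;
        }
    };

    for (const auto &g : groups) {
        int left = Zc;                                  // keep only the first Zc bits per group
        for (size_t w = 0; w < g.size() && left > 0; ++w) {
            int take = left >= 32 ? 32 : left;
            push_bits(g[w], take);
            left -= take;
        }
    }
    if (filled > 0)                                     // zero-pad the tail word
        out.push_back((uint32_t)(buf >> 32));
    return out;
}
```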

Claims (9)

1. A coding method of a 5GLDPC coder based on a GPU is disclosed, wherein the coder comprises a host end and a GPU equipment end; the method comprises the steps that a host memory and a CPU chip are arranged at a host end, and the CPU chip is used for preprocessing coded information, combining code words, controlling and scheduling the whole coding process and sending an information stream to be coded to a GPU (graphic processing unit) device end; the GPU equipment end is provided with a GPU chip, the GPU chip consists of a plurality of stream multiprocessors SM, each SM is responsible for coding a group of low-density parity check code LDPC code blocks with different code types, and a large number of threads started in the SM carry out high-speed parallel coding on information streams; the host side and the GPU equipment side carry out data transmission through a high-speed serial computer expansion bus PCI-E;
Based on the above-mentioned encoder, it is characterized in that the encoding method comprises the following specific steps:
step 1: initializing a storage space of a host end;
allocating enough storage space for information bits and coded bits at a host end according to the maximum code block number num _ C processed simultaneously;
step 2: initializing a storage space of a GPU (graphics processing unit) device end;
the GPU equipment side configures a memory space, allocates enough global memory space for information bits and coded bits, and allocates memory space on a high-speed chip to the GPU according to the maximum occupied resource of the current coding;
step 3: initializing the LDPC check matrix information at the GPU device end;
the GPU device end pre-stores the information of all the LDPC base matrices H_b specified by the 5G protocol, and writes the offset information H_shift of the base matrix and the position H_offset, within its row, of each cyclic block into the global memory of the GPU; the base matrix H_b comprises the matrices A, B, C, D, I;
step 4: copying the coding indication information and data information from the host to the GPU;
the host end writes the coding information of each code block into a structure, which comprises the basic graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the number of coding layers m_b, and the corrected value remainder; the information data and coded bits are transmitted between the host end and the device end in a compressed arrangement, so the structure further includes the start position src_offset of the information bits of each code block and the bit storage start position dst_offset after encoding; the coding indication structures of multiple code blocks form an array of coding indication structures, and the host end transmits the whole array of indication structures for the code blocks awaiting parallel processing to the GPU global memory; the host end packs the information bits of each code block into 32-bit words, then the packed bits of all code blocks are arranged compactly, and the information bit data to be encoded are copied to the GPU global memory;
step 5: calling a GPU function at the host side to perform the LDPC coding;
the host end sets the number of blocks into which the GPU virtual processors are grouped equal to the maximum code block number num_C, with a one-dimensional block dimension, so that multiple blocks process the LDPC codeword encoding of different code types in parallel; the host end sets the number of threads started per block to 256, with a one-dimensional thread dimension;
step 6: the GPU carries out preprocessing on the received information bits according to the coding indication information;
the threads of each block of the GPU obtain the coding indication information of the corresponding code block from the corresponding structure array, including the basic graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the corrected value remainder, the start position src_offset of the information bits, and the bit storage start position dst_offset after encoding; then, according to the start position src_offset of the information bits, the number of information bit groups k_b and the lifting value Zc, the information bits are acquired: starting from the position indicated by src_offset, ceil(Zc·k_b / 32) × 4 bytes of space contain the information bits of the code block, where ceil(·) denotes rounding up; the bits are divided into k_b information bit subgroups s_i, i = 1, 2, …, k_b, each group corresponding to Zc bits; the GPU cyclically fills each information bit subgroup s_i up to a multiple of 32 according to the corrected value remainder, each subgroup corresponding to one cyclic block of the base matrix;
step 7: the GPU carries the processed data information into the high-speed on-chip memory;
step 8: the GPU performs the LDPC encoding in parallel; the encoding of the LDPC code is performed using the check matrix H by solving the equation H·c^T = 0 to obtain the codeword c, where c^T denotes the transpose of the codeword c; the encoding stage comprises the following 3 steps in total:
Step 8.1: grouping the threads, establishing the mapping relation between threads and information processing, and calculating, by means of the efficient shift operation, the results of multiplying the k_b information bit subgroups s_i at the corresponding bits by the check matrix; the m_b groups of elements of the calculation results are cached in the high-speed on-chip memory in preparation for the parity-check-bit calculations of the following steps;
step 8.2: calculating a first portion of parity bits based on the coding indication informationp a
First part parity check bitsp a Comprising 4 groups of check bits
Figure 164035DEST_PATH_IMAGE003
z=1,2, 3, 4; solving the front 4 layers of equations in the check matrix H, listing the corresponding equation set according to the type B _ type of the check matrix H, and grouping the information bits in the step 8.1s i Multiplying the corresponding bit by check matrix, solving equation set according to the multiplication result, reading the first 4 sets of results of vector result cached in the high-speed on-chip memory, and solving the first part of parity check bitp a
Step 8.3: calculating the second part of parity bits p_c from the p_a of step 8.2;
the result of multiplying the first part of parity bits p_a by the check matrix H is calculated in the manner of step 8.1, and this result is then modulo-2 added, group by group, to the corresponding groups cached in step 8.1; since the check matrix part corresponding to the second part of parity bits p_c is the sub-matrix I, the addition result is directly the second part of parity check bits p_c;
step 9: the GPU compresses the encoded parity check bits; the parity check bits p comprise m_b groups in total, the redundant bits of each group are deleted according to the corrected value remainder to obtain the Zc effective bits of each group, the effective bits are compressed, and the compressed check result is written into the global memory according to the bit storage start position dst_offset;
step 10: transmitting the compression information in the global memory back to the host end;
step 11: and the host end splices the received compression check information and the information bits to form complete LDPC coding information.
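To make the host/device split of claim 1 concrete, the following CUDA host-side sketch outlines steps 4, 5 and 10: copying the packed information bits and the per-code-block indication structures to the device, launching one 256-thread block per code block, and copying the compressed check bits back. The structure layout, kernel name and buffer sizes are illustrative assumptions, not the patent's actual interfaces, and the kernel body (steps 6 to 9) is left as a stub.

```cuda
#include <cuda_runtime.h>

// Illustrative per-code-block indication structure (field set taken from claim 1).
struct CodeBlockInfo {
    int BG, Zc, kb, B_type, mb, remainder;
    int src_offset, dst_offset;
};

// Stub kernel: one thread block (256 threads) encodes one code block.
// Steps 6-9 (preprocessing, on-chip caching, encoding, compression) would go here.
__global__ void ldpc_encode_kernel(const unsigned int *info_bits,
                                   const CodeBlockInfo *cb_info,
                                   unsigned int *check_bits)
{
    // ... device-side encoding as sketched in the description above ...
}

void encode_batch(const unsigned int *h_info, size_t info_words,
                  const CodeBlockInfo *h_cb, int num_C,
                  unsigned int *h_check, size_t check_words)
{
    unsigned int *d_info = nullptr, *d_check = nullptr;
    CodeBlockInfo *d_cb = nullptr;
    cudaMalloc((void **)&d_info,  info_words  * sizeof(unsigned int));
    cudaMalloc((void **)&d_check, check_words * sizeof(unsigned int));
    cudaMalloc((void **)&d_cb,    num_C       * sizeof(CodeBlockInfo));

    // Step 4: copy indication structures and compressed information bits to the GPU.
    cudaMemcpy(d_info, h_info, info_words * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cb,   h_cb,   num_C      * sizeof(CodeBlockInfo), cudaMemcpyHostToDevice);

    // Step 5: num_C blocks of 256 threads, one block per code block.
    ldpc_encode_kernel<<<num_C, 256>>>(d_info, d_cb, d_check);

    // Step 10: copy the compressed check bits back to the host.
    cudaMemcpy(h_check, d_check, check_words * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(d_info);
    cudaFree(d_check);
    cudaFree(d_cb);
}
```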
2. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: in step 3, the LDPC base matrix H_b information includes: the 5G NR standard specifies two base matrices, BG1 and BG2, whose structure is shown in formula (1), with size m_b × n_b, where m_b denotes the number of coding layers, i.e. the number of rows of the base matrix, n_b denotes the number of columns of the base matrix, and n_b = m_b + k_b:
H_b = [ A  B  0
        C  D  I ]        (1)
the 0 region in the upper right of the base matrix H_b indicates that this part is entirely 0, where A has dimension 4 × k_b, B has dimension 4 × 4, C has dimension (m_b − 4) × k_b, D has dimension (m_b − 4) × 4, and I has dimension (m_b − 4) × (m_b − 4); the base matrix H_b contains the offsets H_shift corresponding to all of the cyclic blocks, the position H_offset of each cyclic block within its row is obtained by calculation, and all of these basic information structures are transferred into the GPU memory before encoding.
3. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: in step 4, the check matrix type B_type includes: the sub-matrix B is a 4 × 4 square matrix with 4 different types, and the 5G protocol specifies a fixed B_type corresponding to the BG pattern;
the corrected value remainder includes: different corrected values remainder are set according to the magnitude of the lifting value Zc, the magnitude of the corrected value remainder being determined by the remainder of Zc with respect to 32: when Zc ≥ 32, the corrected value remainder = Zc % 32; when Zc < 32, the corrected value remainder = −Zc % 32; remainder = 0 indicates that the information bits pack exactly into a set of unsigned 32-bit integer variables;
the compressed arrangement includes: the PCI-E transmission speed is limited; if a fixed step length were selected when encoding a group of LDPC code blocks of different code types, the code blocks with smaller lifting values Zc would need to be padded with a large amount of redundant 0 data and the transmission time would be long, so the data are transmitted in a compressed form: the host end generates, for the group of data, the offset information of each code block's start position relative to the start position of the first code block, used to indicate the position of each code block within the group, the offset of each code block relative to the first code block's start position being the sum of the effective lengths of all preceding code blocks;
the bit packing includes: because the information data are bits of 0 or 1, packing the information bits into groups of every 32 bits allows the GPU resources to be fully utilized, with each thread in the GPU handling 32 bits for parallel computation.
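A minimal host-side sketch of this packing is given below. It assumes MSB-first ordering within each 32-bit word (the claim does not fix the bit order) and uses illustrative names; it is not asserted to be the patent's exact routine.

```cuda
#include <cstdint>
#include <vector>

// Pack a stream of 0/1 information bits into 32-bit words, MSB-first, so that each
// GPU thread can later operate on one 32-bit word at a time.
std::vector<uint32_t> pack_bits(const std::vector<uint8_t> &bits)
{
    std::vector<uint32_t> words((bits.size() + 31) / 32, 0u);
    for (size_t i = 0; i < bits.size(); ++i)
        if (bits[i])
            words[i / 32] |= 1u << (31 - (i % 32));   // bit i of the stream -> MSB-first slot
    return words;
}
```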
4. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: in step 6, the cyclic filling comprises: according to the size of the corrected value remainder, the Zc bits of data in each information bit subgroup s_i are filled; since the registers in the GPU are 32 bits wide, the data must be filled up to an integer multiple of 32 bits to form information data stored with 32-bit symbols as the basic unit, so that each subgroup s_i occupies M registers, M = ceil(Zc/32), denoted R_t, t = 1, 2, …, M, the actually valid information data covering only the first Zc bits of the M registers; when remainder = 0, Zc is an integer multiple of 32, the effective information of each group exactly fills all M registers, and no data need to be filled; when remainder > 0, Zc is not an integer multiple of 32 and Zc > 32, s_i does not fill up the M registers, and after the Zc bits actually occupied by each s_i, the data bits at the head of the same group s_i are appended until the data in the last register are complete; when remainder < 0, Zc < 32, each s_i occupies one register, the actually valid information data cover only the first Zc bits, and those Zc bits are cyclically filled until the 32 bits are complete.
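For illustration, the cyclic filling can be sketched on the host as follows, again with MSB-first packing assumed and illustrative names; in the method itself this is performed on the GPU as part of step 6.

```cuda
#include <cstdint>
#include <vector>

// Fill one information subgroup of Zc bits up to M = ceil(Zc/32) packed words by
// cyclically re-reading the subgroup from its head, as described in claim 4
// (this single loop covers both the Zc > 32 and the Zc < 32 cases).
std::vector<uint32_t> cyclic_fill(const std::vector<uint8_t> &subgroup_bits)  // Zc bits of s_i
{
    const int Zc = (int)subgroup_bits.size();
    const int M  = (Zc + 31) / 32;
    std::vector<uint32_t> words(M, 0u);
    for (int pos = 0; pos < 32 * M; ++pos)
        if (subgroup_bits[pos % Zc])                    // wrap back to the head of s_i
            words[pos / 32] |= 1u << (31 - (pos % 32));
    return words;
}
```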
5. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: in step 8, the check matrix H is obtained by expanding the base matrix H_b; the offset indicated by an element of a row of the base matrix H_b corresponds to the offset of a Zc × Zc identity matrix in one layer of the check matrix H; the check matrix H is expressed as formula (2), where the subscript Zc denotes the Zc-fold expansion of the elements of the base matrix H_b:
H = [ A_Zc  B_Zc  0
      C_Zc  D_Zc  I_Zc ]        (2)
the LDPC encoding includes: the LDPC codeword is divided into three parts, c = [s  p_a  p_c], where p_a and p_c correspond respectively to the sub-matrix B and the sub-matrix I; the encoding of the LDPC code is performed using the equation H·c^T = 0.
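As a small illustration of this expansion, the sketch below materializes one Zc × Zc block from a base-matrix entry, using the usual quasi-cyclic convention that a shift value expands to the identity matrix cyclically shifted by that many columns and a null entry expands to the zero block; the dense representation is only for clarity, since the encoder itself never builds H explicitly.

```cuda
#include <vector>

// Expand one base-matrix entry into its Zc x Zc binary block: the identity matrix
// cyclically shifted by 'shift' columns, or the all-zero block for a null entry
// (shift < 0 marks a null entry here; this encoding is an assumption).
std::vector<std::vector<int>> expand_base_entry(int shift, int Zc)
{
    std::vector<std::vector<int>> block(Zc, std::vector<int>(Zc, 0));
    if (shift < 0)
        return block;                       // zero block
    for (int r = 0; r < Zc; ++r)
        block[r][(r + shift) % Zc] = 1;     // row r has its 1 in column (r + shift) mod Zc
    return block;
}
```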
6. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: the thread grouping in step 8.1 specifically includes: the GPU divides the 256 threads of each block into 16 groups; Zc is at most 384, so each subgroup contains at most 384 bits of data, every 16 threads form a cooperation group, and each thread is responsible for the operations on 32 bits of data; when multi-code-type LDPC encoding is processed, the number of threads actually participating in the operation within a cooperation group is ceil(Zc/32); each cooperation group is responsible for solving the multiplication of the corresponding bits of all cyclic shift blocks of one layer of the check matrix H with the check matrix, the solution of each layer finally producing one group of results; the 256 threads support the simultaneous solution of the bit-by-check-matrix multiplications of at most 16 layers of the check matrix H, computing 16 groups of results at a time, after which all threads solve the operations of the next 16 layers, until all m_b groups of results have been solved;
the multiplication of the information bit subgroups s_i in step 8.1 at the corresponding bits by the check matrix specifically includes: the multiplication results are denoted λ_j, s^T denotes the transpose of s, and the matrices A_Zc and C_Zc are composed of cyclic shift blocks, so that the multiplication of A_Zc with s^T and the multiplication of C_Zc with s^T can both be expressed as
λ_j = ⊕_{i=1..k_b} δ(h_{j,i}) · s_i^(h_{j,i}),   j = 1, 2, …, m_b,
where the rows j = 1, …, 4 give the result of multiplying A_Zc with s^T and the rows j = 5, …, m_b give the result of multiplying C_Zc with s^T; i indicates the position of the information subgroup and the column of the cyclic shift block in the base matrix, i = 1, 2, …, k_b; j indicates the row of the cyclic shift block in the base matrix, j = 1, 2, …, m_b; ⊕ denotes modulo-2 addition; h_{j,i} denotes the entry in row j, column i of the base matrix, used to judge whether the corresponding cyclic shift block is 0: if it is 0, δ(h_{j,i}) = 0 and the term is dropped; if it is not 0, δ(h_{j,i}) = 1 and the term s_i^(h_{j,i}) is kept, i.e. the information bit subgroup s_i corresponding to the non-zero sub-block; when the cyclic shift block is not 0, h_{j,i} gives the magnitude of the cyclic shift that the information bit subgroup s_i should undergo, so that s_i^(h_{j,i}) denotes the cyclic shift result of s_i by h_{j,i};
the efficient shifting in step 8.1 comprises: traversing each layer, the effectively working threads in the cooperation group read the offset H_shift of the base matrix, the label of each thread being tid; when remainder ≥ 0, the actively working threads in the cooperation group read two 32-bit data words from the high-speed on-chip cache, namely the two information elements E_lid and E_hid at positions lid and hid, where lid is determined from the offset H_shift as lid = (tid + H_shift) % M and hid is determined from lid as hid = (lid + 1) % M; if remainder = 0, the threads in the cooperation group directly merge the two information elements into the 64-bit value [E_lid | E_hid], shift it left by H_shift % 32 bits with a single instruction, and keep the left 32 bits as the output; if remainder > 0, the threads in the cooperation group fall into two cases: for the first part of the threads, hid % M = 0, the register holding lid is the last of the M registers and hid is the first register, the two registers are not spatially contiguous, the element at position hid must first be shifted left by 32 − remainder bits to form a new hid element, and the two information elements are then merged into the 64-bit value [E_lid | E_hid]; for the second part of the threads, hid % M > 0, meaning that lid and hid are contiguous, and the computation is the same as in the remainder = 0 case, forming [E_lid | E_hid]; finally [E_lid | E_hid] is shifted left by H_shift % 32 bits with a single instruction and the left 32 bits are kept as the output; when remainder < 0, the actively working thread in the cooperation group reads one 32-bit data word from the on-chip cache, namely the information element E_id at position id; when Zc ≤ 16, it is shifted left by H_shift % 32 bits directly with a single instruction; when Zc > 16, the information element E_id must first be cyclically shifted right by remainder bits to obtain E_id′, the two information elements are then merged into the 64-bit value [E_lid | E_hid], which is shifted left by H_shift % 32 bits with a single instruction, keeping the left 32 bits as the output result.
7. The encoding method of a GPU-based 5GLDPC encoder according to claim 6, wherein: in step 8.2, listing the corresponding system of equations according to the check matrix H type B_type includes: p_a^(z) is used to represent the z-th check subgroup of the first part of check bits p_a, z = 1, 2, 3, 4, and each shifted term represents the (left) cyclic shift of the corresponding check subgroup by the offset of the associated cyclic block; adding all of the equations yields the result of one check subgroup, namely the modulo-2 addition of the first 4 groups of the multiplication results cached in step 8.1; once this check subgroup is known, the remaining three check subgroups of p_a are obtained from the system of equations.
8. The encoding method of a GPU-based 5GLDPC encoder according to claim 7, wherein: with the p_a calculated in step 8.2, p_a is multiplied at the corresponding positions by the check matrix through the efficient shift operation of step 8.1, and the result is then added, at the corresponding positions, to the 5-th through m_b-th groups of the multiplication results cached in step 8.1, which gives the result of the second part of parity check bits p_c.
9. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: in step 9, compressing the valid bits comprises: the encoded 5GLDPC code is a systematic code, the codeword obtained after encoding the information bits is long, and the GPU device would need to spend a large amount of time transferring it back to the host end, so only the encoded check-bit part in the high-speed on-chip cache is carried to the global memory, using the coded-information compression method, which saves the transfer time; when remainder is 0, Zc is an integer multiple of 32, and the bits cached on chip are copied contiguously to the global memory; when remainder is not 0, Zc is not an integer multiple of 32, the bits filled after the first Zc bits of each sub-group need to be deleted so that each sub-group contains only its first Zc bits, and all bits are then shifted and merged to restore a tight bit arrangement; the tightly arranged check information bits are then packed into groups of 32 bits, the tail that does not fill a complete 32 bits is padded with 0, and the packed data are written to the position indicated by dst_offset.