CN115118289B - Coding method of 5G LDPC encoder based on GPU - Google Patents

Coding method of 5G LDPC encoder based on GPU

Info

Publication number
CN115118289B
CN115118289B CN202211037856.7A
Authority
CN
China
Prior art keywords
bits
information
gpu
matrix
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211037856.7A
Other languages
Chinese (zh)
Other versions
CN115118289A (en)
Inventor
刘荣科
李岩松
田铠瑞
王若诗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202211037856.7A
Publication of CN115118289A
Application granted
Publication of CN115118289B
Legal status: Active


Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/11 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits using multiple parity bits
    • H03M13/1102 Codes on graphs and decoding on graphs, e.g. low-density parity check [LDPC] codes
    • H03M13/1148 Structural properties of the code parity-check or generator matrix
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781 On-chip cache; Off-chip memory
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/65 Purpose and implementation aspects
    • H03M13/6569 Implementation on processors, e.g. DSPs, or software implementations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00 Arrangements for detecting or preventing errors in the information received
    • H04L1/004 Arrangements for detecting or preventing errors in the information received by using forward error control
    • H04L1/0056 Systems characterized by the type of code used
    • H04L1/0061 Error detection codes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026 PCI express

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Error Detection And Correction (AREA)

Abstract

The invention provides a coding method of a GPU-based 5G LDPC encoder, which comprises the following steps: 1: initializing the storage space of the host end; 2: initializing the storage space of the GPU device end; 3: initializing the LDPC check matrix information at the GPU device end; 4: copying the coding indication information and data information from the host to the GPU; 5: the host end calls a GPU function to perform LDPC coding; 6: the GPU preprocesses the received information bits; 7: the GPU transfers the processed data information to the high-speed on-chip memory; 8: the GPU encodes the LDPC code in parallel; 9: the GPU compresses the calculated parity check bits; 10: the compressed information is transmitted back to the host end; 11: the host end splices the received compressed check information and the information bits to form the complete LDPC coded information. The invention improves the throughput of the encoder and gives the encoder the characteristics of flexibility and high speed.

Description

Coding method of 5G LDPC encoder based on GPU
Technical Field
The invention belongs to the technical field of communication and relates to a coding method of a 5G low-density parity-check (LDPC) encoder based on a graphics processing unit (GPU).
Background
Low-density parity-check (LDPC) codes play an important role in 5G communications and have been selected as the coding scheme of the enhanced mobile broadband (eMBB) data channels, which are encoded with quasi-cyclic LDPC (QC-LDPC) codes. Because 5G QC-LDPC codes have many check matrices and lifting values, LDPC codewords of different forms can be combined, which makes the design of the encoder extremely difficult. Considering the air-interface resource allocation of a 5G mobile communication system, the uplink and downlink rates generally have a corresponding relationship, so the LDPC encoder of a base station must complete a large number of codeword encodings within a specified time slot, and the encoding process faces a severe latency challenge. Designing a 5G LDPC encoder with low delay, high throughput and low complexity is therefore a difficult problem to be solved.
Nguyen et al. propose a new efficient coding method and a high-throughput, low-complexity encoder architecture with significantly reduced chip area and memory consumption. By storing the quantized permutation information of each sub-matrix instead of the entire parity check matrix, the required memory is reduced while high throughput is maintained (see reference [1]: Nguyen T T B, Nguyen Tan T, Lee H. Efficient QC-LDPC Encoder for 5G New Radio [J]. Electronics, 2019, 8(6): 668). Tian et al. studied the parallel design and implementation of QC-LDPC encoders, using a multi-channel parallel structure to obtain multiple parity check bits and thereby significantly reduce the coding delay; the high-parallelism coding algorithm is mapped onto a configurable circuit structure that can support all 5G NR code lengths and code rates (see reference [2]: Tian Y, Bai Y, Liu D. Low-Latency QC-LDPC Encoder Design for 5G NR [J]. Sensors (Basel, Switzerland), 2021, 21(18): 6266). Liao et al. propose an LDPC coding method based on a GPU. LDPC coding with code rates from 1/2 to 8/9 achieves high throughput on a single GPU, and experiments show that GPU-based parallel simulation tasks can achieve a good balance between performance and cost (see reference [3]: S. Liao, Y. Zhan, Z. Shi and L. Yang, A High Throughput and Flexible Rate 5G NR LDPC Encoder on a Single GPU [C]//2021 23rd International Conference on Advanced Communication Technology (ICACT), 2021, pp. -34).
In recent years, experts and researchers have made many attempts at 5G LDPC high-speed encoders based on dedicated hardware platforms (ASIC, FPGA) and on general-purpose processing platforms. An LDPC encoder based on dedicated hardware can achieve lower delay and higher energy efficiency, but considering the encoding requirement of a base station handling a large number of code blocks, a large number of dedicated chips would have to be deployed in batches to meet the high-throughput performance requirement. The development cycle is therefore long, operation and maintenance are difficult, and it is hard to meet the task-diversification requirements of future communication systems. An LDPC encoder designed on a GPU general-purpose processing platform is realized in software, and deployment and parameter configuration can be performed flexibly through upgrades at the program level; however, current GPU-based LDPC coding algorithms do not fully utilize the GPU computing resources, and the parallel processing design of 5G LDPC encoding still needs to be improved.
Disclosure of Invention
The invention provides a coding method of a GPU-based 5G LDPC encoder, which fully utilizes the GPU computing resources to process LDPC codes with large-scale parallelism and thereby improves the throughput of the encoder; it supports parallel processing of LDPC codes of different code types and gives the encoder the characteristics of flexibility and high speed.
The invention first provides a GPU-based 5G LDPC high-speed encoder whose structure mainly comprises a host end and a GPU device end. The host end is provided with a host memory and a CPU chip; the CPU chip is used for preprocessing the coding information, combining codewords, controlling and scheduling the whole coding process, and sending the information stream to be coded to the GPU device end. The GPU device end is provided with a GPU chip composed of a number of streaming multiprocessors (SM); each SM is responsible for encoding a group of LDPC code blocks of different code types, and the large number of threads started in the processors encode the information streams in parallel at high speed. The host end and the GPU device end transfer data through the high-speed serial computer expansion bus (PCI-E).
The invention also provides a coding method of the GPU-based 5G LDPC encoder. The whole coding process can be divided into 11 steps; based on the above encoder, the specific coding steps are as follows:
Step 1: initialize the storage space of the host end.
Allocate enough storage space at the host end for the information bits and the coded bits according to the maximum number of code blocks num_C processed simultaneously;
Step 2: initialize the storage space of the GPU device end.
The GPU device end configures its memory space, allocates enough global memory space for the information bits and the coded bits, and allocates high-speed on-chip memory according to the maximum resources occupied by the current coding;
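As an illustration only, steps 1 and 2 could be realized with the CUDA runtime API roughly as sketched below; the buffer names, the use of pinned host memory and the worst-case bounds (k_b ≤ 22, m_b ≤ 46, Zc ≤ 384, taken from the 5G NR base graphs) are assumptions of this sketch, not requirements of the method.

```cuda
// Hypothetical sketch of steps 1-2: host-end and GPU-device-end buffer allocation.
#include <cuda_runtime.h>
#include <cstdint>

struct EncoderBuffers {
    uint32_t *h_info, *h_parity;   // host-end packed information / check bits
    uint32_t *d_info, *d_parity;   // GPU global-memory copies
};

bool init_buffers(EncoderBuffers &buf, int num_C) {
    // Worst case per code block, packed into 32-bit words (assumed bounds).
    const size_t info_words   = (size_t)num_C * 22 * (384 / 32);
    const size_t parity_words = (size_t)num_C * 46 * (384 / 32);
    // Step 1: pinned host memory (a design choice that speeds up PCI-E transfers).
    if (cudaMallocHost(&buf.h_info,   info_words   * sizeof(uint32_t)) != cudaSuccess) return false;
    if (cudaMallocHost(&buf.h_parity, parity_words * sizeof(uint32_t)) != cudaSuccess) return false;
    // Step 2: global memory on the GPU device end.
    if (cudaMalloc(&buf.d_info,   info_words   * sizeof(uint32_t)) != cudaSuccess) return false;
    if (cudaMalloc(&buf.d_parity, parity_words * sizeof(uint32_t)) != cudaSuccess) return false;
    return true;
}
```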
Step 3: initialize the LDPC check matrix information at the GPU device end.
At the GPU device end, all LDPC base matrices H_b specified by the 5G protocol are pre-stored, and the offset information H_shift of the base matrices and the column positions H_offset of the cyclic blocks in each row are written into the global memory of the GPU. The base matrix H_b comprises the sub-matrices A, B, C, D and I.
Step 4: the host copies the coding indication information and the data information to the GPU.
The host end writes the coding information of each code block into a structure comprising the base graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the number of coding layers m_b and the correction value remainder. Information data and coded bits are transferred between the host end and the device end in a compressed arrangement, so the structure further contains the starting position src_offset of the information bits of each code block and the starting position dst_offset at which the bits are stored after coding is completed. The coding indication structures of multiple code blocks form an array of coding indication structures, and the host end transfers the whole array of indication structures for the code blocks waiting for parallel processing to the GPU global memory. The host end packs the information bits of each code block into 32-bit words, arranges the packed bits of all code blocks in compressed form, and copies the information bit data to be coded to the GPU global memory.
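A possible layout of the coding indication structure described in this step is sketched below; the field names mirror the quantities listed above, while the concrete types, the field order and the copy call are assumptions of the sketch.

```cuda
// Hypothetical coding-indication structure, one entry per code block (step 4).
#include <cstdint>

struct CodeBlockInfo {
    uint8_t  BG;          // base graph: 1 or 2
    uint16_t Zc;          // lifting value
    uint8_t  k_b;         // number of information bit groups
    uint8_t  B_type;      // check matrix (sub-matrix B) type
    uint8_t  m_b;         // number of coding layers
    int8_t   remainder;   // correction value: Zc % 32 if Zc >= 32, -(Zc % 32) if Zc < 32
    uint32_t src_offset;  // start of this block's packed information bits
    uint32_t dst_offset;  // start for this block's packed check bits after coding
};
// The host fills CodeBlockInfo info[num_C] and copies the whole array to the GPU
// global memory, e.g. cudaMemcpy(d_info, info, num_C * sizeof(CodeBlockInfo),
//                                cudaMemcpyHostToDevice);
```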
Step 5: the host end calls a GPU function to perform LDPC coding.
The number of blocks into which the GPU virtual processors are grouped equals the maximum code block number num_C, the block dimension is one-dimensional, and the multiple blocks process the LDPC codeword coding of different code types in parallel. The host end sets the number of threads started per block to 256, and the thread dimension is one-dimensional.
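The launch configuration of step 5 might look as follows; ldpc_encode_kernel and its parameter list are placeholders for the encoding kernel whose behaviour is described in steps 6 to 9, and the CodeBlockInfo type is the hypothetical structure sketched under step 4.

```cuda
// Hypothetical launch of the encoding kernel (step 5): one block per code block,
// 256 threads per block, both dimensions one-dimensional.
__global__ void ldpc_encode_kernel(const CodeBlockInfo *info,
                                   const uint32_t *info_bits,
                                   uint32_t *parity_bits);

void launch_encoder(const CodeBlockInfo *d_info, const uint32_t *d_info_bits,
                    uint32_t *d_parity_bits, int num_C) {
    dim3 grid(num_C);   // block count equals the maximum code block number num_C
    dim3 block(256);    // 256 threads, i.e. 16 cooperative groups of 16 threads (step 8.1)
    ldpc_encode_kernel<<<grid, block>>>(d_info, d_info_bits, d_parity_bits);
}
```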
Step 6: the GPU preprocesses the received information bits according to the coding indication information.
The threads of each block of the GPU obtain the coding indication information of the corresponding code block from the corresponding entry of the structure array, comprising the base graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the correction value remainder, the starting position src_offset of the information bits and the starting position dst_offset for storing the bits after coding is completed. Then, according to the starting position src_offset of the information bits, the number of information bit groups k_b and the lifting value Zc, the information bits s are obtained: starting from the position indicated by src_offset, a space of ceil(Zc*k_b/32)*4 bytes contains the information bits of the code block, where ceil(*) denotes rounding up. s corresponds to the systematic sub-matrix A, i.e. the information bits received by the encoder; it is divided into k_b information bit subgroups s_i (i = 1, 2, …, k_b), each subgroup corresponding to Zc bits. The GPU cyclically fills each information bit subgroup s_i of the information bits s up to a multiple of 32 according to the correction value remainder, and each subgroup corresponds to one cyclic block of the base matrix.
Step 7: the GPU transfers the processed data information to the high-speed on-chip memory.
Step 8: the GPU encodes the LDPC code in parallel. The encoding of the LDPC code is performed using the check matrix H by solving the equation H·c^T = 0 for the codeword c, where c^T denotes the transpose of the codeword c. The encoding stage comprises the following 3 steps:
Step 8.1: group the threads, establish the mapping relationship between the threads and the information processing, and compute with efficient shift operations the result of multiplying the bits of the k_b information bit subgroups s_i by the check matrix; for each layer j of the check matrix the result is

λ_j = s_1^(h_{j,1}) ⊕ s_2^(h_{j,2}) ⊕ … ⊕ s_{k_b}^(h_{j,k_b}),  j = 1, 2, …, m_b,

where ⊕ denotes modulo-two addition, h_{j,i} is the offset of the cyclic block in row j and column i of the base matrix, and s_i^(h_{j,i}) denotes s_i cyclically left-shifted by h_{j,i} bits (taken as the all-zero group when that cyclic block is zero). The m_b computed group elements λ_1, …, λ_{m_b} are cached in the high-speed on-chip memory in preparation for the calculation of the parity bits in the following steps.
Step 8.2: calculate the first part of the parity check bits p_a according to the coding indication information.
The first part of the parity check bits p_a comprises 4 groups of check bits p_a,z (z = 1, 2, 3, 4), so the first 4 layers of equations in the check matrix H need to be solved. The corresponding equation set is listed according to the type B_type of the check matrix H: the information bit subgroups s_i of step 8.1 are multiplied by the check matrix, the equation set is solved from the multiplication results, the first 4 groups of cached results λ_1, …, λ_4 are read from the high-speed on-chip memory, and the equations are combined to obtain the first part of the check bits p_a.
Step 8.3: calculate the second part of the parity check bits p_c from the p_a of step 8.2.
In the manner of step 8.1, the result of multiplying the first part of the parity bits p_a by the check matrix H is computed and added modulo two, group by corresponding group, to the cached results λ_5, …, λ_{m_b}. Since the part of the check matrix corresponding to the second part of the parity bits p_c is the sub-matrix I, the addition result is exactly the calculation result of the second part of the parity check bits p_c.
Step 9: the GPU compresses the encoded parity check bits. The calculated parity check bits p comprise m_b groups in total; from each group the redundant padding bits are deleted according to the correction value remainder so that each group keeps its Zc valid bits, the valid bits are compressed, and the compressed check bit result is written into the global memory at the bit storage starting position dst_offset.
Step 10: transmit the compressed information in the global memory back to the host end.
Step 11: the host end splices the received compressed check information and the information bits to form the complete LDPC coded information.
The LDPC base matrix H_b information of step 3 includes the following. The 5G NR standard specifies two base matrices, BG1 and BG2, whose structure is shown in formula (1) and whose size is m_b × n_b, where m_b denotes the number of coding layers, i.e. the number of rows of the base matrix, n_b denotes the number of columns of the base matrix, and n_b = m_b + k_b.

H_b = [ A  B  0 ]
      [ C  D  I ]          (1)

The 0 region at the upper right of the base matrix H_b indicates that this part is all zeros; A has dimension 4 × k_b, B has dimension 4 × 4, C has dimension (m_b − 4) × k_b, D has dimension (m_b − 4) × 4, and I has dimension (m_b − 4) × (m_b − 4). The base matrix H_b contains the offsets H_shift corresponding to all cyclic blocks, the column positions H_offset of the cyclic blocks in each row are obtained by calculation, and all this basic information is transferred into the GPU memory before encoding.
The check matrix type B_type of step 4 includes: the 4 × 4 sub-matrix B has 4 different types, and the 5G protocol specifies a fixed B_type for each base graph BG.
The correction value remainder of step 4 includes: different correction values remainder are set according to the magnitude of the lifting value Zc, and the value of remainder is determined from the remainder of Zc divided by 32. When Zc ≥ 32, remainder = Zc % 32; when Zc < 32, remainder = −(Zc % 32). remainder = 0 means that the information bits can be packed exactly into a set of unsigned 32-bit integer variables.
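For reference, this rule can be written as a one-line helper (the function name is illustrative):

```cuda
// Correction value as defined above: Zc % 32 for Zc >= 32, -(Zc % 32) for Zc < 32.
inline int correction_remainder(int Zc) {
    return (Zc >= 32) ? (Zc % 32) : -(Zc % 32);
}
```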
The compressed arrangement of step 4 includes the following. The PCI-E transmission speed is limited; if a fixed step length were used when coding a group of different LDPC code blocks, LDPC codes with smaller lifting values Zc would have to be padded with a large amount of redundant 0 data during encoding, resulting in a long transmission time. Data is therefore transmitted in a form that compresses the data amount: the host end generates a set of offset information giving the starting position of each code block relative to the starting position of the first code block, which indicates the position of each code block within a group of data; the offset of each code block relative to the starting position of the first code block is the sum of the effective lengths of all preceding code blocks. Compared with a fixed step length, for a lifting value Zc = 2 the transmission can save up to 192 times the amount of transmitted data.
The bit packing of step 4 includes: because the information data are bits of value 0 or 1, the information bits are packed into groups of 32 bits, so that GPU resources can be fully utilized; each thread in the GPU operates on 32 bits in parallel, which greatly improves the parallel computation efficiency.
The cyclic filling of step 6 includes: according to the size of the correction value remainder, the Zc bits of data in each information bit subgroup s_i are padded. Because a register in the GPU is 32 bits (4 bytes), the data must be padded to an integer multiple of 32 so that the information data are stored with 32-bit words as basic symbols; each s_i therefore occupies M registers, M = ceil(Zc/32), denoted R_t, t = 1, 2, …, M, and the actually useful information data only cover the first Zc bits of the M registers. Specifically, when remainder = 0, Zc is an integer multiple of 32, the valid information of each group fills all M registers, and no data needs to be padded. When remainder > 0, Zc is not an integer multiple of 32 and Zc > 32; s_i does not fill the M registers, and after the Zc bits actually occupied by each s_i, the remaining data of the last register are completed by filling with the header data bits of the same subgroup s_i. When remainder < 0, Zc < 32, each s_i occupies one register, the actually valid information data only cover the first Zc bits, and the Zc bits are cyclically repeated until the 32 bits are full. The purpose of the cyclic filling is to enable efficient operation on the data and the reverse shifts within the information subgroups in the subsequent steps.
Steps 8.1 to 8.3 form the encoding stage. The encoding stage is combined into one kernel function for execution, which reduces the synchronization overhead between thread blocks, the kernel launch overhead and the number of global memory accesses; the intermediate encoding information is placed in the GPU high-speed on-chip memory, which further reduces global memory accesses and improves the encoding speed.
The check matrix H of step 8 includes the following. The check matrix H is obtained by expanding the base matrix H_b by a factor of Zc; the offset indicated by a row element of the base matrix H_b corresponds to the cyclic offset of one Zc × Zc identity block of the H matrix. The check matrix H can be represented by formula (2), where the subscript Zc denotes the Zc-fold expansion of the elements of the base matrix H_b:

H = (H_b)_Zc = [ A_Zc  B_Zc  0    ]
               [ C_Zc  D_Zc  I_Zc ]          (2)
The LDPC encoding of step 8 includes: the LDPC codeword can be divided into three parts, c = [s p_a p_c], where p_a and p_c correspond respectively to the sub-matrix B and the sub-matrix I; the encoding of the LDPC code is performed using the equation H·c^T = 0.
The thread grouping in step 8.1 specifically comprises the following. The GPU divides the 256 threads of each block into 16 groups; since Zc is at most 384 and each subgroup therefore contains at most 384 bits of data, every 16 threads form a cooperative group, and each thread is responsible for the operation and calculation of 32 bits of data. When processing the multi-code-type LDPC coding, the number of threads actually participating in the operation within a cooperative group is ceil(Zc/32). Each cooperative group is responsible for the multiplication of the bits corresponding to all cyclic blocks of one layer of the check matrix H by the check matrix; the solution of each layer finally yields one group of results. The 256 threads support simultaneously solving the multiplication of at most 16 layers of the check matrix H, computing 16 groups of results each time, after which all threads solve the following 16 layers, until all m_b groups of results have been solved.
The multiplication of the bits of the information bit subgroups s_i by the check matrix in step 8.1 specifically includes the following. The multiplication result of layer j is denoted λ_j, and s^T denotes the transpose of s. The matrices A_Zc and C_Zc are composed of cyclic shift blocks, so the multiplication of A_Zc by s^T and of C_Zc by s^T can be expressed as

λ_j = s_1^(h_{j,1}) ⊕ s_2^(h_{j,2}) ⊕ … ⊕ s_{k_b}^(h_{j,k_b}),

where the multiplication result of A_Zc by s^T is [λ_1, …, λ_4] and the multiplication result of C_Zc by s^T is [λ_5, …, λ_{m_b}]. Here i indicates the position of the information subgroup and also the column index of the cyclic shift block in the base matrix (i = 1, 2, …, k_b), j indicates the row index of the cyclic shift block in the base matrix (j = 1, 2, …, m_b), ⊕ denotes modulo-two addition, and h_{j,i} is the entry in row j and column i of the base matrix, which is used to judge whether the cyclic shift block is 0: if it is 0, then s_i^(h_{j,i}) = 0; if it is not 0, then s_i^(h_{j,i}) is the information bit subgroup s_i corresponding to the non-zero sub-block. When the cyclic shift block is not 0, it has a cyclic shift amount h_{j,i}, which represents the magnitude of the cyclic shift that the information bit subgroup s_i should undergo; s_i^(h_{j,i}) therefore denotes the result of cyclically shifting s_i by h_{j,i}.
The efficient shift of step 8.1 comprises the following. Each layer is traversed in turn, and the actively working threads in the cooperative group read the base matrix offset H_shift; the index of each thread is tid. When remainder ≥ 0, each actively working thread in the cooperative group reads two 32-bit words of information data from the high-speed on-chip cache region, namely the two information elements E_lid and E_hid at positions lid and hid, where lid is determined from the offset H_shift, lid = (tid + floor(H_shift/32)) % M, and hid is determined from lid, hid = (lid + 1) % M. If remainder = 0, the threads of the cooperative group directly merge the two information elements into the 64-bit word [E_lid | E_hid], shift it left by H_shift % 32 bits by instruction, and keep the left 32 bits as the output. If remainder > 0, the threads of the cooperative group split into two operations: for the first part of the threads, hid % M = 0, meaning that the register of lid is the last of the M registers while hid is the first register, so the two are not contiguous in space; the element at position hid must first be shifted left by 32 − remainder bits to form a new hid position element, and then the two information elements are merged into the 64-bit word [E_lid | E_hid]; for the second part of the threads, hid % M > 0, meaning that lid and hid are contiguous, and the calculation process is the same as in the case remainder = 0, forming [E_lid | E_hid]; finally, [E_lid | E_hid] is shifted left by H_shift % 32 bits by instruction and the left 32 bits are kept as the output. When remainder < 0, each actively working thread in the cooperative group reads one 32-bit word of information data from the high-speed on-chip cache region, namely the information element E_id at position id; when Zc ≤ 16 it is directly shifted left by H_shift % 32 bits by instruction, and when Zc > 16 the information element E_id must first be cyclically shifted right by |remainder| bits to obtain E_id', the two information elements are merged into the 64-bit word [E_id | E_id'], which is shifted left by H_shift % 32 bits by instruction, keeping the left 32 bits as the output result.
In step 8.2, listing the corresponding equation set according to the type B_type of the check matrix H comprises the following. Denote by p_a,z (z = 1, 2, 3, 4) the check subgroups of the first part of the parity bits p_a, and by p_a,1^(k) the result of cyclically shifting p_a,1 (to the left) by k bits, where the shift amount k is determined by B_type. Because of the structure of the sub-matrix B, adding the first 4 layer equations modulo two cancels p_a,2, p_a,3 and p_a,4 and yields

p_a,1^(k) = λ_1 ⊕ λ_2 ⊕ λ_3 ⊕ λ_4,

i.e. the modulo-two addition of the first 4 groups of results cached in step 8.1. Once the check subgroup p_a,1 is known, the remaining check subgroups p_a,2, p_a,3 and p_a,4 are obtained from the equation system by back-substituting p_a,1 (shifted as prescribed by B_type) and the cached λ_z into the individual layer equations.
The calculation of the multiplication of the parity bits p_a of step 8.3 by H includes: the second part of the parity check bits p_c is calculated from the p_a obtained in step 8.2; using the efficient shift operation of step 8.1, p_a is multiplied by the corresponding positions of the check matrix, giving the groups of D_Zc·p_a^T. These groups and the cached results λ_5, …, λ_{m_b} of rows 5 to m_b are added at the corresponding positions, and the sum is the calculation result of the second part of the parity check bits p_c.
The compression of the valid bits in step 9 includes the following. The encoded 5G LDPC code is a systematic code and the codeword obtained after encoding the information bits is long, so the GPU device would need a large amount of time to transmit the full codeword back to the host; therefore, with the encoded-information compression method, only the encoded check bit part of the high-speed on-chip cache is transferred to the global memory, which saves a large amount of transfer time. When remainder is 0, Zc is an integer multiple of 32 and the check bits cached on chip are copied contiguously to the global memory. When remainder is not 0, Zc is not an integer multiple of 32; the bits padded after the Zc bits of each check subgroup must be deleted so that each subgroup keeps only its first Zc check bits, and all bits are then shifted and merged to restore a tight bit arrangement. The tightly arranged check information bits are then packed into groups of 32 bits, the tail that does not fill a full 32 bits is padded with 0, and the packed data is written to the position indicated by dst_offset.
Compared with the prior art, the invention has the advantages and positive effects that:
the invention flexibly divides corresponding computing resources for each code block at a software logic level, supports the simultaneous encoding of the code blocks with different code length and code rate, can effectively reduce the data transmission overhead between a host and equipment in the process of encoding large-scale LDPC code blocks, and improves the flexibility and the practical value of the encoder.
The method fully combines the characteristics of the LDPC coding algorithm and the architecture characteristics of the GPU, fully utilizes GPU resources, improves the access efficiency and the utilization rate of a data computing unit, increases the instruction throughput of operation in a bit packing mode, reduces the resource consumption of a single code block, improves the coding parallelism of the single code block and improves the overall information throughput.
Drawings
FIG. 1 is a schematic diagram of the structure of a GPU-based LDPC high-speed encoder of the present invention.
FIG. 2 is a flow chart of the LDPC high-speed encoding method based on the GPU of the present invention.
Fig. 3 is a schematic diagram of an information compression arrangement according to the present invention.
Fig. 4 is a schematic diagram of bit packing according to the present invention.
FIG. 5 is a schematic diagram of bit-cycling padding according to the present invention.
FIG. 6 is a diagram illustrating bit efficient shifting according to the present invention.
FIG. 7 is a diagram illustrating the compression of the valid code bits according to the present invention.
Fig. 8 is a schematic diagram of splicing compressed parity bits and information bits according to the present invention.
Detailed Description
The invention is explained in detail below with reference to fig. 1 to 8 and the exemplary embodiments.
The invention provides a 5G LDPC high-speed encoder based on a GPU. FIG. 1 shows a schematic diagram of the high-speed encoder, in which N_sm denotes the number of streaming multiprocessors on the GPU chip. The encoder structure comprises a host end and a GPU device end. The host end is provided with a host memory and a CPU chip; the CPU chip is used for preprocessing the coding information, combining codewords, controlling and scheduling the whole coding process, and sending the information stream to be coded to the GPU device end. The GPU device end is provided with a GPU chip composed of a number of streaming multiprocessors (SM); each SM is responsible for encoding a group of LDPC code blocks of different code types. The logic units of each SM include a global memory, a constant memory, a high-speed on-chip memory, a register memory and the like; the high-speed on-chip memory stores the intermediate calculation information of the encoding process, which reduces the latency of the layer-by-layer updating, access and storage of information, while the register memory is used for storing temporary variables such as intermediate quantities generated during calculation and memory access. The large number of threads started in the processors encode the information streams in parallel at high speed. The host end and the GPU device end transfer data through the high-speed serial computer expansion bus (PCI-E).
The invention also provides a coding method of the GPU-based 5G LDPC encoder. The whole coding process can be divided into 11 steps; a flow chart of the method is shown in figure 2, and the specific coding steps are as follows:
Step 1: initialize the storage space of the host end.
Allocate enough storage space at the host end for the information bits and the coded bits according to the maximum number of code blocks num_C processed simultaneously;
Step 2: initialize the storage space of the GPU device end.
The GPU device end configures its memory space, allocates enough global memory space for the information bits and the coded bits, and allocates high-speed on-chip memory according to the maximum resources occupied by the current coding;
Step 3: initialize the LDPC check matrix information at the GPU device end.
At the GPU device end, all LDPC base matrices H_b specified by the 5G protocol are pre-stored, and the offset information H_shift of the base matrices and the column positions H_offset of the cyclic blocks in each row are written into the GPU global memory. The base matrix H_b comprises the sub-matrices A, B, C, D and I.
Step 4: the host copies the coding indication information and the data information to the GPU.
The host end writes the coding information of each code block into a structure comprising the base graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the number of coding layers m_b and the correction value remainder. Information data and coded bits are transferred between the host end and the device end in a compressed arrangement, so the structure further contains the starting position src_offset of the information bits of each code block and the starting position dst_offset at which the bits are stored after coding is completed. The coding indication structures of multiple code blocks form an array of coding indication structures, and the host end transfers the whole array of indication structures for the code blocks waiting for parallel processing to the GPU global memory. The host end packs the information bits of each code block into 32-bit words, arranges the packed bits of all code blocks in compressed form, and copies the information bit data to be coded to the GPU global memory.
Step 5: the host end calls a GPU function to perform LDPC coding.
The number of blocks into which the GPU virtual processors are grouped equals the maximum code block number num_C, the block dimension is one-dimensional, and the multiple blocks process the LDPC codeword coding of different code types in parallel. The host end sets the number of threads started per block to 256, and the thread dimension is one-dimensional.
Step 6: the GPU preprocesses the received information bits according to the coding indication information.
The threads of each block of the GPU obtain the coding indication information of the corresponding code block from the corresponding entry of the structure array, comprising the base graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the correction value remainder, the starting position src_offset of the information bits and the starting position dst_offset for storing the bits after coding is completed. Then, according to the starting position src_offset of the information bits, the number of information bit groups k_b and the lifting value Zc, the information bits s are obtained: starting from the position indicated by src_offset, a space of ceil(Zc*k_b/32)*4 bytes contains the information bits of the code block, where ceil(*) denotes rounding up. s corresponds to the sub-matrix A, i.e. the information bit part received by the encoder; it is divided into k_b information bit subgroups s_i (i = 1, 2, …, k_b), each subgroup corresponding to Zc bits. The GPU cyclically fills each information bit subgroup s_i of the information bits s up to a multiple of 32 according to the correction value remainder, and each information bit subgroup corresponds to one cyclic block of the base matrix.
Step 7: the GPU transfers the processed data information to the high-speed on-chip memory.
Step 8: the GPU encodes the LDPC code in parallel. The encoding of the LDPC code is performed using the check matrix H, which is solved via the equation H·c^T = 0 to obtain the codeword c. The encoding stage comprises the following 3 steps:
Step 8.1: group the threads, establish the mapping relationship between the threads and the information processing, and compute with efficient shift operations the result of multiplying the bits of the k_b information bit subgroups s_i by the check matrix; for each layer j of the check matrix the result is

λ_j = s_1^(h_{j,1}) ⊕ s_2^(h_{j,2}) ⊕ … ⊕ s_{k_b}^(h_{j,k_b}),  j = 1, 2, …, m_b,

where ⊕ denotes modulo-two addition, h_{j,i} is the offset of the cyclic block in row j and column i of the base matrix, and s_i^(h_{j,i}) denotes s_i cyclically left-shifted by h_{j,i} bits (taken as the all-zero group when that cyclic block is zero). The m_b computed group elements λ_1, …, λ_{m_b} are cached in the high-speed on-chip memory in preparation for the calculation of the parity bits in the later steps.
Step 8.2: calculate the first part of the parity check bits p_a according to the coding indication information.
The first part of the parity check bits p_a comprises 4 groups of check bits p_a,z (z = 1, 2, 3, 4), so the first 4 layers of equations in the check matrix H need to be solved. The corresponding equation set is listed according to the type B_type of the check matrix H: the information bit subgroups s_i of step 8.1 are multiplied by the check matrix, the equation set is solved from the multiplication results, the first 4 groups of the vector results λ_1, …, λ_4 cached in the high-speed on-chip memory are read, and the equations are combined to obtain the first part of the parity check bits p_a.
Step 8.3: calculate the second part of the parity check bits p_c from the p_a of step 8.2.
In the manner of step 8.1, the result of multiplying the first part of the parity bits p_a by the check matrix H is computed and added modulo two, group by corresponding group, to the cached results λ_5, …, λ_{m_b}. Since the part of the check matrix corresponding to the second part of the parity bits p_c is the sub-matrix I, the addition result is exactly the calculation result of the second part of the parity check bits p_c.
Step 9: the GPU compresses the encoded parity check bits. The calculated parity check bits p comprise m_b groups in total; from each group the redundant padding bits are deleted according to the correction value remainder to obtain the Zc valid bits of each group, the valid bits are compressed, and the compressed check bit result is written into the global memory at the bit storage starting position dst_offset after coding is completed.
Step 10: transmit the compressed information in the global memory back to the host end.
Step 11: the host end splices the received compressed check information and the information bits to form the complete LDPC coded information. As shown in fig. 8, the host end sends the information bits to the GPU end through the PCI-E bus, the GPU end performs high-speed encoding and returns only the check bits to the host end through the PCI-E bus, and the host end then splices the information bits and the check bits into the complete codeword.
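A host-side illustration of the splicing of step 11, assuming both parts are kept as packed, word-aligned 32-bit data (the helper name and layout are assumptions of this sketch):

```cuda
// Hypothetical host-end splice (step 11): codeword = [information bits | check bits].
#include <cstring>
#include <cstdint>

void splice_codeword(const uint32_t *info_bits, size_t info_words,
                     const uint32_t *check_bits, size_t check_words,
                     uint32_t *codeword) {
    std::memcpy(codeword, info_bits, info_words * sizeof(uint32_t));
    std::memcpy(codeword + info_words, check_bits, check_words * sizeof(uint32_t));
}
```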
The calculation of the correction value remainder in step 4 specifically comprises the following operations: different correction values remainder are set according to the magnitude of the lifting value Zc, and the value of remainder is determined from the remainder of Zc divided by 32. When Zc ≥ 32, remainder = Zc % 32; when Zc < 32, remainder = −(Zc % 32). remainder = 0 means that the information bits can be packed exactly into a set of unsigned 32-bit integer variables. For example, for Zc = 80 ≥ 32, remainder = Zc % 32 = 16.
The specific operation of the compressed arrangement in step 4 is as follows. The PCI-E transmission speed is limited; if a fixed step length were used when coding a group of different LDPC code blocks, LDPC codes with smaller lifting values Zc would have to be padded with a large amount of redundant 0 data during encoding, resulting in a long transmission time. Data is therefore transmitted in a form that compresses the data amount: the host end generates a set of offset information giving the starting position of each code block relative to the starting position of the first code block, which indicates the position of each code block within a group of data; the offset of each code block relative to the starting position of the first code block is the sum of the effective lengths of all preceding code blocks. Compared with a fixed step length, for a lifting value Zc = 2 the transmission can save up to 192 times the amount of transmitted data. As shown in fig. 3, dst_offset_n, P_n and Z_n denote respectively the starting position, the byte size and the lifting value of the n-th code block, with P_n = ceil(m_b*Zc_n/32)*4 bytes. Each code block stores only its own valid bytes P_n; the 0s of the other redundant positions are not stored, and the starting position dst_offset_n of each code block is obtained by adding the byte size P_{n-1} of the previous code block to the starting position dst_offset_{n-1} of the previous code block, i.e. it equals the sum of the effective lengths of all preceding code blocks. This compression mode produces almost no redundant bits and reduces the amount of transmitted data.
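The accumulation of the per-block offsets of fig. 3 can be sketched as follows (a host-side illustration; the container types are assumptions):

```cuda
// Hypothetical computation of the compressed layout of fig. 3:
// dst_offset_n = dst_offset_(n-1) + P_(n-1), with P_n = ceil(m_b * Zc_n / 32) * 4 bytes.
#include <vector>
#include <cstdint>

std::vector<uint32_t> compressed_offsets(const std::vector<int> &m_b,
                                         const std::vector<int> &Zc) {
    std::vector<uint32_t> dst_offset(Zc.size());
    uint32_t pos = 0;                                      // running byte position
    for (size_t n = 0; n < Zc.size(); ++n) {
        dst_offset[n] = pos;                               // start of code block n
        uint32_t P_n = ((m_b[n] * Zc[n] + 31) / 32) * 4;   // valid bytes of block n
        pos += P_n;                                        // no redundant 0s are stored
    }
    return dst_offset;
}
```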
The specific operation of the bit packing in step 4 is as follows: because the information data are bits of value 0 or 1, the information bits are packed into groups of 32 bits. As shown in fig. 4, for a string of information bit data [0 1 0 … 1 1 0 … 0 1 0 … 1], every 32 bits form one group and the bits of each group are placed into exactly one register, so that GPU resources can be fully utilized; each thread in the GPU operates on 32 bits in parallel, which greatly improves the parallel computation efficiency.
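A packing routine along these lines could look as follows (host-side illustration; the least-significant-bit-first ordering within a word is an assumption of the sketch):

```cuda
// Hypothetical bit packing (step 4): 32 information bits (values 0/1) per 32-bit word.
#include <cstdint>
#include <cstddef>

void pack_bits(const uint8_t *bits, size_t n_bits, uint32_t *words) {
    for (size_t i = 0; i < n_bits; ++i) {
        if (i % 32 == 0) words[i / 32] = 0;               // start a fresh word
        words[i / 32] |= (uint32_t)(bits[i] & 1u) << (i % 32);
    }
}
```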
The specific operation of the cyclic filling in step 6 is as follows: according to the size of the correction value remainder, the Zc bits of data in each information bit subgroup s_i are padded. Because a register in the GPU is 32 bits (4 bytes), the data must be padded to an integer multiple of 32 so that the information data are stored with 32-bit words as basic symbols; each s_i therefore occupies M registers, M = ceil(Zc/32), denoted R_t, t = 1, 2, …, M, and the actually valid information data only cover the first Zc bits of the M registers. Specifically, when remainder = 0, Zc is an integer multiple of 32, the valid information of each group fills all M registers, and no data needs to be padded. When remainder > 0, Zc is not an integer multiple of 32 and Zc > 32; s_i does not fill the M registers, and after the Zc bits actually occupied by each s_i, the remaining data of the last register are completed by filling with the header data bits of the same subgroup s_i. When remainder < 0, Zc < 32, each s_i occupies one register, the actually valid information data only cover the first Zc bits, and the Zc bits are cyclically repeated until the 32 bits are full. The purpose of the cyclic filling is to enable efficient operation on the data and the reverse shifts within the information subgroups in the subsequent steps. As shown in fig. 5, for the first information bit subgroup s_1 with Zc = 80 and remainder = 16, the bit indices are [0, 1, 2, 3, …, 79]. Since floor(80/32) = 2 and 80 % 32 = 16, the first two 32-bit symbols are filled normally with the bits of index [0, 1, 2, …, 63]; the third symbol is split into two parts of 16 bits each, its first half is filled with the bits of index [64, 65, …, 79], and its second half is filled with the 16 bits of index [0, 1, …, 15].
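The filling rule can be illustrated with the following device-side sketch, which reproduces the Zc = 80 example above bit by bit (the bit ordering and buffer layout are assumptions):

```cuda
// Hypothetical cyclic fill (step 6): extend the Zc valid bits of a subgroup s_i to
// M*32 bits by wrapping around to the subgroup's first bits, as in the Zc=80 example.
#include <cstdint>

__device__ void cyclic_fill(const uint8_t *s_i /* Zc bits, values 0/1 */,
                            int Zc, uint32_t *R /* M = ceil(Zc/32) words */) {
    const int M = (Zc + 31) / 32;
    for (int b = 0; b < M * 32; ++b) {
        uint32_t bit = s_i[b % Zc] & 1u;          // wrap around after Zc bits
        if (b % 32 == 0) R[b / 32] = 0;
        R[b / 32] |= bit << (b % 32);
    }
}
```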
The LDPC encoding in step 8 specifically operates as follows. The LDPC codeword can be divided into three parts, c = [s p_a p_c], where s corresponds to the sub-matrix A, i.e. the information bits received by the encoder, and p_a and p_c correspond respectively to the sub-matrices B and I. The encoding of the LDPC code is performed using equation (3):

H·c^T = [ A_Zc  B_Zc  0    ]   [ s^T   ]
        [ C_Zc  D_Zc  I_Zc ] · [ p_a^T ] = 0          (3)
                               [ p_c^T ]

Equation (3) then naturally splits into equation (4) and equation (5):

A_Zc·s^T ⊕ B_Zc·p_a^T = 0          (4)

C_Zc·s^T ⊕ D_Zc·p_a^T ⊕ p_c^T = 0          (5)

In equations (4) and (5), T denotes the transpose of a matrix.
In step 8.1, the specific operation of the thread grouping is as follows. The GPU divides the 256 threads of each block into 16 groups; since Zc is at most 384 and each subgroup therefore contains at most 384 bits of data, every 16 threads form a cooperative group, and each thread is responsible for the operation and calculation of 32 bits of data. When processing the multi-code-type LDPC coding, the number of threads actually participating in the operation within a cooperative group is ceil(Zc/32). Each cooperative group is responsible for the multiplication of the bits corresponding to all cyclic shift blocks of one layer of the check matrix H by the check matrix; the solution of each layer finally yields one group of results. The 256 threads support simultaneously solving the multiplication of at most 16 layers of the check matrix H, computing 16 groups of results each time, after which all threads solve the following 16 layers, until all m_b groups of results have been solved.
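The mapping of threads onto cooperative groups described here can be expressed as in the following device-side sketch (variable names are illustrative):

```cuda
// Hypothetical mapping of the 256 threads of a block onto 16 cooperative groups of
// 16 threads (step 8.1); only ceil(Zc/32) threads of each group are active.
#include <cstdint>

__device__ void group_mapping(int Zc, int &group_id, int &tid_in_group, bool &active) {
    group_id     = threadIdx.x / 16;    // which layer of the current batch of 16 layers
    tid_in_group = threadIdx.x % 16;    // which 32-bit word of the Zc-bit subgroup
    int M        = (Zc + 31) / 32;      // words actually needed, at most 384/32 = 12
    active       = (tid_in_group < M);  // the remaining threads of the group stay idle
}
// Layers are then processed in batches of 16: layer = batch * 16 + group_id,
// looping over the batches until all m_b layers have been solved.
```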
In step 8.1, the specific operation of multiplying the information bit subgroups s_i by the check matrix is as follows. The multiplication result of layer j is denoted λ_j, and s^T denotes the transpose of s. The matrices A_Zc and C_Zc are composed of cyclic shift blocks, so the multiplication of A_Zc by s^T and of C_Zc by s^T can be expressed as

λ_j = s_1^(h_{j,1}) ⊕ s_2^(h_{j,2}) ⊕ … ⊕ s_{k_b}^(h_{j,k_b}),

where the multiplication result of A_Zc by s^T is [λ_1, …, λ_4] and the multiplication result of C_Zc by s^T is [λ_5, …, λ_{m_b}]. Here i indicates the position of the information subgroup and also the column index of the cyclic shift block in the base matrix (i = 1, 2, …, k_b), j indicates the row index of the cyclic shift block in the base matrix (j = 1, 2, …, m_b), ⊕ denotes modulo-two addition, and h_{j,i} is the entry in row j and column i of the base matrix, which is used to judge whether the cyclic shift block is 0: if it is 0, then s_i^(h_{j,i}) = 0; if it is not 0, then s_i^(h_{j,i}) is the information bit subgroup s_i corresponding to the non-zero sub-block. When the cyclic shift block is not 0, it has a cyclic shift amount h_{j,i}, which represents the magnitude of the cyclic shift that the information bit subgroup s_i should undergo; s_i^(h_{j,i}) therefore denotes the result of cyclically shifting s_i by h_{j,i}.
The specific operation of the efficient shift of step 8.1 is as follows. Each layer is traversed in turn, and the actively working threads in the cooperative group read the base matrix offset H_shift; the index of each thread is tid. When remainder ≥ 0, each actively working thread in the cooperative group reads two 32-bit words of information data from the high-speed on-chip cache region, namely the two information elements E_lid and E_hid at positions lid and hid, where lid is determined from the offset H_shift, lid = (tid + floor(H_shift/32)) % M, and hid is determined from lid, hid = (lid + 1) % M. If remainder = 0, the threads of the cooperative group directly merge the two information elements into the 64-bit word [E_lid | E_hid], shift it left by H_shift % 32 bits by instruction, and keep the left 32 bits as the output. If remainder > 0, the threads of the cooperative group split into two operations: for the first part of the threads, hid % M = 0, meaning that the register of lid is the last of the M registers while hid is the first register, so the two are not contiguous in space; the element at position hid must first be shifted left by 32 − remainder bits to form a new hid position element, and then the two information elements are merged into the 64-bit word [E_lid | E_hid]; for the second part of the threads, hid % M > 0, meaning that lid and hid are contiguous, and the calculation process is the same as in the case remainder = 0, forming [E_lid | E_hid]; finally, [E_lid | E_hid] is shifted left by H_shift % 32 bits by instruction and the left 32 bits are kept as the output. When remainder < 0, each actively working thread in the cooperative group reads one 32-bit word of information data from the high-speed on-chip cache region, namely the information element E_id at position id; when Zc ≤ 16 it is directly shifted left by H_shift % 32 bits by instruction, and when Zc > 16 the information element E_id must first be cyclically shifted right by |remainder| bits to obtain E_id', the two information elements are merged into the 64-bit word [E_id | E_id'], which is shifted left by H_shift % 32 bits by instruction, keeping the left 32 bits as the output result.
Fig. 6 illustrates the first subgroup s_1 for remainder = 16 and Zc = 80. The first subgroup s_1 has the index sequence [0, 1, 2, …, 79, 0, 1, …, 15] and occupies M = 3 registers, the data being loaded in turn into the registers m_1, m_2 and m_3. The 3 actively working threads tid = 0, 1, 2 of each cooperative group are responsible for reading two 32-bit words of information data from the high-speed on-chip cache region, the two information elements E_lid and E_hid at positions lid and hid, where lid is determined by the offset H_shift (H_shift = 10 in fig. 6), lid = (tid + floor(H_shift/32)) % M, and hid = (lid + 1) % M. Thread 0 therefore loads the data of m_1 and m_2, thread 1 loads m_2 and m_3, and thread 2 loads m_3 and m_1. Thread 2 satisfies hid % M = 0, which means that the register of its lid is the last of the M registers and its hid is the first register; the two are not contiguous in space, so the element at position hid is first shifted left by 16 bits to form a new hid position element, and then the two information elements are merged into the 64-bit word [E_lid | E_hid]. Thread 0 and thread 1 satisfy hid % M > 0, which means that lid and hid are contiguous, and the calculation process is the same as in the case remainder = 0, forming [E_lid | E_hid]. Finally, [E_lid | E_hid] is shifted left by H_shift % 32 = 10 bits by instruction and the left 32 bits are kept as output, giving the output result shown in fig. 6.
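Put together, the remainder ≥ 0 case of this shift can be sketched as below; the helper operates on the cached register words E of one subgroup, and the bit ordering inside a word (most significant bit first, matching the "left" shifts of the text) is an assumption of the sketch.

```cuda
// Hypothetical efficient shift (step 8.1) for remainder >= 0: each active thread
// merges two cached 32-bit words and extracts one 32-bit word of the cyclically
// shifted subgroup s_i^(H_shift).
#include <cstdint>

__device__ uint32_t shifted_word(const uint32_t *E /* M cached words of s_i */,
                                 int M, int remainder, int tid, int H_shift) {
    int lid = (tid + H_shift / 32) % M;   // word holding the wanted bits
    int hid = (lid + 1) % M;              // the following word
    uint32_t e_lid = E[lid];
    uint32_t e_hid = E[hid];
    if (remainder > 0 && hid == 0) {
        // lid is the last and hid the first register: not contiguous in space,
        // so realign the wrapped word first (shift left by 32 - remainder bits).
        e_hid <<= (32 - remainder);
    }
    uint64_t merged = ((uint64_t)e_lid << 32) | e_hid;   // [E_lid | E_hid]
    return (uint32_t)((merged << (H_shift % 32)) >> 32); // keep the left 32 bits
}
```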
In step 8.2, listing the corresponding equation set according to the type B_type of the H matrix comprises the following. Writing equation (4) in block form expands it into the following equation set (6):

λ_z ⊕ B_Zc(z,1)·p_a,1^T ⊕ B_Zc(z,2)·p_a,2^T ⊕ B_Zc(z,3)·p_a,3^T ⊕ B_Zc(z,4)·p_a,4^T = 0,  z = 1, 2, 3, 4,          (6)

where p_a,z denotes the z-th check subgroup of p_a, B_Zc(z,x) denotes the cyclic shift block in row z and column x of B_Zc, and p_a,1^(k) denotes the result of cyclically shifting p_a,1 (to the left) by k bits. Because of the structure of the sub-matrix B, adding all the equations of (6) modulo two cancels p_a,2, p_a,3 and p_a,4 and yields the check subgroup p_a,1 up to a cyclic shift:

p_a,1^(k) = λ_1 ⊕ λ_2 ⊕ λ_3 ⊕ λ_4,

i.e. the modulo-two addition of the first 4 groups of results cached in step 8.1, where the shift amount k is determined by B_type. Once the check subgroup p_a,1 is known, the remaining check subgroups p_a,2, p_a,3 and p_a,4 are obtained from the equation system (6) by back-substituting p_a,1 (shifted as prescribed by B_type) and the cached λ_z into the individual layer equations.
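The first stage of this solution, the accumulation of the shifted p_a,1, could be written as below; the per-group word loop and the abstract cyclic-shift helper mentioned in the comment (built from the efficient shift of step 8.1) are illustrative assumptions.

```cuda
// Hypothetical first stage of step 8.2: the modulo-two sum of the cached results
// lambda_1..lambda_4 gives the check subgroup p_a,1 up to a cyclic shift whose
// amount depends on B_type.
#include <cstdint>

__device__ void accumulate_pa1(const uint32_t *lambda /* 4 layers x M words */,
                               int M, uint32_t *pa1_shifted /* M words */) {
    for (int w = threadIdx.x % 16; w < M; w += 16) {      // words handled by the group
        pa1_shifted[w] = lambda[0 * M + w] ^ lambda[1 * M + w]
                       ^ lambda[2 * M + w] ^ lambda[3 * M + w];
    }
    // p_a,1 is then recovered by undoing the B_type-dependent cyclic shift
    // (e.g. a cyclic_shift_zc helper), and p_a,2..p_a,4 follow by back-substitution
    // into the individual layer equations of set (6).
}
```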
Wherein, the specific operation of the multiplication of p_a by H in step 8.3 is: the second part of parity bits p_c is calculated using equation (5); with the p_a calculated in step 8.2, p_a is multiplied at the corresponding positions by the check matrix through the efficient shift operation of step 8.1, and the result is then added, at the corresponding positions, to the 5-th through m_b-th groups of the multiplication results cached in step 8.1, which gives the calculation result of the second part of parity check bits p_c.
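The step just described can be sketched per block row as follows. This is an illustrative CUDA fragment, not the patent's code: it reuses cyclic_shift_word and MAX_WORDS from the sketches above, assumes lambda_j holds the step-8.1 words of one block row of C·s^T, pa holds the four solved check subgroups as packed words, and d_shift gives the offsets of the four D-entries of this row with a negative value marking a zero block; these names and the data layout are assumptions.

```cuda
#ifndef MAX_WORDS
#define MAX_WORDS 12
#endif

// Sketch of step 8.3 for one block row j (5 <= j <= m_b): pc^(j) equals lambda_j
// XOR the D-row contribution, where the D contribution is formed with the same
// efficient shift as in step 8.1 and the identity sub-matrix I makes the sum
// directly equal to the parity words of pc^(j).
__device__ void compute_pc_row(const unsigned int *lambda_j,        // cached words for this row
                               const unsigned int pa[4][MAX_WORDS], // solved check subgroups of p_a
                               const int d_shift[4],                // offsets of row j of D, <0 for a zero block
                               unsigned int *pc_j,                  // output: packed words of pc^(j)
                               int M, int remainder)
{
    for (int w = threadIdx.x; w < M; w += blockDim.x) {
        unsigned int acc = lambda_j[w];                 // C * s^T contribution from step 8.1
        for (int z = 0; z < 4; ++z)                     // add the D * p_a^T contribution
            if (d_shift[z] >= 0)
                acc ^= cyclic_shift_word(pa[z], w, M, d_shift[z], remainder);
        pc_j[w] = acc;
    }
}
```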
The specific operation of compressing the effective coded bits in step 9 is as follows. The encoded 5GLDPC code is a systematic code, the codeword obtained after encoding the information bits is long, and the GPU device would need to spend a large amount of time transferring the whole codeword back to the host; therefore only the encoded check-bit part in the high-speed on-chip cache is carried to the global memory, using the coded-information compression method, which saves a large amount of transfer time. When remainder is 0, Zc is an integer multiple of 32, and the bits cached on chip are copied contiguously to the global memory; when remainder is not 0, Zc is not an integer multiple of 32, the bits filled after the first Zc bits of each sub-group need to be deleted so that each sub-group contains only its first Zc bits, and all bits are then shifted and merged to restore a tight bit arrangement.
As shown in Fig. 7, Zc = 80 and remainder = 16. Taking the first group as an example, its index is [0, 1, 2, 3, …, 79]. Since floor(80/32) = 2 and 80 % 32 = 16, the first 2 32-bit symbols are carried over normally, covering the indices [0, 1, 2, …, 63]; the 3rd symbol is split into two parts of 16 bits each: its first half is filled with the 16 information bits with indices [64, 65, …, 79], and its second half is filled with the first 16 bits, indices [0, 1, …, 15], of the next group. The tightly arranged check information bits are then packed into groups of 32 bits, the tail that does not fill a complete 32 bits is padded with 0, and the packed data are written to the position indicated by dst_offset.
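A host-style rendering of this repacking is sketched below for clarity; it is not the patent's GPU kernel. It assumes MSB-first packing within each 32-bit word (bit index 0 at the most significant bit, matching the left-shift convention used above) and that every group carries exactly Zc valid bits followed by fill bits; function and variable names are illustrative.

```cuda
#include <cstdint>
#include <vector>

// Strip the fill bits after the first Zc bits of every group and repack the
// remaining bits tightly into 32-bit words, zero-padding the final word.
std::vector<uint32_t> compress_parity(const std::vector<std::vector<uint32_t>> &groups,
                                      int Zc)
{
    std::vector<uint32_t> out;
    uint64_t buf = 0;   // bit accumulator, valid bits kept left-aligned
    int filled = 0;     // number of valid bits currently in 'buf'

    auto push_bits = [&](uint32_t word, int nbits) {    // append the top 'nbits' of 'word'
        buf |= (uint64_t)(word >> (32 - nbits)) << (64 - filled - nbits);
        filled += nbits;
        while (filled >= 32) {                          // flush complete 32-bit words
            out.push_back((uint32_t)(buf >> 32));
            buf <<= 32;
            filled -= 32;
        }
    };

    for (const auto &g : groups) {
        int left = Zc;                                  // keep only the first Zc bits per group
        for (size_t w = 0; w < g.size() && left > 0; ++w) {
            int take = left >= 32 ? 32 : left;
            push_bits(g[w], take);
            left -= take;
        }
    }
    if (filled > 0)                                     // zero-pad the tail word
        out.push_back((uint32_t)(buf >> 32));
    return out;
}
```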

Claims (9)

1. A coding method of a 5GLDPC coder based on a GPU is disclosed, wherein the coder comprises a host end and a GPU equipment end; the method comprises the steps that a host memory and a CPU chip are arranged at a host end, and the CPU chip is used for preprocessing coded information, combining code words, controlling and scheduling the whole coding process and sending an information stream to be coded to a GPU (graphic processing unit) device end; the GPU equipment end is provided with a GPU chip, the GPU chip consists of a plurality of stream multiprocessors SM, each SM is responsible for coding a group of low-density parity check code LDPC code blocks with different code types, and a large number of threads started in the SM carry out high-speed parallel coding on information streams; the host side and the GPU equipment side carry out data transmission through a high-speed serial computer expansion bus PCI-E;
Based on the above-mentioned encoder, it is characterized in that the encoding method comprises the following specific steps:
step 1: initializing a storage space of a host end;
allocating enough storage space for information bits and coded bits at a host end according to the maximum code block number num _ C processed simultaneously;
step 2: initializing a storage space of a GPU (graphics processing unit) device end;
the GPU equipment side configures a memory space, allocates enough global memory space for information bits and coded bits, and allocates memory space on a high-speed chip to the GPU according to the maximum occupied resource of the current coding;
step 3: initializing the LDPC check matrix information at the GPU device end;
the GPU device end pre-stores the information of all the LDPC base matrices H_b specified by the 5G protocol, and writes the offset information H_shift of the base matrix and the position H_offset, within its row, of each cyclic block into the global memory of the GPU; the base matrix H_b comprises the matrices A, B, C, D, I;
step 4: copying the coding indication information and data information from the host to the GPU;
the host end writes the coding information of each code block into a structure, which comprises the basic graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the number of coding layers m_b, and the corrected value remainder; the information data and coded bits are transmitted between the host end and the device end in a compressed arrangement, so the structure further includes the start position src_offset of the information bits of each code block and the bit storage start position dst_offset after encoding; the coding indication structures of multiple code blocks form an array of coding indication structures, and the host end transmits the whole array of indication structures for the code blocks awaiting parallel processing to the GPU global memory; the host end packs the information bits of each code block into 32-bit words, then the packed bits of all code blocks are arranged compactly, and the information bit data to be encoded are copied to the GPU global memory;
step 5: calling a GPU function at the host side to perform the LDPC coding;
the host end sets the number of blocks into which the GPU virtual processors are grouped equal to the maximum code block number num_C, with a one-dimensional block dimension, so that multiple blocks process the LDPC codeword encoding of different code types in parallel; the host end sets the number of threads started per block to 256, with a one-dimensional thread dimension;
step 6: the GPU carries out preprocessing on the received information bits according to the coding indication information;
the threads of each block of the GPU obtain the coding indication information of the corresponding code block from the corresponding structure array, including the basic graph BG, the lifting value Zc, the number of information bit groups k_b, the check matrix type B_type, the corrected value remainder, the start position src_offset of the information bits, and the bit storage start position dst_offset after encoding; then, according to the start position src_offset of the information bits, the number of information bit groups k_b and the lifting value Zc, the information bits are acquired: starting from the position indicated by src_offset, ceil(Zc·k_b / 32) × 4 bytes of space contain the information bits of the code block, where ceil(·) denotes rounding up; the bits are divided into k_b information bit subgroups s_i, i = 1, 2, …, k_b, each group corresponding to Zc bits; the GPU cyclically fills each information bit subgroup s_i up to a multiple of 32 according to the corrected value remainder, each subgroup corresponding to one cyclic block of the base matrix;
step 7: the GPU carries the processed data information into the high-speed on-chip memory;
step 8: the GPU performs the LDPC encoding in parallel; the encoding of the LDPC code is performed using the check matrix H by solving the equation H·c^T = 0 to obtain the codeword c, where c^T denotes the transpose of the codeword c; the encoding stage comprises the following 3 steps in total:
Step 8.1: grouping the threads, establishing the mapping relation between threads and information processing, and calculating, by means of the efficient shift operation, the results of multiplying the k_b information bit subgroups s_i at the corresponding bits by the check matrix; the m_b groups of elements of the calculation results are cached in the high-speed on-chip memory in preparation for the parity-check-bit calculations of the following steps;
step 8.2: calculating a first portion of parity bits based on the coding indication informationp a
First part parity check bitsp a Comprising 4 groups of check bits
Figure 164035DEST_PATH_IMAGE003
z=1,2, 3, 4; solving the front 4 layers of equations in the check matrix H, listing the corresponding equation set according to the type B _ type of the check matrix H, and grouping the information bits in the step 8.1s i Multiplying the corresponding bit by check matrix, solving equation set according to the multiplication result, reading the first 4 sets of results of vector result cached in the high-speed on-chip memory, and solving the first part of parity check bitp a
Step 8.3: calculating the second part of parity bits p_c from the p_a of step 8.2;
the result of multiplying the first part of parity bits p_a by the check matrix H is calculated in the manner of step 8.1, and this result is then modulo-2 added, group by group, to the corresponding groups cached in step 8.1; since the check matrix part corresponding to the second part of parity bits p_c is the sub-matrix I, the addition result is directly the second part of parity check bits p_c;
step 9: the GPU compresses the encoded parity check bits; the parity check bits p comprise m_b groups in total, the redundant bits of each group are deleted according to the corrected value remainder to obtain the Zc effective bits of each group, the effective bits are compressed, and the compressed check result is written into the global memory according to the bit storage start position dst_offset;
step 10: transmitting the compression information in the global memory back to the host end;
step 11: and the host end splices the received compression check information and the information bits to form complete LDPC coding information.
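To make the host/device split of claim 1 concrete, the following CUDA host-side sketch outlines steps 4, 5 and 10: copying the packed information bits and the per-code-block indication structures to the device, launching one 256-thread block per code block, and copying the compressed check bits back. The structure layout, kernel name and buffer sizes are illustrative assumptions, not the patent's actual interfaces, and the kernel body (steps 6 to 9) is left as a stub.

```cuda
#include <cuda_runtime.h>

// Illustrative per-code-block indication structure (field set taken from claim 1).
struct CodeBlockInfo {
    int BG, Zc, kb, B_type, mb, remainder;
    int src_offset, dst_offset;
};

// Stub kernel: one thread block (256 threads) encodes one code block.
// Steps 6-9 (preprocessing, on-chip caching, encoding, compression) would go here.
__global__ void ldpc_encode_kernel(const unsigned int *info_bits,
                                   const CodeBlockInfo *cb_info,
                                   unsigned int *check_bits)
{
    // ... device-side encoding as sketched in the description above ...
}

void encode_batch(const unsigned int *h_info, size_t info_words,
                  const CodeBlockInfo *h_cb, int num_C,
                  unsigned int *h_check, size_t check_words)
{
    unsigned int *d_info = nullptr, *d_check = nullptr;
    CodeBlockInfo *d_cb = nullptr;
    cudaMalloc((void **)&d_info,  info_words  * sizeof(unsigned int));
    cudaMalloc((void **)&d_check, check_words * sizeof(unsigned int));
    cudaMalloc((void **)&d_cb,    num_C       * sizeof(CodeBlockInfo));

    // Step 4: copy indication structures and compressed information bits to the GPU.
    cudaMemcpy(d_info, h_info, info_words * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cb,   h_cb,   num_C      * sizeof(CodeBlockInfo), cudaMemcpyHostToDevice);

    // Step 5: num_C blocks of 256 threads, one block per code block.
    ldpc_encode_kernel<<<num_C, 256>>>(d_info, d_cb, d_check);

    // Step 10: copy the compressed check bits back to the host.
    cudaMemcpy(h_check, d_check, check_words * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(d_info);
    cudaFree(d_check);
    cudaFree(d_cb);
}
```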
2. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: in step 3, the LDPC base matrix H_b information includes: the 5G NR standard specifies two base matrices, BG1 and BG2, whose structure is shown in formula (1), with size m_b × n_b, where m_b denotes the number of coding layers, i.e. the number of rows of the base matrix, n_b denotes the number of columns of the base matrix, and n_b = m_b + k_b:
H_b = [ A  B  0
        C  D  I ]        (1)
the 0 region in the upper right of the base matrix H_b indicates that this part is entirely 0, where A has dimension 4 × k_b, B has dimension 4 × 4, C has dimension (m_b − 4) × k_b, D has dimension (m_b − 4) × 4, and I has dimension (m_b − 4) × (m_b − 4); the base matrix H_b contains the offsets H_shift corresponding to all of the cyclic blocks, the position H_offset of each cyclic block within its row is obtained by calculation, and all of these basic information structures are transferred into the GPU memory before encoding.
3. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: in step 4, the check matrix type B_type includes: the sub-matrix B is a 4 × 4 square matrix with 4 different types, and the 5G protocol specifies a fixed B_type corresponding to the BG pattern;
the corrected value remainder includes: different corrected values remainder are set according to the magnitude of the lifting value Zc, the magnitude of the corrected value remainder being determined by the remainder of Zc with respect to 32: when Zc ≥ 32, the corrected value remainder = Zc % 32; when Zc < 32, the corrected value remainder = −Zc % 32; remainder = 0 indicates that the information bits pack exactly into a set of unsigned 32-bit integer variables;
the compressed arrangement includes: the PCI-E transmission speed is limited; if a fixed step length were selected when encoding a group of LDPC code blocks of different code types, the code blocks with smaller lifting values Zc would need to be padded with a large amount of redundant 0 data and the transmission time would be long, so the data are transmitted in a compressed form: the host end generates, for the group of data, the offset information of each code block's start position relative to the start position of the first code block, used to indicate the position of each code block within the group, the offset of each code block relative to the first code block's start position being the sum of the effective lengths of all preceding code blocks;
the bit packing includes: because the information data are bits of 0 or 1, packing the information bits into groups of every 32 bits allows the GPU resources to be fully utilized, with each thread in the GPU handling 32 bits for parallel computation.
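A minimal host-side sketch of this packing is given below. It assumes MSB-first ordering within each 32-bit word (the claim does not fix the bit order) and uses illustrative names; it is not asserted to be the patent's exact routine.

```cuda
#include <cstdint>
#include <vector>

// Pack a stream of 0/1 information bits into 32-bit words, MSB-first, so that each
// GPU thread can later operate on one 32-bit word at a time.
std::vector<uint32_t> pack_bits(const std::vector<uint8_t> &bits)
{
    std::vector<uint32_t> words((bits.size() + 31) / 32, 0u);
    for (size_t i = 0; i < bits.size(); ++i)
        if (bits[i])
            words[i / 32] |= 1u << (31 - (i % 32));   // bit i of the stream -> MSB-first slot
    return words;
}
```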
4. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: in step 6, the cyclic filling comprises: according to the size of the corrected value remainder, the Zc bits of data in each information bit subgroup s_i are filled; since the registers in the GPU are 32 bits wide, the data must be filled up to an integer multiple of 32 bits to form information data stored with 32-bit symbols as the basic unit, so that each subgroup s_i occupies M registers, M = ceil(Zc/32), denoted R_t, t = 1, 2, …, M, the actually valid information data covering only the first Zc bits of the M registers; when remainder = 0, Zc is an integer multiple of 32, the effective information of each group exactly fills all M registers, and no data need to be filled; when remainder > 0, Zc is not an integer multiple of 32 and Zc > 32, s_i does not fill up the M registers, and after the Zc bits actually occupied by each s_i, the data bits at the head of the same group s_i are appended until the data in the last register are complete; when remainder < 0, Zc < 32, each s_i occupies one register, the actually valid information data cover only the first Zc bits, and those Zc bits are cyclically filled until the 32 bits are complete.
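For illustration, the cyclic filling can be sketched on the host as follows, again with MSB-first packing assumed and illustrative names; in the method itself this is performed on the GPU as part of step 6.

```cuda
#include <cstdint>
#include <vector>

// Fill one information subgroup of Zc bits up to M = ceil(Zc/32) packed words by
// cyclically re-reading the subgroup from its head, as described in claim 4
// (this single loop covers both the Zc > 32 and the Zc < 32 cases).
std::vector<uint32_t> cyclic_fill(const std::vector<uint8_t> &subgroup_bits)  // Zc bits of s_i
{
    const int Zc = (int)subgroup_bits.size();
    const int M  = (Zc + 31) / 32;
    std::vector<uint32_t> words(M, 0u);
    for (int pos = 0; pos < 32 * M; ++pos)
        if (subgroup_bits[pos % Zc])                    // wrap back to the head of s_i
            words[pos / 32] |= 1u << (31 - (pos % 32));
    return words;
}
```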
5. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: in step 8, the check matrix H is obtained by expanding the base matrix H_b; the offset indicated by an element of a row of the base matrix H_b corresponds to the offset of a Zc × Zc identity matrix in one layer of the check matrix H; the check matrix H is expressed as formula (2), where the subscript Zc denotes the Zc-fold expansion of the elements of the base matrix H_b:
H = [ A_Zc  B_Zc  0
      C_Zc  D_Zc  I_Zc ]        (2)
the LDPC encoding includes: the LDPC codeword is divided into three parts, c = [s  p_a  p_c], where p_a and p_c correspond respectively to the sub-matrix B and the sub-matrix I; the encoding of the LDPC code is performed using the equation H·c^T = 0.
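As a small illustration of this expansion, the sketch below materializes one Zc × Zc block from a base-matrix entry, using the usual quasi-cyclic convention that a shift value expands to the identity matrix cyclically shifted by that many columns and a null entry expands to the zero block; the dense representation is only for clarity, since the encoder itself never builds H explicitly.

```cuda
#include <vector>

// Expand one base-matrix entry into its Zc x Zc binary block: the identity matrix
// cyclically shifted by 'shift' columns, or the all-zero block for a null entry
// (shift < 0 marks a null entry here; this encoding is an assumption).
std::vector<std::vector<int>> expand_base_entry(int shift, int Zc)
{
    std::vector<std::vector<int>> block(Zc, std::vector<int>(Zc, 0));
    if (shift < 0)
        return block;                       // zero block
    for (int r = 0; r < Zc; ++r)
        block[r][(r + shift) % Zc] = 1;     // row r has its 1 in column (r + shift) mod Zc
    return block;
}
```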
6. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: the thread grouping in step 8.1 specifically includes: the GPU divides the 256 threads of each block into 16 groups; Zc is at most 384, so each subgroup contains at most 384 bits of data, every 16 threads form a cooperation group, and each thread is responsible for the operations on 32 bits of data; when multi-code-type LDPC encoding is processed, the number of threads actually participating in the operation within a cooperation group is ceil(Zc/32); each cooperation group is responsible for solving the multiplication of the corresponding bits of all cyclic shift blocks of one layer of the check matrix H with the check matrix, the solution of each layer finally producing one group of results; the 256 threads support the simultaneous solution of the bit-by-check-matrix multiplications of at most 16 layers of the check matrix H, computing 16 groups of results at a time, after which all threads solve the operations of the next 16 layers, until all m_b groups of results have been solved;
the multiplication of the information bit subgroups s_i in step 8.1 at the corresponding bits by the check matrix specifically includes: the multiplication results are denoted λ_j, s^T denotes the transpose of s, and the matrices A_Zc and C_Zc are composed of cyclic shift blocks, so that the multiplication of A_Zc with s^T and the multiplication of C_Zc with s^T can both be expressed as
λ_j = ⊕_{i=1..k_b} δ(h_{j,i}) · s_i^(h_{j,i}),   j = 1, 2, …, m_b,
where the rows j = 1, …, 4 give the result of multiplying A_Zc with s^T and the rows j = 5, …, m_b give the result of multiplying C_Zc with s^T; i indicates the position of the information subgroup and the column of the cyclic shift block in the base matrix, i = 1, 2, …, k_b; j indicates the row of the cyclic shift block in the base matrix, j = 1, 2, …, m_b; ⊕ denotes modulo-2 addition; h_{j,i} denotes the entry in row j, column i of the base matrix, used to judge whether the corresponding cyclic shift block is 0: if it is 0, δ(h_{j,i}) = 0 and the term is dropped; if it is not 0, δ(h_{j,i}) = 1 and the term s_i^(h_{j,i}) is kept, i.e. the information bit subgroup s_i corresponding to the non-zero sub-block; when the cyclic shift block is not 0, h_{j,i} gives the magnitude of the cyclic shift that the information bit subgroup s_i should undergo, so that s_i^(h_{j,i}) denotes the cyclic shift result of s_i by h_{j,i};
the efficient shifting in step 8.1 comprises: traversing each layer, the effectively working threads in the cooperation group read the offset H_shift of the base matrix, the label of each thread being tid; when remainder ≥ 0, the actively working threads in the cooperation group read two 32-bit data words from the high-speed on-chip cache, namely the two information elements E_lid and E_hid at positions lid and hid, where lid is determined from the offset H_shift as lid = (tid + H_shift) % M and hid is determined from lid as hid = (lid + 1) % M; if remainder = 0, the threads in the cooperation group directly merge the two information elements into the 64-bit value [E_lid | E_hid], shift it left by H_shift % 32 bits with a single instruction, and keep the left 32 bits as the output; if remainder > 0, the threads in the cooperation group fall into two cases: for the first part of the threads, hid % M = 0, the register holding lid is the last of the M registers and hid is the first register, the two registers are not spatially contiguous, the element at position hid must first be shifted left by 32 − remainder bits to form a new hid element, and the two information elements are then merged into the 64-bit value [E_lid | E_hid]; for the second part of the threads, hid % M > 0, meaning that lid and hid are contiguous, and the computation is the same as in the remainder = 0 case, forming [E_lid | E_hid]; finally [E_lid | E_hid] is shifted left by H_shift % 32 bits with a single instruction and the left 32 bits are kept as the output; when remainder < 0, the actively working thread in the cooperation group reads one 32-bit data word from the on-chip cache, namely the information element E_id at position id; when Zc ≤ 16, it is shifted left by H_shift % 32 bits directly with a single instruction; when Zc > 16, the information element E_id must first be cyclically shifted right by remainder bits to obtain E_id′, the two information elements are then merged into the 64-bit value [E_lid | E_hid], which is shifted left by H_shift % 32 bits with a single instruction, keeping the left 32 bits as the output result.
7. The encoding method of a GPU-based 5GLDPC encoder according to claim 6, wherein: in step 8.2, listing the corresponding system of equations according to the check matrix H type B_type includes: p_a^(z) is used to represent the z-th check subgroup of the first part of check bits p_a, z = 1, 2, 3, 4, and each shifted term represents the (left) cyclic shift of the corresponding check subgroup by the offset of the associated cyclic block; adding all of the equations yields the result of one check subgroup, namely the modulo-2 addition of the first 4 groups of the multiplication results cached in step 8.1; once this check subgroup is known, the remaining three check subgroups of p_a are obtained from the system of equations.
8. The encoding method of a GPU-based 5GLDPC encoder according to claim 7, wherein: with the p_a calculated in step 8.2, p_a is multiplied at the corresponding positions by the check matrix through the efficient shift operation of step 8.1, and the result is then added, at the corresponding positions, to the 5-th through m_b-th groups of the multiplication results cached in step 8.1, which gives the result of the second part of parity check bits p_c.
9. The encoding method of a GPU-based 5GLDPC encoder according to claim 1, wherein: in step 9, compressing the valid bits comprises: the encoded 5GLDPC code is a systematic code, the codeword obtained after encoding the information bits is long, and the GPU device would need to spend a large amount of time transferring it back to the host end, so only the encoded check-bit part in the high-speed on-chip cache is carried to the global memory, using the coded-information compression method, which saves the transfer time; when remainder is 0, Zc is an integer multiple of 32, and the bits cached on chip are copied contiguously to the global memory; when remainder is not 0, Zc is not an integer multiple of 32, the bits filled after the first Zc bits of each sub-group need to be deleted so that each sub-group contains only its first Zc bits, and all bits are then shifted and merged to restore a tight bit arrangement; the tightly arranged check information bits are then packed into groups of 32 bits, the tail that does not fill a complete 32 bits is padded with 0, and the packed data are written to the position indicated by dst_offset.