CN114978200B - High-throughput large-bandwidth general channelized GPU algorithm

High-throughput large-bandwidth general channelized GPU algorithm

Info

Publication number
CN114978200B
Authority
CN
China
Prior art keywords
signal
calculation
module
processing
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210894341.2A
Other languages
Chinese (zh)
Other versions
CN114978200A (en)
Inventor
韩周安
张文权
黄建
王波
于延辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Piao Technology Co ltd
Original Assignee
Chengdu Piao Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Piao Technology Co ltd filed Critical Chengdu Piao Technology Co ltd
Priority to CN202210894341.2A priority Critical patent/CN114978200B/en
Publication of CN114978200A publication Critical patent/CN114978200A/en
Application granted granted Critical
Publication of CN114978200B publication Critical patent/CN114978200B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B1/00Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
    • H04B1/0003Software-defined radio [SDR] systems, i.e. systems wherein components typically implemented in hardware, e.g. filters or modulators/demodulators, are implemented using software, e.g. by involving an AD or DA conversion stage such that at least part of the signal processing is performed in the digital domain
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B1/00Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
    • H04B1/06Receivers
    • H04B1/16Circuits
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B1/00Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
    • H04B1/06Receivers
    • H04B1/16Circuits
    • H04B1/18Input circuits, e.g. for coupling to an antenna or a transmission line
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a high-throughput, large-bandwidth general channelized GPU algorithm based on the CUDA computing platform, comprising the following steps: S1: use a calculation initialization module to pre-allocate the storage space for the input, output, and intermediate variables required in the channelization process; S2: use a parameter configuration module to set the relevant parameters of the target narrowband signals and, according to those parameters, configure the storage resources allocated in the calculation initialization module; S3: use a core function module to execute the actual channelized calculation, allocate corresponding computing resources to each narrowband signal path according to step S2, and start the execution of each sub-function module in the channelization flow. The invention maintains high computational performance and meets the real-time processing requirements of the relevant application scenarios. It can be deployed and configured on multiple generations of CUDA-capable GPU hardware, and its performance adaptively matches the computing capability of the GPU hardware, supporting flexible configuration of channelization scenarios.

Description

High-throughput large-bandwidth general channelized GPU algorithm
Technical Field
The invention relates to the technical field of communications, and in particular to a high-throughput, large-bandwidth general channelized GPU algorithm.
Background
Channelization is one of the key technologies in a digital wideband receiver. It extracts one or more mutually independent target narrowband signals contained in a digital wideband sampled signal, and involves processing stages such as digital down-conversion (DDC), filtering, and sample rate conversion (SRC).
With the continuous progress of modern communication technology, the input bandwidth of the digital wideband sampled signal in channelization services keeps growing, as does the number of target narrowband signals to be processed, so the resources consumed by early channelization platforms built from dedicated hardware modules have become increasingly unacceptable. Moreover, the flexible and changing functional requirements of modern software-defined radio push channelization toward software implementations, which both reduces system design complexity and makes later functional adjustment more flexible and convenient.
The demand of large-scale scientific research and engineering applications for computing performance keeps increasing, and over the past decade computer technology has evolved from "multi-core" to "many-core" architectures; heterogeneous parallel acceleration platforms combining a CPU (central processing unit) and a GPU (graphics processing unit) in particular are ever more widely applied.
A GPU has hundreds of processing cores, memory access bandwidth far higher than a CPU's, and theoretical peak floating-point performance orders of magnitude above a CPU's. With the emergence of various heterogeneous programming languages, general-purpose parallel programming on the GPU has gradually reached mainstream developers, and the GPU has evolved from a dedicated graphics accelerator into a general-purpose computing accelerator. Its distinctive architecture is specifically optimized for massively data-parallel scenarios in the SIMD (single instruction, multiple data) and SPMD (single program, multiple data) styles. In particular, the CUDA (Compute Unified Device Architecture) computing platform and programming model introduced by NVIDIA sparked a sweeping computing revolution in the computer industry, and the GPU has driven the accelerated development and ecological prosperity of many industries such as HPC, AI (artificial intelligence), autonomous driving, finance, and healthcare.
Modern radio communication and digital signal processing keep advancing toward higher performance and greater sophistication, and the computational bottleneck at the front end of the digital wideband receiver grows ever more pronounced, so developing a channelized GPU algorithm has extremely important practical significance.
The prior art has the following defects:
Channelization solutions implemented on traditional CPUs (single-core or multi-core) are flexible, easy to use, and cheap to develop, but the CPU's limited computing power makes them slow and poorly suited to real-time operation, so they cannot satisfy channelization services under heavy computational load;
Channelization platform solutions based on dedicated hardware (such as ASICs and FPGAs) can meet real-time performance requirements, but their development cost is high, their development cycle long, and their resource overhead large, so they cannot cope with the harsh constraints of practical application scenarios;
As existing communication application processing systems integrate ever richer functions and ever more flexible, changing services, the related functional modules must be adjusted frequently, a requirement that schemes based on dedicated hardware cannot meet.
Disclosure of Invention
To address these problems, the invention provides a high-throughput, large-bandwidth general channelized GPU algorithm implemented on the CUDA computing platform and programming model, which accelerates the processing load of channelization tasks by equipping the system platform with an NVIDIA GPU.
The invention adopts the following technical scheme:
a high-throughput large-bandwidth general channelized GPU algorithm comprises the following steps:
S1: use a calculation initialization module to set the maximum number of supported parallel channels and the signal length of a single processing pass, and then, according to these parameter values, pre-allocate the storage space for the input, output, and intermediate variables required in channelization processing, the storage space comprising a host memory area and a device memory area;
S2: use a parameter configuration module to set the relevant parameters of the target narrowband signals and, according to those parameters, configure the storage resources allocated in the calculation initialization module, the relevant parameters comprising center frequency, bandwidth, output sampling rate, and target gain, and the configuration processing comprising a batch configuration mode before calculation and a dynamic configuration mode during calculation;
S3: use a core function module to execute the actual channelized calculation, allocate corresponding computing resources to each narrowband signal path according to step S2, and start the execution of each sub-function module in the channelization flow, the computing resources comprising the thread grid and thread blocks in CUDA; the sub-function modules of step S3 comprise a mixing sub-module, a half-band filtering sub-module, a resampling sub-module, and a low-pass filtering sub-module, with the following specific operation steps:
S301: transmit the wideband input signal data from the host memory buffer to the corresponding device memory buffer, the host memory buffer using page-locked memory to speed up data transfer;
S302: perform a heterodyne mixing operation on the signal data in the device memory buffer of step S301 according to each signal path's mixing factor, wherein, to improve performance, the mixing factor for each path is computed in advance when the parameter configuration module sets that path's signal information;
S303: perform half-band filtering, taking the heterodyne mixing intermediate result of each signal path in step S302 as input;
S304: perform resampling, taking the half-band filtering intermediate result of each signal path in step S303 as input;
S305: perform low-pass filtering, taking the resampling intermediate result of each signal path in step S304 as input;
S306: transmit the low-pass filtering result of each signal path in step S305 from the corresponding device memory buffer to the host memory buffer as the final channelization result.
Preferably, the storage space of step S1 is organized as follows: one large contiguous storage space is allocated for the input, output, and intermediate results of all narrowband signals, in which each signal path in turn occupies a segment of size dataStep; the input signal length countIn and output signal length countOut of each functional module in the channelization service flow never exceed dataStep, which prevents data collisions and pollution between the narrowband signal paths when the algorithm processes data in parallel.
Preferably, the calculation initialization module is executed only once for the set parameters, and the storage resources it allocates are reused across subsequent calls of the core function module.
Preferably, in the scenario where a narrowband signal parameter is dynamically added during calculation, the parameter configuration module adopts a lock synchronization mechanism: only after the parameter configuration module has finished setting and released the lock can the core function module safely access the dynamically updated signal parameter information.
Preferably, the data in the memory buffers is overwritten on every call of the core function module, so any subsequent calculation on, or copy of, the output result must be completed before the core function module is called again.
The beneficial effects of the invention are: by exploiting the GPU's strong parallel data processing capability, high computational performance is maintained even in channelization scenarios with high throughput, large bandwidth, and a large total number of target narrowband signals, meeting the real-time processing requirements of the relevant application scenarios. In addition, thanks to the compatibility and extensibility of the CUDA programming model, the invention can be deployed and configured on multiple generations of CUDA-capable GPU hardware, and the algorithm's performance adaptively matches the computing capability of the GPU hardware, supporting flexible configuration of channelization scenarios: low-cost GPU hardware for simple scenarios and higher-end GPU hardware for complex ones.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description only relate to some embodiments of the present invention and are not limiting on the present invention.
FIG. 1 is a block diagram of the system architecture of the present invention;
FIG. 2 is an organization of the device memory space allocated inside the computing initialization module according to the present invention;
FIG. 3 is a diagram illustrating the configuration of internal signals and corresponding storage resources of a parameter configuration module according to the present invention;
FIG. 4 is a diagram illustrating an organization of computing resources by threads within a core function module according to the present invention;
fig. 5 is a block diagram of signal processing flow of each path inside the core function module according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of the word "comprising" or "comprises", and the like, in this disclosure is intended to mean that the elements or items listed before that word, include the elements or items listed after that word, and their equivalents, without excluding other elements or items. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
The invention is further illustrated with reference to the following figures and examples.
As shown in fig. 1, a high-throughput, large-bandwidth general channelized GPU algorithm is implemented on the CUDA computing platform and programming model, and accelerates the processing load of channelization tasks by equipping the system platform with an NVIDIA GPU. In this embodiment, the following steps are executed on two platforms, a GTX1660 and an RTX3060 Ti:
S1: use the calculation initialization module to set parameter values such as the maximum number of supported parallel channels and the signal length of a single processing pass, and then, according to these values, pre-allocate the storage space for the input, output, and intermediate variables required in the channelization process; the storage space comprises a host memory area and a device memory area, organized as shown in fig. 2: one large contiguous storage space is allocated for the input, output, and intermediate results of all narrowband signals, in which each signal path in turn occupies a segment of size dataStep, and the input signal length countIn and output signal length countOut of each functional module in the channelization service flow never exceed dataStep, so that no data collisions or pollution arise between the narrowband signal paths when the algorithm processes data in parallel.
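For illustration only, here is a minimal sketch of how such a strided device layout could be pre-allocated once with the CUDA runtime API; the structure and names (ChannelizerBuffers, initBuffers, maxChannels, dataStep as a sample count) are hypothetical and not taken from the patent text.

    // Sketch: one contiguous device buffer per stage, partitioned into
    // per-path segments of dataStep complex samples (hypothetical layout).
    #include <cuda_runtime.h>
    #include <cuComplex.h>

    struct ChannelizerBuffers {
        cuFloatComplex* dIn;   // device input area
        cuFloatComplex* dTmp;  // device intermediate results
        cuFloatComplex* dOut;  // device output area
    };

    // Allocate once at initialization; segment i starts at offset i * dataStep,
    // and every stage keeps countIn/countOut <= dataStep to stay in its segment.
    cudaError_t initBuffers(ChannelizerBuffers* b, int maxChannels, size_t dataStep) {
        size_t bytes = (size_t)maxChannels * dataStep * sizeof(cuFloatComplex);
        cudaError_t e;
        if ((e = cudaMalloc(&b->dIn,  bytes)) != cudaSuccess) return e;
        if ((e = cudaMalloc(&b->dTmp, bytes)) != cudaSuccess) return e;
        return cudaMalloc(&b->dOut, bytes);
    }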
S2: use the parameter configuration module to set the relevant parameters of the target narrowband signals and, according to those parameters, configure the storage resources allocated in the calculation initialization module; the relevant parameters comprise center frequency, bandwidth, output sampling rate, and target gain, and the configuration processing comprises a batch configuration mode before calculation and a dynamic configuration mode during calculation, as shown in fig. 3:
Batch configuration before calculation means that the parameter information of all target narrowband signals is known and set in advance, before the wideband input signal data undergoes channelization processing: the parameters of all target narrowband signals are configured before the core function module is called for DDC service processing, and, in the order in which they were added during batch configuration, the signal paths correspond one-to-one with their storage spaces;
Dynamic configuration during calculation is the parameter configuration mode in which the wideband input data stream is already undergoing channelization but narrowband signals must be added or deleted with immediate effect; it comprises dynamic deletion and dynamic addition during calculation. As the figure shows, in this mode every added signal has a globally unique number, and a dynamic mapping is established between it and a storage space region identifier. On dynamic deletion, the number of the storage region mapped to the deleted signal is looked up and the region is set to the idle state, so that on later calls of the core function module that signal path is allocated no thread computing resources. When multiple signals have been deleted, the module manages the idle storage regions in ascending order of identifier; when a target narrowband signal is dynamically added later, the idle storage region identifier at the head of the queue is preferentially mapped to the signal and its idle-state flag is cleared, so that on later calls of the core function module the signal is allocated corresponding thread computing resources and its DDC service processing begins.
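A host-side sketch of the number-to-slot bookkeeping described above, kept deliberately simple; the type and method names (SignalTable, add, remove) are hypothetical, and the patent does not prescribe these data structures.

    // Sketch: globally unique signal number -> storage slot mapping, plus the
    // idle slots managed in ascending order of identifier.
    #include <map>
    #include <set>

    struct SignalTable {
        std::map<long, int> slotOf;  // signal number -> storage region id
        std::set<int> freeSlots;     // idle regions, ascending order

        explicit SignalTable(int maxChannels) {
            for (int s = 0; s < maxChannels; ++s) freeSlots.insert(s);
        }
        // Dynamic addition: map the signal to the idle region at the head of
        // the queue and clear that region's idle state.
        bool add(long signalId, int* slot) {
            if (freeSlots.empty()) return false;
            *slot = *freeSlots.begin();
            freeSlots.erase(freeSlots.begin());
            slotOf[signalId] = *slot;
            return true;
        }
        // Dynamic deletion: look up the mapped region by signal number and set
        // it idle, so later kernel calls assign it no thread resources.
        bool remove(long signalId) {
            auto it = slotOf.find(signalId);
            if (it == slotOf.end()) return false;
            freeSlots.insert(it->second);
            slotOf.erase(it);
            return true;
        }
    };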
S3: the core function module executes the actual channelized calculation, allocating corresponding computing resources to each narrowband signal path according to step S2 and starting the execution of each sub-function module in the channelization flow; the computing resources comprise the thread grid and thread blocks in CUDA (Compute Unified Device Architecture). As shown in fig. 4, the channelized parallel processing of the multiple narrowband signal paths is organized as a two-dimensional thread grid, and, since one-dimensional signals are being processed, the thread blocks are chosen to be one-dimensional. All thread blocks with the same y-direction index (blockIdx.y) in the two-dimensional grid handle the channelization task of the signal mapped to the corresponding (non-idle) storage space region identifier, while thread blocks with different x-direction indices (blockIdx.x) handle data points in different periods of that signal.
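A minimal sketch of the corresponding launch configuration; the helper name stageGrid, the 256-thread block size, and the kernel name in the usage comment are assumptions, not values given in the patent.

    #include <cuda_runtime.h>

    // Returns the 2-D grid for one channelizer stage: grid.y indexes the
    // signal path (storage slot), grid.x indexes chunks of samples, and each
    // 1-D block covers threadsPerBlock samples.
    dim3 stageGrid(int numSignals, int samplesPerSignal, int threadsPerBlock) {
        return dim3((samplesPerSignal + threadsPerBlock - 1) / threadsPerBlock,
                    numSignals);
    }
    // Usage with a hypothetical kernel: inside it, blockIdx.y selects the
    // signal path and blockIdx.x * blockDim.x + threadIdx.x the sample index.
    //   dim3 grid = stageGrid(numSignals, samplesPerSignal, 256);
    //   mixKernel<<<grid, dim3(256)>>>(/* ... */);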
The sub-function modules of step S3, as shown in fig. 5, comprise a mixing sub-module, a half-band filtering sub-module, a resampling sub-module, and a low-pass filtering sub-module, with the following specific operation steps:
S301: the wideband input signal data is transmitted from the host memory buffer to the corresponding device memory buffer; to speed up data transfer, the host memory buffer uses page-locked memory. Page-locked memory is allocated on the host by the CUDA function cudaHostAlloc; its key property is that the host operating system will not page or swap it out, guaranteeing that it always resides in physical memory;
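A short sketch of that page-locked allocation; cudaHostAlloc and cudaFreeHost are the CUDA runtime calls the text refers to, while the helper name and complex sample type are illustrative assumptions.

    #include <cuda_runtime.h>
    #include <cuComplex.h>

    // Allocates the host staging buffer as pinned (page-locked) memory, so the
    // host OS never pages or swaps it and host<->device copies run at full speed.
    cuFloatComplex* allocPinnedInput(size_t sampleCount) {
        cuFloatComplex* hIn = nullptr;
        cudaError_t e = cudaHostAlloc(&hIn, sampleCount * sizeof(cuFloatComplex),
                                      cudaHostAllocDefault);
        return (e == cudaSuccess) ? hIn : nullptr;  // release with cudaFreeHost
    }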
S302: a heterodyne mixing operation is performed on the signal data in the device memory buffer of step S301 according to each signal path's mixing factor; to improve performance, the mixing factor for each path is computed in advance when the parameter configuration module sets that path's signal information. After mixing, the spectrum of each target narrowband signal is shifted to baseband, i.e. the signal's center frequency becomes 0;
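A minimal kernel sketch of this per-path mixing step, assuming the strided segment layout of fig. 2 and a precomputed per-path table of complex mixing factors; all names and the exact factor layout are hypothetical.

    #include <cuda_runtime.h>
    #include <cuComplex.h>

    // wideIn: shared wideband input samples; mix: precomputed factors (e.g.
    // exp(-j*2*pi*fc*n/fs)) stored per path with stride dataStep; out: each
    // path's baseband-shifted result in its own segment.
    __global__ void mixKernel(const cuFloatComplex* wideIn,
                              const cuFloatComplex* mix, cuFloatComplex* out,
                              size_t dataStep, int count) {
        int n = blockIdx.x * blockDim.x + threadIdx.x;  // sample index
        size_t base = (size_t)blockIdx.y * dataStep;    // this path's segment
        if (n < count)
            out[base + n] = cuCmulf(wideIn[n], mix[base + n]);  // fc -> 0
    }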
S303: half-band filtering is performed, taking the heterodyne mixing intermediate result of each signal path in step S302 as input; according to the relationship between each path's output sampling rate and the original input sampling rate, several half-band filtering rounds are configured per path. Because the half-band filter coefficients are essentially fixed, they are stored in the "constant memory" of the CUDA programming model, reducing the excessive global-memory accesses (long-latency, high-cost operations in the CUDA model) that would otherwise occur during kernel computation;
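An illustrative half-band decimating kernel with the coefficients held in CUDA constant memory; the tap count, the centered indexing, and one 2:1 decimation per round are assumptions, since the patent fixes neither the filter length nor the round structure.

    #include <cuda_runtime.h>
    #include <cuComplex.h>

    #define HB_TAPS 15
    // Fixed coefficients live in constant memory (uploaded once from the host
    // with cudaMemcpyToSymbol), avoiding repeated global-memory reads.
    __constant__ float cHbCoef[HB_TAPS];

    __global__ void halfBandKernel(const cuFloatComplex* in, cuFloatComplex* out,
                                   size_t dataStep, int countIn) {
        int m = blockIdx.x * blockDim.x + threadIdx.x;  // output sample index
        size_t base = (size_t)blockIdx.y * dataStep;
        int countOut = countIn / 2;                     // one 2:1 round
        if (m >= countOut) return;
        cuFloatComplex acc = make_cuFloatComplex(0.f, 0.f);
        for (int k = 0; k < HB_TAPS; ++k) {
            int n = 2 * m - k + HB_TAPS / 2;            // centered taps
            if (n >= 0 && n < countIn) {
                cuFloatComplex s = in[base + n];
                acc = cuCaddf(acc, make_cuFloatComplex(cHbCoef[k] * cuCrealf(s),
                                                       cHbCoef[k] * cuCimagf(s)));
            }
        }
        out[base + m] = acc;
    }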
S304: resampling is performed, taking the half-band filtering intermediate result of each signal path in step S303 as input; the resampling sub-module interpolates/decimates the intermediate result according to a pre-computed factor, and after resampling each path's output signal satisfies its output sampling rate;
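The patent does not state the interpolation method, so the following sketch assumes simple linear interpolation at a precomputed per-path rate ratio, purely for illustration.

    #include <cuda_runtime.h>
    #include <cuComplex.h>

    // ratio[p] = input rate / output rate of path p, precomputed on the host.
    __global__ void resampleKernel(const cuFloatComplex* in, cuFloatComplex* out,
                                   const float* ratio, size_t dataStep,
                                   int countIn, int countOut) {
        int m = blockIdx.x * blockDim.x + threadIdx.x;  // output sample index
        size_t base = (size_t)blockIdx.y * dataStep;
        if (m >= countOut) return;
        float pos = m * ratio[blockIdx.y];              // fractional input position
        int n = (int)pos;
        float t = pos - n;
        if (n + 1 >= countIn) { out[base + m] = in[base + countIn - 1]; return; }
        cuFloatComplex a = in[base + n], b = in[base + n + 1];
        out[base + m] = make_cuFloatComplex(
            cuCrealf(a) + t * (cuCrealf(b) - cuCrealf(a)),
            cuCimagf(a) + t * (cuCimagf(b) - cuCimagf(a)));
    }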
S305: low-pass filtering is performed, taking the resampling intermediate result of each signal path in step S304 as input; the low-pass filtering sub-module convolves the intermediate result with the coefficients of an FIR low-pass filter designed from preset parameters, using a shared-memory mechanism for performance optimization. After low-pass filtering, all spectral components other than the corresponding target narrowband signal are removed from each path's output, which then contains only the complete spectrum of the target narrowband signal;
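An illustrative shared-memory FIR low-pass kernel in the spirit of this step; the tile size, tap count, and the launch assumption of one 256-thread block per tile are hypothetical.

    #include <cuda_runtime.h>
    #include <cuComplex.h>

    #define LP_TAPS 63
    #define TILE    256   // launch with blockDim.x == TILE
    __constant__ float cLpCoef[LP_TAPS];   // FIR coefficients, uploaded once

    __global__ void lowPassKernel(const cuFloatComplex* in, cuFloatComplex* out,
                                  size_t dataStep, int count) {
        // Stage the tile plus the left history the FIR needs, so each input
        // sample is read from global memory once instead of LP_TAPS times.
        __shared__ cuFloatComplex tile[TILE + LP_TAPS - 1];
        size_t base = (size_t)blockIdx.y * dataStep;
        int n0 = blockIdx.x * TILE;                    // first output of block
        for (int i = threadIdx.x; i < TILE + LP_TAPS - 1; i += blockDim.x) {
            int n = n0 + i - (LP_TAPS - 1);
            tile[i] = (n >= 0 && n < count) ? in[base + n]
                                            : make_cuFloatComplex(0.f, 0.f);
        }
        __syncthreads();
        int m = n0 + threadIdx.x;
        if (m < count) {
            cuFloatComplex acc = make_cuFloatComplex(0.f, 0.f);
            for (int k = 0; k < LP_TAPS; ++k) {        // y[m] = sum h[k] x[m-k]
                cuFloatComplex s = tile[threadIdx.x + (LP_TAPS - 1) - k];
                acc = cuCaddf(acc, make_cuFloatComplex(cLpCoef[k] * cuCrealf(s),
                                                       cLpCoef[k] * cuCimagf(s)));
            }
            out[base + m] = acc;
        }
    }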
S306: the low-pass filtering result of each signal path in step S305 is transmitted from the corresponding device memory buffer to the host memory buffer as the final channelization result. Because the target narrowband output sampling rates differ between paths, the final output lengths differ from path to path; directly transferring the whole memory interval of fig. 2 from device to host would inevitably move a large amount of invalid data. To improve transfer efficiency, the core module therefore performs additional internal processing: the per-path channelization results are repacked into a head-to-tail contiguous form and then copied back to the host memory buffer, and the caller uses returned parameters such as result length and data start-address offset to index the corresponding host memory buffer and access each path's final channelization result.
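A sketch of that head-to-tail repacking and single copy-back, assuming precomputed per-path result lengths and prefix-sum offsets (hypothetical bookkeeping names).

    #include <cuda_runtime.h>
    #include <cuComplex.h>

    // outLen[p]: valid output samples of path p; offset[p]: exclusive prefix
    // sum of outLen, i.e. where path p starts in the packed buffer. The host
    // later indexes results by these lengths and start-address offsets.
    __global__ void packResults(const cuFloatComplex* out, cuFloatComplex* packed,
                                const int* outLen, const size_t* offset,
                                size_t dataStep) {
        int p = blockIdx.y;
        for (int m = blockIdx.x * blockDim.x + threadIdx.x;
             m < outLen[p]; m += gridDim.x * blockDim.x)
            packed[offset[p] + m] = out[(size_t)p * dataStep + m];
    }

    // One contiguous device-to-host copy of only the valid data; hOut must be
    // the page-locked host buffer.
    void copyBack(cuFloatComplex* hOut, const cuFloatComplex* dPacked,
                  size_t totalSamples, cudaStream_t s) {
        cudaMemcpyAsync(hOut, dPacked, totalSamples * sizeof(cuFloatComplex),
                        cudaMemcpyDeviceToHost, s);
    }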
The calculation initialization module needs to execute only once for the set parameters, and the storage resources it allocates are reused across subsequent calls of the core function module.
In the scenario where a narrowband signal parameter is dynamically added during calculation, the parameter configuration module adopts a lock synchronization mechanism: only after the parameter configuration module has finished setting and released the lock can the core function module safely access the dynamically updated signal parameter information.
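A minimal host-side sketch of such a set-then-unlock discipline using a standard mutex; the function names are hypothetical, and the patent does not specify the lock primitive.

    #include <mutex>

    std::mutex gParamLock;  // guards the dynamically updated signal parameters

    void configureSignal(/* new center frequency, bandwidth, rate, gain ... */) {
        std::lock_guard<std::mutex> guard(gParamLock);  // set, then unlock
        // ... update the parameter table and slot mapping here ...
    }

    void runCoreFunction(/* ... */) {
        std::lock_guard<std::mutex> guard(gParamLock);  // safe read after unlock
        // ... snapshot parameters, then launch the channelization kernels ...
    }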
The data in the memory buffers is overwritten on every call of the core function module, so any subsequent calculation on, or copy of, the output result must be completed before the core function module is called again.
The execution results of the two execution platforms GTX1660 and RTX3060Ti are shown in the following table:
[Table: channelization performance results on GTX1660 and RTX3060Ti; table image not reproduced in text]
As the table shows, under scenarios with different numbers of processing paths the method's performance indexes on channelized task loads are all satisfactory, and the algorithm's performance grows with the computing capability of the GPU hardware platform.
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A high-throughput, large-bandwidth general channelized GPU algorithm, characterized in that it is based on a CUDA computing platform and comprises the following steps:
S1: use a calculation initialization module to pre-allocate the storage space for the input, output, and intermediate variables required in channelization processing, the storage space comprising a host memory area and a device memory area;
the storage space is organized as follows: a contiguous storage space is allocated for the input, output, and intermediate results of all narrowband signals, in which each signal path in turn occupies a segment of size dataStep, ensuring that the input signal length countIn and output signal length countOut of each functional module in the channelization service flow do not exceed dataStep;
S2: use a parameter configuration module to set the relevant parameters of the target narrowband signals and, according to those parameters, configure the storage resources allocated in the calculation initialization module, wherein the relevant parameters comprise center frequency, bandwidth, output sampling rate, and target gain, and the configuration processing comprises a batch configuration mode before calculation and a dynamic configuration mode during calculation;
the dynamic configuration during calculation comprises dynamically deleting and dynamically adding narrowband signals during calculation, wherein each dynamically added signal has a globally unique number and a dynamic mapping is established between it and a storage space region identifier; on dynamic deletion, the number of the storage region mapped to the deleted signal is looked up by the deleted signal's number and the region is set to the idle state; when multiple signals have been deleted, the module manages the idle storage regions in ascending order of identifier; and when a target narrowband signal is dynamically added later, the idle storage region identifier at the head of the queue is preferentially mapped to the signal and its idle-state flag is cleared;
S3: use a core function module to execute the actual channelized calculation, allocate corresponding computing resources to each narrowband signal path according to step S2, and start the execution of each sub-function module in the channelization flow, wherein the computing resources comprise the thread grid and thread blocks in CUDA.
2. The high-throughput, large-bandwidth general channelized GPU algorithm according to claim 1, characterized in that the sub-function modules of step S3 comprise a mixing sub-module, a half-band filtering sub-module, a resampling sub-module, and a low-pass filtering sub-module, with the following specific operation steps:
S301: transmit the wideband input signal data from the host memory buffer to the corresponding device memory buffer, the host memory buffer using page-locked memory;
S302: perform a heterodyne mixing operation on the signal data in the device memory buffer of step S301 according to each signal path's mixing factor, wherein the mixing factor for each path is computed in advance when the parameter configuration module sets that path's signal information;
S303: perform half-band filtering, taking the heterodyne mixing intermediate result of each signal path in step S302 as input;
S304: perform resampling, taking the half-band filtering intermediate result of each signal path in step S303 as input;
S305: perform low-pass filtering, taking the resampling intermediate result of each signal path in step S304 as input;
S306: transmit the low-pass filtering result of each signal path in step S305 from the corresponding device memory buffer to the host memory buffer as the final channelization result.
3. The high-throughput, large-bandwidth general channelized GPU algorithm according to claim 1, characterized in that the calculation initialization module needs to execute only once for the set parameters, and the storage resources it allocates are reused across subsequent calls of the core function module.
4. The high-throughput, large-bandwidth general channelized GPU algorithm according to claim 1, characterized in that, in the scenario where a narrowband signal parameter is dynamically added during calculation, the parameter configuration module adopts a lock synchronization mechanism, and only after the parameter configuration module has finished setting and released the lock can the core function module safely access the dynamically updated signal parameter information.
5. The high-throughput, large-bandwidth general channelized GPU algorithm according to claim 1, characterized in that any further calculation on, or copy of, the output result of the core function module is performed before its next call.
CN202210894341.2A 2022-07-28 2022-07-28 High-throughput large-bandwidth general channelized GPU algorithm Active CN114978200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210894341.2A CN114978200B (en) 2022-07-28 2022-07-28 High-throughput large-bandwidth general channelized GPU algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210894341.2A CN114978200B (en) 2022-07-28 2022-07-28 High-throughput large-bandwidth general channelized GPU algorithm

Publications (2)

Publication Number Publication Date
CN114978200A CN114978200A (en) 2022-08-30
CN114978200B true CN114978200B (en) 2022-10-21

Family

ID=82969602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210894341.2A Active CN114978200B (en) 2022-07-28 2022-07-28 High-throughput large-bandwidth general channelized GPU algorithm

Country Status (1)

Country Link
CN (1) CN114978200B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109617631A (en) * 2018-12-28 2019-04-12 华航高科(北京)技术有限公司 Reconnaissance system adaptive reception method based on the measurement of digital channelizing instantaneous parameters
CN114172527A (en) * 2021-12-13 2022-03-11 武汉中元通信股份有限公司 Digital channelized parallel receiving method and device, electronic equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8213550B2 (en) * 2009-03-23 2012-07-03 Lockheed Martin Corporation Wideband digital receiver with integrated dynamic narrowband channelization and analysis
US11436525B2 (en) * 2017-12-01 2022-09-06 Deepwave Digital, Inc. Artificial intelligence radio transceiver
CN111552559B (en) * 2020-04-07 2023-02-10 上海大学 Broadband signal DDC system design method based on GPU
CN111682880B (en) * 2020-04-17 2021-04-23 中国人民解放军战略支援部队航天工程大学 GPU-based streaming architecture broadband signal digital down-conversion system
CN111683111B (en) * 2020-04-17 2021-04-06 中国人民解放军战略支援部队航天工程大学 Interferometry multi-phase channelization baseband conversion system based on GPU
CN111786688B (en) * 2020-06-16 2021-12-03 重庆邮电大学 Broadband parallel channelization receiving method based on embedded GPU
CN113660046B (en) * 2021-08-17 2022-11-11 东南大学 Method for accelerating generation of large-scale wireless channel coefficients

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109617631A (en) * 2018-12-28 2019-04-12 华航高科(北京)技术有限公司 Reconnaissance system adaptive reception method based on the measurement of digital channelizing instantaneous parameters
CN114172527A (en) * 2021-12-13 2022-03-11 武汉中元通信股份有限公司 Digital channelized parallel receiving method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Electromagnetic optimization of MIMO channels using GPUs";Alfonso Breglia等;《 2017 International Applied Computational Electromagnetics Society Symposium - Italy (ACES)》;20170504;全文 *

Also Published As

Publication number Publication date
CN114978200A (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN107679621B (en) Artificial neural network processing device
CN101950282B (en) Multiprocessor system and synchronous engine thereof
US6199093B1 (en) Processor allocating method/apparatus in multiprocessor system, and medium for storing processor allocating program
US20110185358A1 (en) Parallel query engine with dynamic number of workers
CN108021449B (en) Coroutine implementation method, terminal equipment and storage medium
US20230244537A1 (en) Efficient gpu resource allocation optimization method and system
CN103793255B (en) Starting method for configurable multi-main-mode multi-OS-inner-core real-time operating system structure
CN103019838B (en) Multi-DSP (Digital Signal Processor) platform based distributed type real-time multiple task operating system
US20150095911A1 (en) Method and system for dedicating processors for desired tasks
US9164799B2 (en) Multiprocessor system
CN110377406A (en) A kind of method for scheduling task, device, storage medium and server node
US11347546B2 (en) Task scheduling method and device, and computer storage medium
CN107562528B (en) Unitized on-demand computing method supporting multiple computing frameworks and related device
CN114978200B (en) High-throughput large-bandwidth general channelized GPU algorithm
JP2002530738A (en) Scheduling of processing systems
CN112395056B (en) Embedded asymmetric real-time system and electric power secondary equipment
CN113359134A (en) SAR data distributed real-time imaging processing system and method based on embedded GPU
CN115904510B (en) Processing method of multi-operand instruction, graphic processor and storage medium
CN109840306A (en) One kind being based on recursive parallel FFT communication optimization method and system
JP2743865B2 (en) Job scheduling method
CN112732634B (en) ARM-FPGA (advanced RISC machine-field programmable gate array) cooperative local dynamic reconstruction processing method for edge calculation
CN107329818A (en) A kind of task scheduling processing method and device
CN108304343A (en) A kind of chip-on communication method of complexity SOC
JPH02122365A (en) Processor assignment system
CN116700999B (en) Data processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant