CN113704520A - Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment - Google Patents


Info

Publication number
CN113704520A
CN113704520A (application CN202111252339.7A)
Authority
CN
China
Prior art keywords
cuda
kernel function
parallel
target coordinate
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111252339.7A
Other languages
Chinese (zh)
Other versions
CN113704520B (en)
Inventor
王浩
杨烟台
尹桂信
张天昊
傅春连
周晨磊
张玉晖
宋明武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
Original Assignee
Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center filed Critical Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
Priority to CN202111252339.7A priority Critical patent/CN113704520B/en
Publication of CN113704520A publication Critical patent/CN113704520A/en
Application granted granted Critical
Publication of CN113704520B publication Critical patent/CN113704520B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F16/50 of still image data
              • G06F16/51 Indexing; Data structures therefor; Storage structures
              • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
                • G06F16/583 using metadata automatically derived from the content
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
              • G06N3/04 Architecture, e.g. interconnection topology
                • G06N3/045 Combinations of networks
              • G06N3/08 Learning methods


Abstract

The application discloses a method, an apparatus, an electronic device, and a medium for accelerating the processing of Anchor-based data in parallel by using CUDA. By applying the technical scheme of the application, the output values of the single-target tracking network model can be handed directly to kernel functions on the GPU graphics card, so that all subsequent computing tasks are processed in parallel by those kernel functions. This avoids the low processing efficiency of the prior art, in which the model output parameters must first be copied to the CPU for calculation.

Description

Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment
Technical Field
The present application relates to data communication technologies, and in particular, to a method, an apparatus, an electronic device, and a medium for accelerating Anchor-based data processing by using cuda in parallel.
Background
Single-target tracking algorithms based on deep Convolutional Neural Networks (CNNs) have made breakthrough progress in the field of computer vision. They fall into two types, Anchor-based and Anchor-free, according to whether preset bounding boxes are required, and the accuracy of many such algorithms, for example DaSiamRPN, SiamRPN++ and SiamFC++, has gradually reached industrial and production standards.
While pursuing accuracy, the industry is also concerned with increasing the front-end inference speed of these algorithms. Mainstream front-end inference frameworks such as Libtorch, TensorRT and OpenCV focus on optimizing the forward inference speed of the algorithm model's backbone network, mostly using the GPU to accelerate execution, and thus provide low-latency, high-throughput deployment inference for deep learning algorithms.
The Anchor-based single-target tracking algorithm improves performance by adding a large number of preset bounding boxes, which in turn increases the amount of data copied from the GPU device to the host. How to accelerate the computation of the single-target tracking algorithm and reduce its overall inference time has therefore become a problem to be solved by those skilled in the art.
Disclosure of Invention
The embodiments of the application provide a processing method, an apparatus, an electronic device, and a medium for accelerating Anchor-based data in parallel by using cuda. According to one aspect of the embodiments, a processing method for accelerating Anchor-based data in parallel by using cuda is applied to a GPU graphics card, wherein:
acquiring a first output parameter output by a single-target tracking algorithm network model based on a feature image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient;
generating multi-size preset frames and Hanning window values in parallel in a cuda kernel function, wherein the multi-size preset frames correspond to the size of the feature image;
carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value;
and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image.
Optionally, in another embodiment based on the above method of the present application, the obtaining of a first output parameter output by the single-target tracking algorithm model based on the feature image comprises:
constructing the single-target tracking algorithm network model;
and inputting the feature image and an initial target coordinate value into the single-target tracking algorithm network model to obtain the target coordinate offset value and the confidence coefficient.
Optionally, in another embodiment based on the foregoing method of the present application, the generating of multi-size preset bounding boxes and Hanning window values in parallel in the cuda kernel function comprises:
generating, based on each grid of the feature image, the multi-size preset frames with corresponding shapes and sizes in parallel in the cuda kernel function; and
generating, in parallel in the cuda kernel function, the Hanning window values of the same size as the feature image.
Optionally, in another embodiment based on the foregoing method of the present application, the performing of scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value comprises:
generating, in parallel in the cuda kernel function, a scale penalty coefficient corresponding to the size of the feature image, and applying the scale penalty to the confidence coefficient by using the scale penalty coefficient; and
performing Hanning window processing in parallel in the cuda kernel function to obtain the maximum confidence coefficient and the index value.
Optionally, in another embodiment based on the foregoing method of the present application, the determining of a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image comprises:
carrying out scale transformation on the target coordinate offset value in the cuda kernel function to obtain the target coordinate corresponding to the feature image.
Optionally, in another embodiment based on the method of the present application, the cuda kernel function comprises a preKernel kernel function, a scoreKernel kernel function, and a roiKernel kernel function.
According to another aspect of the embodiments of the present application, there is provided a processing apparatus for accelerating Anchor-based data in parallel by using cuda, the processing apparatus being applied to a GPU graphics card, wherein:
the acquisition module is configured to acquire a first output parameter output by the single-target tracking algorithm network model based on the feature image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient;
the first generation module is configured to generate multi-size preset frames and Hanning window values in parallel in a cuda kernel function, wherein the multi-size preset frames correspond to the size of the feature image;
the second generation module is configured to perform scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value;
and the third generation module is configured to determine a coordinate offset in the cuda kernel function based on the target coordinate offset value, so as to obtain a target coordinate corresponding to the feature image.
According to another aspect of the embodiments of the present application, there is provided an electronic device including:
a memory for storing executable instructions; and
and a processor in communication with the memory, configured to execute the executable instructions so as to complete the operations of any one of the above processing methods for accelerating Anchor-based data in parallel by using cuda.
According to a further aspect of the embodiments of the present application, there is provided a computer-readable storage medium for storing computer-readable instructions, which when executed, perform any one of the above operations of the processing method for accelerating Anchor-based data in parallel by using cuda.
The method is applied to a GPU graphics card and comprises: obtaining a first output parameter output by the single-target tracking algorithm network model based on a feature image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient; generating multi-size preset frames and Hanning window values in parallel in the cuda kernel function, wherein the multi-size preset frames correspond to the size of the feature image; carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain the maximum confidence coefficient and an index value; and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image. By applying the technical scheme of the application, the output values of the single-target tracking network model can be handed to kernel functions on the GPU graphics card, so that all subsequent computing tasks are processed in parallel by those kernel functions. This avoids the low processing efficiency of the prior art, in which the model output parameters must first be copied to the CPU for calculation.
The technical solution of the present application is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
The present application may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flow diagram of a processing method for accelerating Anchor-based data by using cuda in parallel according to the present application;
Fig. 2 is a schematic diagram of feature points in the processing of Anchor-based data accelerated in parallel by using cuda according to the present application;
Fig. 3 is a schematic structural diagram of an electronic device for processing Anchor-based data with cuda parallel acceleration according to the present application;
Fig. 4 is a schematic structural diagram of an electronic device for processing Anchor-based data with cuda parallel acceleration according to the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In addition, the technical solutions of the various embodiments of the present application may be combined with each other, provided that the combination can be realized by a person skilled in the art; when technical solutions are contradictory or cannot be realized, such a combination should be considered absent and outside the protection scope of the present application.
It should be noted that all directional indicators in the embodiments of the present application (such as upper, lower, left, right, front, and rear) are only used to explain the relative positional relationship and motion of components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicator changes accordingly.
A processing method for performing parallel acceleration of Anchor-based data using cuda according to an exemplary embodiment of the present application is described below with reference to fig. 1-2. It should be noted that the following application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
The application also provides a processing method and device for accelerating Anchor-based data in parallel by using cuda, electronic equipment and a medium.
Fig. 1 schematically shows a flow chart of a processing method for accelerating Anchor-based data by using cuda in parallel according to an embodiment of the present application. As shown in fig. 1, the method is applied to a GPU graphics card, wherein:
s101, obtaining a first output parameter output by the single-target tracking algorithm network model based on the characteristic image, wherein the first output parameter comprises a target coordinate deviation value and a confidence coefficient.
In the related art, single-target tracking algorithms based on deep Convolutional Neural Networks (CNNs) have made breakthrough progress in the field of computer vision, and the accuracy of many algorithms has gradually reached the standards of industrialization and commercialization; alongside accuracy, the industry also pursues the front-end inference speed of these algorithms. Deep neural networks have a large number of layers and nodes, so it is very important to consider how to reduce the required memory and computation, especially for edge computing carrier boards with weaker computing performance.
Two approaches are common for accelerating a single-target tracking algorithm. The first relies on mainstream front-end inference frameworks such as Libtorch, TensorRT and OpenCV, which focus on using the GPU to optimize the forward inference speed of the deep neural network backbone, providing low-latency, high-throughput deployment inference for deep learning algorithms. The second is deep neural network compression and acceleration: parameter pruning and sharing, low-rank decomposition, transferred/compact convolutional filters, knowledge distillation, and the like. Methods based on parameter pruning and sharing focus on exploring redundant parts of the model parameters and attempt to remove redundant and unimportant ones. Methods based on low-rank decomposition use matrix/tensor decomposition to estimate the most informative parameters of a deep CNN. Methods based on transferred/compact convolutional filters design convolution filters with special structure to reduce the complexity of storage and calculation. Knowledge distillation learns a distilled model, i.e., trains a more compact neural network to reproduce the output of a large network.
At present, the acceleration of deep neural network models focuses either on accelerating the forward inference process with GPU computing power, or on pruning and lightening the network model through compression and acceleration methods. For the data left in GPU device memory after model inference, the usual subsequent processing is to copy it from the device to the host and then process it on the CPU, which greatly reduces overall processing efficiency.
Further, the method can first construct a convolutional neural network, TracerNet, for the single-target tracking algorithm, which outputs the target coordinate offset values offsets and the confidences scores corresponding to the Anchor Boxes.
Specifically, the single-target tracking algorithm network model may adopt a two-branch neural network structure, comprising a dynamic convolution kernel template branch and a target tracking branch. It should be noted that the network model can be a forward inference network constructed on the Libtorch framework and obtained by PyTorch model conversion, or a network model obtained by converting an ONNX model into a forward inference network constructed on the TensorRT framework.
In one mode, one input of the single-target tracking network model may be a video sequence image, and the other input may be the initial target coordinates or the target coordinates Roi[x, y, w, h] locked in the previous frame. For each frame of image, the output is the target coordinate offset value offsets and the confidence scores corresponding to the preset bounding boxes (Anchor Boxes).
Further, the present application may define this step as offsets, scores = TracerNet(Frame, Roi), which can be executed on the CUDA cores by means of the AI algorithm framework.
S102, generating multi-size preset frames and Hanning window values in parallel in the cuda kernel function, wherein the multi-size preset frames correspond to the size of the feature image.
Specifically, in the embodiment of the present application, a kernel function may further be defined that, for the multi-scale set Scale = [s1, s2, …, sn] corresponding to the feature map Size = [width, height], generates the preset bounding box values cuAnchorBoxes and the Hanning window values cuHanning.
Further, to generate the multi-size preset frames, for each grid cell on the feature map, the Anchor Boxes of different shapes and sizes corresponding to that cell may be generated in parallel in the kernel function.
In one approach, the application can generate Hanning window values of the same size as the feature map in parallel on the CUDA cores. The Hanning window may be generated as:
p_j[i] = 0.5 * (1 − cos(2 * PI * i / (j − 1))),
where 0 ≤ i ≤ j − 1, j ∈ {width, height}, and PI = 3.14159. Further, this kernel function may be defined as cuAnchorBoxes, cuHanning = preKernel(Scale, Size), and the operation is executed in parallel on the CUDA cores.
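As a concrete host-side reference for the window values that preKernel produces in parallel (an illustrative sketch, not the patent's CUDA source; the 2π form is the standard Hann definition, and building the 2-D window as an outer product of the per-axis windows matches the cuHanning construction in the detailed embodiment):

```python
import math

def hanning(n):
    """1-D Hann window of length n: p[i] = 0.5 * (1 - cos(2*pi*i/(n-1)))."""
    if n == 1:
        return [1.0]
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1))) for i in range(n)]

def hanning_2d(width, height):
    """2-D window as the outer product: window[r][c] = p_height[r] * p_width[c]."""
    p_w, p_h = hanning(width), hanning(height)
    return [[p_h[r] * p_w[c] for c in range(width)] for r in range(height)]

win = hanning_2d(19, 19)   # same size as the 19x19 feature map of the embodiment
```

On the GPU each thread would fill one element of the flattened window from its global index; the host loops above play that role. The window is 0 at the borders and peaks at the feature-map center, which is what later biases the confidence toward the previous target position.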
S103, carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain the maximum confidence coefficient and an index value.
In one mode, the embodiment of the application can implement the scale penalty and the Hanning window processing of the confidence coefficient in parallel in the kernel function to obtain the maximum confidence coefficient and the index value index. Specifically, the application generates, in parallel on the CUDA cores, the scale penalty coefficient penalty matching the size of the confidence tensor, and then penalizes the confidence according to this coefficient.
Further, the Hanning window processing of the confidence values can be optimized based on parallel computing, so as to obtain the maximum confidence score and the index value index.
Finally, in the embodiment of the present application, the kernel function may be defined as:
score, index = scoreKernel(cuAnchorBoxes, cuHanning, offsets, scores), and the operations are performed in parallel on the CUDA cores.
S104, determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image.
Further, the embodiment of the present application may compute the coordinate offsets of the target coordinate offset values in parallel, so as to obtain the target coordinate Roi[x, y, w, h]. The kernel function in the embodiment of the present application can be defined as:
Roi = roiKernel(offsets, index), and the operations are performed in parallel on the CUDA cores.
The method is applied to a GPU graphics card and comprises: obtaining a first output parameter output by the single-target tracking algorithm network model based on a feature image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient; generating multi-size preset frames and Hanning window values in parallel in the cuda kernel function, wherein the multi-size preset frames correspond to the size of the feature image; carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain the maximum confidence coefficient and an index value; and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image. By applying the technical scheme of the application, the output values of the single-target tracking network model can be handed to kernel functions on the GPU graphics card, so that all subsequent computing tasks are processed in parallel by those kernel functions. This avoids the low processing efficiency of the prior art, in which the model output parameters must first be copied to the CPU for calculation.
Optionally, in another embodiment based on the method of the present application, obtaining a first output parameter output by the single-target tracking algorithm model based on the feature image comprises:
constructing the single-target tracking algorithm network model;
and inputting the feature image and an initial target coordinate value into the single-target tracking algorithm network model to obtain the target coordinate offset value and the confidence coefficient.
Optionally, in another embodiment based on the foregoing method of the present application, the generating of multi-size preset bounding boxes and Hanning window values in parallel in the cuda kernel function comprises:
generating, based on each grid of the feature image, the multi-size preset frames with corresponding shapes and sizes in parallel in the cuda kernel function; and
generating, in parallel in the cuda kernel function, the Hanning window values of the same size as the feature image.
Optionally, in another embodiment based on the foregoing method of the present application, the performing of scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value comprises:
generating, in parallel in the cuda kernel function, a scale penalty coefficient corresponding to the size of the feature image, and applying the scale penalty to the confidence coefficient by using the scale penalty coefficient; and
performing Hanning window processing in parallel in the cuda kernel function to obtain the maximum confidence coefficient and the index value.
Optionally, in another embodiment based on the foregoing method of the present application, the determining of a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image comprises:
carrying out scale transformation on the target coordinate offset value in the cuda kernel function to obtain the target coordinate corresponding to the feature image.
Optionally, in another embodiment based on the method of the present application, the cuda kernel function comprises a preKernel kernel function, a scoreKernel kernel function, and a roiKernel kernel function.
Further, the method performs CUDA parallel acceleration of the post-processing of the Anchor-based single-target tracking algorithm, wherein a convolutional neural network of the single-target tracking algorithm needs to be constructed first, and a two-branch network structure model is constructed on the Libtorch framework.
Still further, the preset frames and the Hanning window need to be generated in parallel in the kernel function. The scale penalty and the Hanning window processing of the confidence coefficient are optimized in parallel to obtain the maximum confidence coefficient and the index value. Moreover, the coordinate scale transformation in the kernel function is parallelized to update the target coordinate. Finally, optionally, the subsequent image sequence is processed iteratively to search for the target until the target tracking task is completed.
It can be understood that applying the above technical means reduces memory copies between the host and the device, saves data transmission bandwidth, improves computational efficiency by efficiently utilizing the CUDA parallel architecture, and improves the inference speed and real-time performance of the algorithm.
In one mode, as shown in fig. 2, fig. 2 is a schematic diagram of generating multi-size preset frames in parallel in the cuda kernel function; as can be seen from fig. 2, each feature point of the feature image carries multiple bounding boxes of different sizes.
The processing method for accelerating Anchor-based data in parallel by using cuda comprises the following steps:
the first step is as follows: in one mode, the algorithm of the embodiment adopts a DaSiamRPN single-target tracking algorithm or a Libtorch Script inference framework network model.
Further, the convolutional neural network of the target tracking algorithm takes as input an image of the video sequence and the initial target pixel coordinates (or the target pixel coordinates locked in the previous frame), and outputs the coordinate offset values and confidences corresponding to the Anchor Boxes.
The second step is that: and generating multi-scale preset frames cuAnchorBoxes and a Hanning window cuHanning corresponding to the size of the feature map.
Further, the kernel function cuAnchorBoxes, cuHanning = preKernel(Scale, Size) is constructed, where Scale = [0.33, 0.5, 1, 2, 3] and Size = [19, 19].
The preset frames cuAnchorBoxes of the form Roi[x, y, w, h] are realized in parallel in the kernel function, with tensor size [19, 19, 5, 4]; here cuAnchorBoxes[i].x = xGrid, cuAnchorBoxes[i].y = yGrid, cuAnchorBoxes[i].w = wScale, cuAnchorBoxes[i].h = hScale, where i = threadIdx.x + blockIdx.x * blockDim.x, and xGrid, yGrid, wScale and hScale are obtained by scale transformation;
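The per-thread anchor construction can be mirrored on the host as follows. This is a hedged sketch: the flat index mirrors a 1-D CUDA launch, but the index ordering (scale fastest-varying) and the base-size and scale-to-size computations for xGrid, yGrid, wScale, hScale are illustrative assumptions, since the text says only that they "are obtained by scale transformation".

```python
# Host-side sketch of the cuAnchorBoxes construction in preKernel. The flat
# index i mirrors a 1-D CUDA launch: i = threadIdx.x + blockIdx.x * blockDim.x.
SCALES = [0.33, 0.5, 1, 2, 3]   # Scale from the embodiment
WIDTH, HEIGHT = 19, 19          # Size from the embodiment
BASE = 8.0                      # assumed base anchor side (illustrative)

def make_anchors():
    anchors = []
    for i in range(WIDTH * HEIGHT * len(SCALES)):   # one "thread" per anchor
        s = i % len(SCALES)             # scale index (assumed fastest-varying)
        cell = i // len(SCALES)
        x_grid = float(cell % WIDTH)    # anchor center x = grid column
        y_grid = float(cell // WIDTH)   # anchor center y = grid row
        w_scale = BASE * SCALES[s]      # assumed width transformation
        h_scale = BASE / SCALES[s]      # assumed height transformation
        anchors.append((x_grid, y_grid, w_scale, h_scale))
    return anchors

anchors = make_anchors()   # logical shape [19, 19, 5, 4], flattened
```

Each GPU thread would write exactly one 4-tuple; decomposing the global index into (grid cell, scale) is the same arithmetic the kernel would perform.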
Further, cuHanning is computed in parallel in the kernel function as cuHanning[i] = transpose(p_width)[i] * p_height[i], i.e., the outer product of the width and height window vectors, where i = threadIdx.x + blockIdx.x * blockDim.x.
The third step: scale penalty and Hanning window processing are performed on the confidence scores in the kernel function to obtain the maximum confidence score and the index value index.
Further, a kernel function is constructed:
score,index=scoreKernel(cuAnchorBoxes,cuHanning,offsets,scores)
further, based on parallel computing optimization, a scale penalty is applied to the confidence scores, scores[i] = scores[i] * penalty[i], wherein i = threadIdx.x + blockIdx.x * blockDim.x, and the scale penalty penalty[i] is calculated from the target offset;
further, based on parallel computing optimization, Hanning window processing is applied to the confidence scores, scores[j] = cuHanning[j] * scores[j], wherein j = threadIdx.x + blockIdx.x * blockDim.x; the maximum confidence score and the index value index are then obtained using the parallel-optimized Thrust library in CUDA;
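The scoreKernel step above can be sketched on the CPU as two elementwise products followed by an arg-max; in the kernel each product is computed by one thread and the reduction is done with the Thrust library, with std::max_element standing in for that reduction here. All three arrays are assumed equal in length, matching the elementwise indexing in the text.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// CPU reference for scoreKernel: apply the scale penalty, then the Hanning
// window, then return (maximum score, winning index).
std::pair<float, std::size_t> scoreMax(std::vector<float> scores,
                                       const std::vector<float>& penalty,
                                       const std::vector<float>& hanning) {
    for (std::size_t i = 0; i < scores.size(); ++i)   // one thread per i in CUDA
        scores[i] *= penalty[i] * hanning[i];
    auto it = std::max_element(scores.begin(), scores.end());
    return { *it, static_cast<std::size_t>(it - scores.begin()) };
}
```

The winning index is what the fourth step uses to pick the matching anchor box and offsets.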
the fourth step: carrying out scale conversion on the offsets coordinates in the kernel function to obtain target coordinates Roi [ x, y, w, h ];
further, a kernel function Roi = roiKernel(offsets, index) is constructed, the target position is obtained by calculation, and the result is then copied from the GPU device to the host; the scale conversion yields Roi[i].x = x, Roi[i].y = y, Roi[i].w = w, and Roi[i].h = h;
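The scale conversion inside roiKernel can be sketched as decoding the offsets at the winning index against the matching anchor box. The text does not give the decode formula, so the standard Anchor-based parameterisation (centre shifted by offset times anchor size, width/height scaled by the exponential of the size offsets) is used here as an assumption.

```cpp
#include <array>
#include <cmath>

struct Box { float x, y, w, h; };

// Sketch of the roiKernel scale conversion for a single winning anchor.
Box decodeRoi(const Box& anchor, const std::array<float, 4>& off) {
    return { anchor.x + off[0] * anchor.w,   // x shifted in units of anchor w
             anchor.y + off[1] * anchor.h,   // y shifted in units of anchor h
             anchor.w * std::exp(off[2]),    // width scaled exponentially
             anchor.h * std::exp(off[3]) };  // height scaled exponentially
}
```

Since only one box is decoded, this matches the launch configuration given later (thread blocks = 1, threads = 1).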
the fifth step: and iteratively executing the subsequent image sequence to search the target until the target tracking task is completed.
Further, offsets, scores = trackerNet(Frame, Roi) is executed;
further, cuAnchorBoxes, cuHanning = preKernel<<<blocks, threads>>>(Scale, Size) is executed, wherein the thread blocks = 19 × 5 and the thread count threads = 1; further, score, index = scoreKernel<<<blocks, threads>>>(cuAnchorBoxes, cuHanning, offsets, scores) is executed,
wherein the thread blocks = 19 × 5 and the thread count threads = 1;
further, Roi = roiKernel<<<blocks, threads>>>(offsets, index) is executed,
wherein the thread blocks = 1 and the thread count threads = 1.
In another embodiment of the present application, as shown in fig. 3, the present application further provides a processing apparatus for accelerating Anchor-based data in parallel by using cuda. The apparatus is applied to a GPU graphics card, wherein:
an obtaining module 201 configured to obtain a first output parameter output by the single-target tracking algorithm network model based on the feature image, where the first output parameter includes a target coordinate offset value and a confidence level;
a first generating module 202 configured to generate a multi-size preset frame and a hanning window value in parallel in the cuda kernel, wherein the multi-size preset frame corresponds to the size of the feature image;
the second generating module 203 is configured to perform scale penalty and hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value;
a third generating module 204, configured to determine a coordinate offset in the cuda kernel function based on the target coordinate offset value, and obtain a target coordinate corresponding to the feature image.
The method is applied to a GPU graphics card, and comprises: acquiring a first output parameter output by a single-target tracking algorithm network model based on a characteristic image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient; generating a multi-size preset frame and a Hanning window value in parallel in the cuda kernel function, wherein the multi-size preset frame corresponds to the size of the characteristic image; carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain the maximum confidence coefficient and an index value; and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the characteristic image. By applying the technical scheme of the application, the output values of the network model of the single-target tracking algorithm can be transferred to kernel functions on the GPU graphics card, so that all subsequent computing tasks are processed in parallel by the kernel functions. This solves the problem of low processing efficiency in the prior art, where the model output parameters need to be copied to the CPU for calculation.
In another embodiment of the present application, the obtaining module 201 further includes:
an obtaining module 201 configured to construct the single target tracking algorithm network model;
the obtaining module 201 is configured to input the feature image and the initial target coordinate value into the single target tracking algorithm network model, so as to obtain the target coordinate offset value and the confidence.
In another embodiment of the present application, the first generating module 202 further includes:
a first generating module 202 configured to generate the multi-size preset bounding box of the corresponding shape and the corresponding size in parallel in the cuda kernel based on each mesh of the feature image; and,
a first generating module 202 configured to generate the hanning window values of the same size as the feature image in parallel in the cuda kernel.
In another embodiment of the present application, the second generating module 203 further includes:
a second generating module 203, configured to generate a scale penalty coefficient corresponding to the feature image size in parallel in the cuda kernel function, and perform a scale penalty on the confidence by using the scale penalty coefficient; and,
a second generating module 203 configured to perform Hanning window processing in parallel in the cuda kernel function, resulting in the maximum confidence and an index value.
In another embodiment of the present application, the third generating module 204 further includes:
a third generating module 204, configured to perform scale transformation on the target coordinate offset value in the cuda kernel function to obtain a target coordinate corresponding to the feature image.
In another embodiment of the present application, the cuda kernel functions include a preKernel kernel function, a scoreKernel kernel function, and a roiKernel kernel function.
Fig. 4 is a block diagram illustrating a logical structure of an electronic device in accordance with an exemplary embodiment. For example, the electronic device 300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium, such as a memory, including instructions executable by a processor of an electronic device to perform the above-described processing method for accelerating Anchor-based data in parallel by cuda, the method being applied on a GPU graphics card, wherein: acquiring a first output parameter output by a single-target tracking algorithm network model based on a characteristic image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient; generating a multi-size preset frame and a Hanning window value in parallel in a cuda kernel function, wherein the multi-size preset frame corresponds to the size of the characteristic image; carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value; and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the characteristic image. Optionally, the instructions may also be executable by a processor of the electronic device to perform other steps involved in the exemplary embodiments described above. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided an application/computer program product including one or more instructions executable by a processor of an electronic device to perform the above processing method for parallel acceleration of Anchor-based data by cuda, the method being applied on a GPU graphics card, wherein: acquiring a first output parameter output by a single-target tracking algorithm network model based on a characteristic image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient; generating a multi-size preset frame and a Hanning window value in parallel in a cuda kernel function, wherein the multi-size preset frame corresponds to the size of the characteristic image; carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value; and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the characteristic image. Optionally, the instructions may also be executable by a processor of the electronic device to perform other steps involved in the exemplary embodiments described above.
Fig. 4 is an exemplary diagram of an electronic device 300. Those skilled in the art will appreciate that fig. 4 is merely an example of the electronic device 300 and does not constitute a limitation thereof; the device may include more or fewer components than those shown, combine certain components, or use different components. For example, the electronic device 300 may also include input-output devices, network access devices, buses, etc.
The processor 301 may be a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU). The processor 301 is the control center of the electronic device 300 and connects the various parts of the entire electronic device 300 using various interfaces and lines.
The memory 302 may be used to store computer readable instructions, and the processor 301 may implement various functions of the electronic device 300 by executing the computer readable instructions or modules stored in the memory 302 and by invoking data stored in the memory 302. The memory 302 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the electronic device 300, and the like. In addition, the memory 302 may include a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Memory Card (Flash Card), at least one disk storage device, a flash memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), or other non-volatile/volatile storage devices.
If the modules integrated by the electronic device 300 are implemented in the form of software functional modules and sold or used as separate products, they may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method embodiments described above may be realized by instructing relevant hardware through computer readable instructions, which may be stored in a computer readable storage medium; when the computer readable instructions are executed by a processor, the steps of the above method embodiments can be realized.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (9)

1. A processing method for accelerating Anchor-based data in parallel by using cuda is characterized by being applied to a GPU (graphics processing Unit) display card, wherein:
acquiring a first output parameter output by a single-target tracking algorithm network model based on a characteristic image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient;
generating a multi-size preset frame and a Hanning window value in parallel in a cuda kernel function, wherein the multi-size preset frame corresponds to the size of the characteristic image;
carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value;
and determining coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the characteristic image.
2. The method of claim 1, wherein obtaining a first output parameter of the single-target tracking algorithm network model based on the feature image output comprises:
constructing the single target tracking algorithm network model;
and inputting the characteristic image and the initial target coordinate value into the single target tracking algorithm network model to obtain the target coordinate offset value and the confidence coefficient.
3. The method of claim 1, wherein the generating the multi-size preset bounding box and the hanning window value in parallel in the cuda kernel function comprises:
based on each grid of the feature image, generating the multi-size preset frame with a corresponding shape and a corresponding size in the cuda kernel function in parallel; and,
generating the Hanning window values of the same size as the feature image in parallel in the cuda kernel function.
4. The method of claim 1, wherein the performing a scale penalty and a hanning window on the confidence level in the cuda kernel function to obtain a maximum confidence level and an index value comprises:
generating a scale penalty coefficient corresponding to the size of the characteristic image in parallel in the cuda kernel function, and carrying out a scale penalty on the confidence coefficient by using the scale penalty coefficient; and,
and performing Hanning window processing in parallel in the cuda kernel function to obtain the maximum confidence coefficient and the index value.
5. The method as claimed in claim 1, wherein the determining a coordinate offset in the cuda kernel based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image comprises:
and carrying out scale transformation on the target coordinate offset value in the cuda kernel function to obtain a target coordinate corresponding to the characteristic image.
6. The method of any one of claims 1-5, wherein the cuda kernel functions comprise a preKernel kernel function, a scoreKernel kernel function, and a roiKernel kernel function.
7. A processing device for accelerating Anchor-based data in parallel by using cuda is applied to a GPU (graphics processing Unit) display card, wherein:
the acquisition module is configured to acquire a first output parameter output by the single-target tracking algorithm network model based on the characteristic image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient;
a first generation module configured to generate a multi-size preset frame and a Hanning window value in parallel in a cuda kernel, wherein the multi-size preset frame corresponds to the size of the feature image;
the second generation module is configured to perform scale punishment and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value;
and the third generation module is configured to determine coordinate offset in the cuda kernel function based on the target coordinate offset value, so as to obtain a target coordinate corresponding to the feature image.
8. An electronic device, comprising:
a memory for storing executable instructions; and,
a processor for communicating with the memory to execute the executable instructions so as to complete the operations of the processing method for accelerating Anchor-based data in parallel by using cuda according to any one of claims 1-6.
9. A computer-readable storage medium storing computer-readable instructions, wherein the instructions, when executed, perform the operations of any one of claims 1 to 6 using cuda to parallel accelerate the processing method of Anchor-based data.
CN202111252339.7A 2021-10-27 2021-10-27 Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment Active CN113704520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111252339.7A CN113704520B (en) 2021-10-27 2021-10-27 Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111252339.7A CN113704520B (en) 2021-10-27 2021-10-27 Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment

Publications (2)

Publication Number Publication Date
CN113704520A true CN113704520A (en) 2021-11-26
CN113704520B CN113704520B (en) 2022-03-08

Family

ID=78647007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111252339.7A Active CN113704520B (en) 2021-10-27 2021-10-27 Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment

Country Status (1)

Country Link
CN (1) CN113704520B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880509A (en) * 2012-09-17 2013-01-16 北京大学 Compute unified device architecture (CUDA) based grid digital elevation model (DEM) neighborhood analysis system and method
CN107124286A (en) * 2016-02-24 2017-09-01 深圳市知穹科技有限公司 A kind of mass data high speed processing, the system and method for interaction
CN108564213A (en) * 2018-04-10 2018-09-21 中国水利水电科学研究院 Parallel reservoir group flood control optimal scheduling method based on GPU acceleration
WO2021007514A1 (en) * 2019-07-10 2021-01-14 Schlumberger Technology Corporation Active learning for inspection tool
CN112986944A (en) * 2021-03-04 2021-06-18 西安电子科技大学 CUDA heterogeneous parallel acceleration-based radar MTI and MTD implementation method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880509A (en) * 2012-09-17 2013-01-16 北京大学 Compute unified device architecture (CUDA) based grid digital elevation model (DEM) neighborhood analysis system and method
CN107124286A (en) * 2016-02-24 2017-09-01 深圳市知穹科技有限公司 A kind of mass data high speed processing, the system and method for interaction
CN108564213A (en) * 2018-04-10 2018-09-21 中国水利水电科学研究院 Parallel reservoir group flood control optimal scheduling method based on GPU acceleration
WO2021007514A1 (en) * 2019-07-10 2021-01-14 Schlumberger Technology Corporation Active learning for inspection tool
CN112986944A (en) * 2021-03-04 2021-06-18 西安电子科技大学 CUDA heterogeneous parallel acceleration-based radar MTI and MTD implementation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHAN LI, LIANG YUBO: "Research on accuracy improvement of Anchor-based single-stage object detection algorithms", Journal of Beijing Polytechnic College *
ZHANG LELEZHANG: "Converting a PyTorch model from CPU to GPU", Cnblogs *

Also Published As

Publication number Publication date
CN113704520B (en) 2022-03-08

Similar Documents

Publication Publication Date Title
DE112020004625T5 (en) TRANSPOSED CONVOLUTION WITH SYSTOLIC ARRAY
CN109949219B (en) Reconstruction method, device and equipment of super-resolution image
DE112020003128T5 (en) DILATED CONVOLUTION WITH SYSTOLIC ARRAY
CN109583509B (en) Data generation method and device and electronic equipment
CN109086877A (en) A kind of device and method for executing convolutional neural networks forward operation
CN111931901A (en) Neural network construction method and device
CN114491399A (en) Data processing method and device, terminal equipment and computer readable storage medium
CN112508190A (en) Method, device and equipment for processing structured sparse parameters and storage medium
CN111738435A (en) Online sparse training method and system based on mobile equipment
CN111242286A (en) Data format conversion method and device and computer readable storage medium
CN109685208B (en) Method and device for thinning and combing acceleration of data of neural network processor
CN109697083B (en) Fixed-point acceleration method and device for data, electronic equipment and storage medium
CN113449878B (en) Data distributed incremental learning method, system, equipment and storage medium
CN113704520B (en) Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment
CN113780365A (en) Sample generation method and device
CN116051699B (en) Dynamic capture data processing method, device, equipment and storage medium
CN116503608A (en) Data distillation method based on artificial intelligence and related equipment
CN113112084B (en) Training plane rear body research and development flow optimization method and device
CN111652051B (en) Face detection model generation method, device, equipment and storage medium
CN113840169B (en) Video processing method, device, computing equipment and storage medium
CN114155276A (en) Single-target tracking method and device, electronic equipment and storage medium
CN113724176A (en) Multi-camera motion capture seamless connection method, device, terminal and medium
CN114626284A (en) Model processing method and related device
CN111833395A (en) Direction-finding system single target positioning method and device based on neural network model
CN110428453A (en) Data processing method, device, data processing equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant