CN113704520A - Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment - Google Patents


Info

Publication number
CN113704520A
CN113704520A (application CN202111252339.7A)
Authority
CN
China
Prior art keywords
cuda
kernel function
parallel
target coordinate
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111252339.7A
Other languages
Chinese (zh)
Other versions
CN113704520B (en)
Inventor
王浩
杨烟台
尹桂信
张天昊
傅春连
周晨磊
张玉晖
宋明武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
Original Assignee
Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center filed Critical Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
Priority to CN202111252339.7A priority Critical patent/CN113704520B/en
Publication of CN113704520A publication Critical patent/CN113704520A/en
Application granted granted Critical
Publication of CN113704520B publication Critical patent/CN113704520B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F16/50 of still image data
              • G06F16/51 Indexing; Data structures therefor; Storage structures
              • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
                • G06F16/583 using metadata automatically derived from the content
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
              • G06N3/04 Architecture, e.g. interconnection topology
                • G06N3/045 Combinations of networks
              • G06N3/08 Learning methods


Abstract

The application discloses a method, an apparatus, an electronic device, and a medium for accelerating the processing of Anchor-based data in parallel by using CUDA. By applying the technical scheme of the application, the output values of the single-target tracking network model can be handed directly to kernel functions on the GPU graphics card, so that all subsequent computing tasks are processed in parallel by those kernel functions. This avoids the low processing efficiency of the prior art, in which the model output parameters must first be copied to the CPU for calculation.

Description

Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment
Technical Field
The present application relates to data communication technologies, and in particular, to a method, an apparatus, an electronic device, and a medium for accelerating Anchor-based data processing by using cuda in parallel.
Background
Single-target tracking algorithms based on deep Convolutional Neural Networks (CNNs) have made breakthrough progress in the field of computer vision. They fall into two types, Anchor-based and Anchor-free, according to whether preset bounding boxes are required, and the accuracy of many such algorithms, for example DaSiamRPN, SiamRPN++ and SiamFC++, has gradually reached industrial and production standards.
While pursuing accuracy, the industry is also concerned with increasing the front-end inference speed of these algorithms. Mainstream front-end inference frameworks such as Libtorch, TensorRT and OpenCV focus on optimizing the forward inference speed of the algorithm model's backbone network, mostly using the GPU to accelerate execution, and thus provide low-latency, high-throughput deployment inference for deep learning algorithms.
The Anchor-based single-target tracking algorithm improves performance by adding a large number of preset bounding boxes, which in turn increases the amount of data copied from the GPU device to the host. How to accelerate the computation of the single-target tracking algorithm and reduce its overall inference time has therefore become a problem to be solved by those skilled in the art.
Disclosure of Invention
The embodiments of the application provide a processing method, an apparatus, an electronic device, and a medium for accelerating Anchor-based data in parallel by using cuda. According to one aspect of the embodiments, a processing method for accelerating Anchor-based data in parallel by using cuda is applied to a GPU graphics card, wherein:
acquiring a first output parameter output by a single-target tracking algorithm network model based on a feature image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient;
generating multi-size preset frames and Hanning window values in parallel in a cuda kernel function, wherein the multi-size preset frames correspond to the size of the feature image;
carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value;
and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image.
Optionally, in another embodiment based on the above method of the present application, the obtaining of a first output parameter output by the single-target tracking algorithm model based on the feature image comprises:
constructing the single-target tracking algorithm network model;
and inputting the feature image and an initial target coordinate value into the single-target tracking algorithm network model to obtain the target coordinate offset value and the confidence coefficient.
Optionally, in another embodiment based on the foregoing method of the present application, the generating of multi-size preset bounding boxes and Hanning window values in parallel in the cuda kernel function comprises:
generating, based on each grid of the feature image, the multi-size preset frames with corresponding shapes and sizes in parallel in the cuda kernel function; and
generating, in parallel in the cuda kernel function, the Hanning window values of the same size as the feature image.
Optionally, in another embodiment based on the foregoing method of the present application, the performing of scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value comprises:
generating, in parallel in the cuda kernel function, a scale penalty coefficient corresponding to the size of the feature image, and applying the scale penalty to the confidence coefficient by using the scale penalty coefficient; and
performing Hanning window processing in parallel in the cuda kernel function to obtain the maximum confidence coefficient and the index value.
Optionally, in another embodiment based on the foregoing method of the present application, the determining of a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image comprises:
carrying out scale transformation on the target coordinate offset value in the cuda kernel function to obtain the target coordinate corresponding to the feature image.
Optionally, in another embodiment based on the method of the present application, the cuda kernel function comprises a preKernel kernel function, a scoreKernel kernel function, and a roiKernel kernel function.
According to another aspect of the embodiments of the present application, there is provided a processing apparatus for accelerating Anchor-based data in parallel by using cuda, the processing apparatus being applied to a GPU graphics card, wherein:
the acquisition module is configured to acquire a first output parameter output by the single-target tracking algorithm network model based on the feature image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient;
the first generation module is configured to generate multi-size preset frames and Hanning window values in parallel in a cuda kernel function, wherein the multi-size preset frames correspond to the size of the feature image;
the second generation module is configured to perform scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value;
and the third generation module is configured to determine a coordinate offset in the cuda kernel function based on the target coordinate offset value, so as to obtain a target coordinate corresponding to the feature image.
According to another aspect of the embodiments of the present application, there is provided an electronic device including:
a memory for storing executable instructions; and
and a processor in communication with the memory, configured to execute the executable instructions so as to complete the operations of any one of the above processing methods for accelerating Anchor-based data in parallel by using cuda.
According to a further aspect of the embodiments of the present application, there is provided a computer-readable storage medium for storing computer-readable instructions, which when executed, perform any one of the above operations of the processing method for accelerating Anchor-based data in parallel by using cuda.
The method is applied to a GPU graphics card and comprises: obtaining a first output parameter output by the single-target tracking algorithm network model based on a feature image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient; generating multi-size preset frames and Hanning window values in parallel in the cuda kernel function, wherein the multi-size preset frames correspond to the size of the feature image; carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain the maximum confidence coefficient and an index value; and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image. By applying the technical scheme of the application, the output values of the single-target tracking network model can be handed to kernel functions on the GPU graphics card, so that all subsequent computing tasks are processed in parallel by those kernel functions. This avoids the low processing efficiency of the prior art, in which the model output parameters must first be copied to the CPU for calculation.
The technical solution of the present application is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
The present application may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flow diagram of a processing method for accelerating Anchor-based data by using cuda in parallel according to the present application;
Fig. 2 is a schematic diagram of feature points in the processing of Anchor-based data accelerated in parallel by using cuda according to the present application;
Fig. 3 is a schematic structural diagram of an electronic device for processing Anchor-based data with cuda parallel acceleration according to the present application;
Fig. 4 is a schematic structural diagram of an electronic device for processing Anchor-based data with cuda parallel acceleration according to the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In addition, the technical solutions of the various embodiments of the present application may be combined with each other, provided that the combination can be realized by a person skilled in the art; when technical solutions are contradictory or cannot be realized, such a combination should be considered absent and outside the protection scope of the present application.
It should be noted that all directional indicators in the embodiments of the present application (such as upper, lower, left, right, front, and rear) are only used to explain the relative positional relationship and motion of components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicator changes accordingly.
A processing method for performing parallel acceleration of Anchor-based data using cuda according to an exemplary embodiment of the present application is described below with reference to fig. 1-2. It should be noted that the following application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
The application also provides a processing method and device for accelerating Anchor-based data in parallel by using cuda, electronic equipment and a medium.
Fig. 1 schematically shows a flow chart of a processing method for accelerating Anchor-based data by using cuda in parallel according to an embodiment of the present application. As shown in fig. 1, the method is applied to a GPU graphics card, wherein:
s101, obtaining a first output parameter output by the single-target tracking algorithm network model based on the characteristic image, wherein the first output parameter comprises a target coordinate deviation value and a confidence coefficient.
In the related art, single-target tracking algorithms based on deep Convolutional Neural Networks (CNNs) have made breakthrough progress in the field of computer vision, and the accuracy of many algorithms has gradually reached the standards of industrialization and commercialization; alongside accuracy, the industry also pursues the front-end inference speed of these algorithms. Deep neural networks have a large number of layers and nodes, so it is very important to consider how to reduce the required memory and computation, especially for edge computing carrier boards with weaker computing performance.
Two approaches are common for accelerating a single-target tracking algorithm. The first relies on mainstream front-end inference frameworks such as Libtorch, TensorRT and OpenCV, which focus on using the GPU to optimize the forward inference speed of the deep neural network backbone, providing low-latency, high-throughput deployment inference for deep learning algorithms. The second is deep neural network compression and acceleration: parameter pruning and sharing, low-rank decomposition, transferred/compact convolutional filters, knowledge distillation, and the like. Methods based on parameter pruning and sharing focus on exploring redundant parts of the model parameters and attempt to remove redundant and unimportant ones. Methods based on low-rank decomposition use matrix/tensor decomposition to estimate the most informative parameters of a deep CNN. Methods based on transferred/compact convolutional filters design convolution filters with special structure to reduce the complexity of storage and calculation. Knowledge distillation learns a distilled model, i.e., trains a more compact neural network to reproduce the output of a large network.
At present, the acceleration of deep neural network models focuses either on accelerating the forward inference process with GPU computing power, or on pruning and lightening the network model through compression and acceleration methods. For the data left in GPU device memory after model inference, the usual subsequent processing is to copy it from the device to the host and then process it on the CPU, which greatly reduces overall processing efficiency.
Further, the method can first construct a convolutional neural network, TracerNet, for the single-target tracking algorithm, which outputs the target coordinate offset values offsets and the confidences scores corresponding to the Anchor Boxes.
Specifically, the single-target tracking algorithm network model may adopt a two-branch neural network structure, comprising a dynamic convolution kernel template branch and a target tracking branch. It should be noted that the network model can be a forward inference network constructed on the Libtorch framework and obtained by PyTorch model conversion, or a network model obtained by converting an ONNX model into a forward inference network constructed on the TensorRT framework.
In one mode, one input of the single-target tracking network model may be a video sequence image, and the other input may be the initial target coordinates or the target coordinates Roi[x, y, w, h] locked in the previous frame. For each frame of image, the output is the target coordinate offset value offsets and the confidence scores corresponding to the preset bounding boxes (Anchor Boxes).
Further, the present application may define this step as offsets, scores = TracerNet(Frame, Roi), which can be executed on the CUDA cores by means of the AI algorithm framework.
S102, generating multi-size preset frames and Hanning window values in parallel in the cuda kernel function, wherein the multi-size preset frames correspond to the size of the feature image.
Specifically, in the embodiment of the present application, a kernel function may further be defined that, for the multi-scale set Scale = [s1, s2, …, sn] corresponding to the feature map Size = [width, height], generates the preset bounding box values cuAnchorBoxes and the Hanning window values cuHanning.
Further, to generate the multi-size preset frames, for each grid cell on the feature map, the Anchor Boxes of different shapes and sizes corresponding to that cell may be generated in parallel in the kernel function.
In one approach, the application can generate Hanning window values of the same size as the feature map in parallel on the CUDA cores. The Hanning window may be generated as:
p_j[i] = 0.5 * (1 − cos(2 * PI * i / (j − 1))),
where 0 ≤ i ≤ j − 1, j ∈ {width, height}, and PI = 3.14159. Further, this kernel function may be defined as cuAnchorBoxes, cuHanning = preKernel(Scale, Size), and the operation is executed in parallel on the CUDA cores.
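As a concrete host-side reference for the window values that preKernel produces in parallel (an illustrative sketch, not the patent's CUDA source; the 2π form is the standard Hann definition, and building the 2-D window as an outer product of the per-axis windows matches the cuHanning construction in the detailed embodiment):

```python
import math

def hanning(n):
    """1-D Hann window of length n: p[i] = 0.5 * (1 - cos(2*pi*i/(n-1)))."""
    if n == 1:
        return [1.0]
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1))) for i in range(n)]

def hanning_2d(width, height):
    """2-D window as the outer product: window[r][c] = p_height[r] * p_width[c]."""
    p_w, p_h = hanning(width), hanning(height)
    return [[p_h[r] * p_w[c] for c in range(width)] for r in range(height)]

win = hanning_2d(19, 19)   # same size as the 19x19 feature map of the embodiment
```

On the GPU each thread would fill one element of the flattened window from its global index; the host loops above play that role. The window is 0 at the borders and peaks at the feature-map center, which is what later biases the confidence toward the previous target position.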
S103, carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain the maximum confidence coefficient and an index value.
In one mode, the embodiment of the application can implement the scale penalty and the Hanning window processing of the confidence coefficient in parallel in the kernel function to obtain the maximum confidence coefficient and the index value index. Specifically, the application generates, in parallel on the CUDA cores, the scale penalty coefficient penalty matching the size of the confidence tensor, and then penalizes the confidence according to this coefficient.
Further, the Hanning window processing of the confidence values can be optimized based on parallel computing, so as to obtain the maximum confidence score and the index value index.
Finally, in the embodiment of the present application, the kernel function may be defined as:
score, index = scoreKernel(cuAnchorBoxes, cuHanning, offsets, scores), and the operations are performed in parallel on the CUDA cores.
S104, determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image.
Further, the embodiment of the present application may compute the coordinate offsets of the target coordinate offset values in parallel, so as to obtain the target coordinate Roi[x, y, w, h]. The kernel function in the embodiment of the present application can be defined as:
Roi = roiKernel(offsets, index), and the operations are performed in parallel on the CUDA cores.
The method is applied to a GPU graphics card and comprises: obtaining a first output parameter output by the single-target tracking algorithm network model based on a feature image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient; generating multi-size preset frames and Hanning window values in parallel in the cuda kernel function, wherein the multi-size preset frames correspond to the size of the feature image; carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain the maximum confidence coefficient and an index value; and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image. By applying the technical scheme of the application, the output values of the single-target tracking network model can be handed to kernel functions on the GPU graphics card, so that all subsequent computing tasks are processed in parallel by those kernel functions. This avoids the low processing efficiency of the prior art, in which the model output parameters must first be copied to the CPU for calculation.
Optionally, in another embodiment based on the method of the present application, obtaining a first output parameter output by the single-target tracking algorithm model based on the feature image comprises:
constructing the single-target tracking algorithm network model;
and inputting the feature image and an initial target coordinate value into the single-target tracking algorithm network model to obtain the target coordinate offset value and the confidence coefficient.
Optionally, in another embodiment based on the foregoing method of the present application, the generating of multi-size preset bounding boxes and Hanning window values in parallel in the cuda kernel function comprises:
generating, based on each grid of the feature image, the multi-size preset frames with corresponding shapes and sizes in parallel in the cuda kernel function; and
generating, in parallel in the cuda kernel function, the Hanning window values of the same size as the feature image.
Optionally, in another embodiment based on the foregoing method of the present application, the performing of scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value comprises:
generating, in parallel in the cuda kernel function, a scale penalty coefficient corresponding to the size of the feature image, and applying the scale penalty to the confidence coefficient by using the scale penalty coefficient; and
performing Hanning window processing in parallel in the cuda kernel function to obtain the maximum confidence coefficient and the index value.
Optionally, in another embodiment based on the foregoing method of the present application, the determining of a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image comprises:
carrying out scale transformation on the target coordinate offset value in the cuda kernel function to obtain the target coordinate corresponding to the feature image.
Optionally, in another embodiment based on the method of the present application, the cuda kernel function comprises a preKernel kernel function, a scoreKernel kernel function, and a roiKernel kernel function.
Further, the method performs CUDA parallel acceleration of the post-processing of the Anchor-based single-target tracking algorithm, wherein a convolutional neural network of the single-target tracking algorithm needs to be constructed first, and a two-branch network structure model is constructed on the Libtorch framework.
Still further, the preset frames and the Hanning window need to be generated in parallel in the kernel function. The scale penalty and the Hanning window processing of the confidence coefficient are optimized in parallel to obtain the maximum confidence coefficient and the index value. Moreover, the coordinate scale transformation in the kernel function is parallelized to update the target coordinate. Finally, optionally, the subsequent image sequence is processed iteratively to search for the target until the target tracking task is completed.
It can be understood that applying the above technical means reduces memory copies between the host and the device, saves data transmission bandwidth, improves computational efficiency by efficiently utilizing the CUDA parallel architecture, and improves the inference speed and real-time performance of the algorithm.
In one mode, as shown in fig. 2, fig. 2 is a schematic diagram of generating multi-size preset frames in parallel in the cuda kernel function; as can be seen from fig. 2, each feature point of the feature image carries multiple bounding boxes of different sizes.
The processing method for accelerating Anchor-based data in parallel by using cuda comprises the following steps:
the first step is as follows: in one mode, the algorithm of the embodiment adopts a DaSiamRPN single-target tracking algorithm or a Libtorch Script inference framework network model.
Further, the convolutional neural network of the target tracking algorithm takes as input an image of the video sequence and the initial target pixel coordinates (or the target pixel coordinates locked in the previous frame), and outputs the coordinate offset values and confidences corresponding to the Anchor Boxes.
The second step is that: and generating multi-scale preset frames cuAnchorBoxes and a Hanning window cuHanning corresponding to the size of the feature map.
Further, the kernel function cuAnchorBoxes, cuHanning = preKernel(Scale, Size) is constructed, where Scale = [0.33, 0.5, 1, 2, 3] and Size = [19, 19].
The preset frames cuAnchorBoxes of the form Roi[x, y, w, h] are realized in parallel in the kernel function, with tensor size [19, 19, 5, 4]; here cuAnchorBoxes[i].x = xGrid, cuAnchorBoxes[i].y = yGrid, cuAnchorBoxes[i].w = wScale, cuAnchorBoxes[i].h = hScale, where i = threadIdx.x + blockIdx.x * blockDim.x, and xGrid, yGrid, wScale and hScale are obtained by scale transformation;
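The per-thread anchor construction can be mirrored on the host as follows. This is a hedged sketch: the flat index mirrors a 1-D CUDA launch, but the index ordering (scale fastest-varying) and the base-size and scale-to-size computations for xGrid, yGrid, wScale, hScale are illustrative assumptions, since the text says only that they "are obtained by scale transformation".

```python
# Host-side sketch of the cuAnchorBoxes construction in preKernel. The flat
# index i mirrors a 1-D CUDA launch: i = threadIdx.x + blockIdx.x * blockDim.x.
SCALES = [0.33, 0.5, 1, 2, 3]   # Scale from the embodiment
WIDTH, HEIGHT = 19, 19          # Size from the embodiment
BASE = 8.0                      # assumed base anchor side (illustrative)

def make_anchors():
    anchors = []
    for i in range(WIDTH * HEIGHT * len(SCALES)):   # one "thread" per anchor
        s = i % len(SCALES)             # scale index (assumed fastest-varying)
        cell = i // len(SCALES)
        x_grid = float(cell % WIDTH)    # anchor center x = grid column
        y_grid = float(cell // WIDTH)   # anchor center y = grid row
        w_scale = BASE * SCALES[s]      # assumed width transformation
        h_scale = BASE / SCALES[s]      # assumed height transformation
        anchors.append((x_grid, y_grid, w_scale, h_scale))
    return anchors

anchors = make_anchors()   # logical shape [19, 19, 5, 4], flattened
```

Each GPU thread would write exactly one 4-tuple; decomposing the global index into (grid cell, scale) is the same arithmetic the kernel would perform.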
Further, cuHanning is computed in parallel in the kernel function as cuHanning[i] = transpose(p_width)[i] * p_height[i], i.e., the outer product of the width and height window vectors, where i = threadIdx.x + blockIdx.x * blockDim.x.
The third step: scale penalty and Hanning window processing are performed on the confidence scores in the kernel function to obtain the maximum confidence score and the index value index.
Further, a kernel function is constructed:
score,index=scoreKernel(cuAnchorBoxes,cuHanning,offsets,scores)
further, based on parallel computing optimization, a scale penalty is applied to the confidence scores, scores[i] = scores[i] * penalty[i], wherein i = threadIdx.x + blockIdx.x * blockDim.x, and the scale penalty penalty[i] is calculated from the target offset;
further, based on parallel computing optimization, Hanning window processing is applied to the confidence scores, scores[j] = cuHanning[j] * scores[j], wherein j = threadIdx.x + blockIdx.x * blockDim.x; the maximum confidence score and the index value index are then obtained using the parallel-optimized Thrust library in CUDA;
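The scoreKernel step above can be sketched on the CPU as two elementwise products followed by an arg-max; in the kernel each product is computed by one thread and the reduction is done with the Thrust library, with std::max_element standing in for that reduction here. All three arrays are assumed equal in length, matching the elementwise indexing in the text.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// CPU reference for scoreKernel: apply the scale penalty, then the Hanning
// window, then return (maximum score, winning index).
std::pair<float, std::size_t> scoreMax(std::vector<float> scores,
                                       const std::vector<float>& penalty,
                                       const std::vector<float>& hanning) {
    for (std::size_t i = 0; i < scores.size(); ++i)   // one thread per i in CUDA
        scores[i] *= penalty[i] * hanning[i];
    auto it = std::max_element(scores.begin(), scores.end());
    return { *it, static_cast<std::size_t>(it - scores.begin()) };
}
```

The winning index is what the fourth step uses to pick the matching anchor box and offsets.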
the fourth step: carrying out scale conversion on the offsets coordinates in the kernel function to obtain target coordinates Roi [ x, y, w, h ];
further, a kernel function Roi = roiKernel(offsets, index) is constructed, the target position is obtained by calculation, and the result is then copied from the GPU device to the host; the scale conversion yields Roi[i].x = x, Roi[i].y = y, Roi[i].w = w, and Roi[i].h = h;
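The scale conversion inside roiKernel can be sketched as decoding the offsets at the winning index against the matching anchor box. The text does not give the decode formula, so the standard Anchor-based parameterisation (centre shifted by offset times anchor size, width/height scaled by the exponential of the size offsets) is used here as an assumption.

```cpp
#include <array>
#include <cmath>

struct Box { float x, y, w, h; };

// Sketch of the roiKernel scale conversion for a single winning anchor.
Box decodeRoi(const Box& anchor, const std::array<float, 4>& off) {
    return { anchor.x + off[0] * anchor.w,   // x shifted in units of anchor w
             anchor.y + off[1] * anchor.h,   // y shifted in units of anchor h
             anchor.w * std::exp(off[2]),    // width scaled exponentially
             anchor.h * std::exp(off[3]) };  // height scaled exponentially
}
```

Since only one box is decoded, this matches the launch configuration given later (thread blocks = 1, threads = 1).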
the fifth step: and iteratively executing the subsequent image sequence to search the target until the target tracking task is completed.
Further, offsets, scores = trackerNet(Frame, Roi) is executed;
further, cuAnchorBoxes, cuHanning = preKernel<<<blocks, threads>>>(Scale, Size) is executed, wherein the thread blocks = 19 × 5 and the thread count threads = 1; further, score, index = scoreKernel<<<blocks, threads>>>(cuAnchorBoxes, cuHanning, offsets, scores) is executed,
wherein the thread blocks = 19 × 5 and the thread count threads = 1;
further, Roi = roiKernel<<<blocks, threads>>>(offsets, index) is executed,
wherein the thread blocks = 1 and the thread count threads = 1.
In another embodiment of the present application, as shown in fig. 3, the present application further provides a processing apparatus for accelerating Anchor-based data in parallel by using cuda. The apparatus is applied to a GPU graphics card, wherein:
an obtaining module 201 configured to obtain a first output parameter output by the single-target tracking algorithm network model based on the feature image, where the first output parameter includes a target coordinate offset value and a confidence level;
a first generating module 202 configured to generate a multi-size preset frame and a hanning window value in parallel in the cuda kernel, wherein the multi-size preset frame corresponds to the size of the feature image;
the second generating module 203 is configured to perform scale penalty and hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value;
a third generating module 204, configured to determine a coordinate offset in the cuda kernel function based on the target coordinate offset value, and obtain a target coordinate corresponding to the feature image.
The method is applied to a GPU graphics card, and comprises: acquiring a first output parameter output by a single-target tracking algorithm network model based on a characteristic image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient; generating a multi-size preset frame and a Hanning window value in parallel in the cuda kernel function, wherein the multi-size preset frame corresponds to the size of the characteristic image; carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain the maximum confidence coefficient and an index value; and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the characteristic image. By applying the technical scheme of the application, the output values of the network model of the single-target tracking algorithm can be transferred to kernel functions on the GPU graphics card, so that all subsequent computing tasks are processed in parallel by the kernel functions. This solves the problem of low processing efficiency in the prior art, where the model output parameters need to be copied to the CPU for calculation.
In another embodiment of the present application, the obtaining module 201 further includes:
an obtaining module 201 configured to construct the single target tracking algorithm network model;
the obtaining module 201 is configured to input the feature image and the initial target coordinate value into the single target tracking algorithm network model, so as to obtain the target coordinate offset value and the confidence.
In another embodiment of the present application, the first generating module 202 further includes:
a first generating module 202 configured to generate the multi-size preset bounding box of the corresponding shape and the corresponding size in parallel in the cuda kernel based on each mesh of the feature image; and,
a first generating module 202 configured to generate the hanning window values of the same size as the feature image in parallel in the cuda kernel.
In another embodiment of the present application, the second generating module 203 further includes:
a second generating module 203, configured to generate a scale penalty coefficient corresponding to the feature image size in parallel in the cuda kernel function, and perform a scale penalty on the confidence by using the scale penalty coefficient; and,
a second generating module 203 configured to perform Hanning window processing in parallel in the cuda kernel function, resulting in the maximum confidence and an index value.
In another embodiment of the present application, the third generating module 204 further includes:
a third generating module 204, configured to perform scale transformation on the target coordinate offset value in the cuda kernel function to obtain a target coordinate corresponding to the feature image.
In another embodiment of the present application, the cuda kernel functions include a preKernel kernel function, a scoreKernel kernel function, and a roiKernel kernel function.
Fig. 4 is a block diagram illustrating a logical structure of an electronic device in accordance with an exemplary embodiment. For example, the electronic device 300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium, such as a memory, including instructions executable by a processor of an electronic device to perform the above-described processing method for accelerating Anchor-based data in parallel by cuda, the method being applied on a GPU graphics card, wherein: acquiring a first output parameter output by a single-target tracking algorithm network model based on a characteristic image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient; generating a multi-size preset frame and a Hanning window value in parallel in a cuda kernel function, wherein the multi-size preset frame corresponds to the size of the characteristic image; carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value; and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the characteristic image. Optionally, the instructions may also be executable by a processor of the electronic device to perform other steps involved in the exemplary embodiments described above. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided an application/computer program product including one or more instructions executable by a processor of an electronic device to perform the above processing method for parallel acceleration of Anchor-based data by cuda, the method being applied on a GPU graphics card, wherein: acquiring a first output parameter output by a single-target tracking algorithm network model based on a characteristic image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient; generating a multi-size preset frame and a Hanning window value in parallel in a cuda kernel function, wherein the multi-size preset frame corresponds to the size of the characteristic image; carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value; and determining a coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the characteristic image. Optionally, the instructions may also be executable by a processor of the electronic device to perform other steps involved in the exemplary embodiments described above.
Fig. 4 is an exemplary diagram of an electronic device 300. Those skilled in the art will appreciate that fig. 4 is merely an example of the electronic device 300 and does not constitute a limitation thereof; the device may include more or fewer components than those shown, combine certain components, or use different components. For example, the electronic device 300 may also include input-output devices, network access devices, buses, etc.
The processor 301 may be a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU). The processor 301 is the control center of the electronic device 300 and connects the various parts of the entire electronic device 300 using various interfaces and lines.
The memory 302 may be used to store computer readable instructions, and the processor 301 may implement various functions of the electronic device 300 by executing the computer readable instructions or modules stored in the memory 302 and by invoking data stored in the memory 302. The memory 302 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the electronic device 300, and the like. In addition, the memory 302 may include a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Memory Card (Flash Card), at least one disk storage device, a flash memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), or other non-volatile/volatile storage devices.
If the modules integrated by the electronic device 300 are implemented in the form of software functional modules and sold or used as separate products, they may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method embodiments described above may be realized by instructing relevant hardware through computer readable instructions, which may be stored in a computer readable storage medium; when the computer readable instructions are executed by a processor, the steps of the above method embodiments can be realized.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (9)

1. A processing method for accelerating Anchor-based data in parallel by using cuda is characterized by being applied to a GPU (graphics processing Unit) display card, wherein:
acquiring a first output parameter output by a single-target tracking algorithm network model based on a characteristic image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient;
generating a multi-size preset frame and a Hanning window value in parallel in a cuda kernel function, wherein the multi-size preset frame corresponds to the size of the characteristic image;
carrying out scale penalty and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value;
and determining coordinate offset in the cuda kernel function based on the target coordinate offset value to obtain a target coordinate corresponding to the characteristic image.
2. The method of claim 1, wherein obtaining a first output parameter of the single-target tracking algorithm network model based on the feature image output comprises:
constructing the single target tracking algorithm network model;
and inputting the characteristic image and the initial target coordinate value into the single target tracking algorithm network model to obtain the target coordinate offset value and the confidence coefficient.
3. The method of claim 1, wherein the generating the multi-size preset bounding box and the hanning window value in parallel in the cuda kernel function comprises:
based on each grid of the feature image, generating the multi-size preset frame with a corresponding shape and a corresponding size in the cuda kernel function in parallel; and,
generating the Hanning window values of the same size as the feature image in parallel in the cuda kernel function.
4. The method of claim 1, wherein the performing a scale penalty and a hanning window on the confidence level in the cuda kernel function to obtain a maximum confidence level and an index value comprises:
generating a scale penalty coefficient corresponding to the size of the characteristic image in parallel in the cuda kernel function, and carrying out a scale penalty on the confidence coefficient by using the scale penalty coefficient; and,
and performing Hanning window processing in parallel in the cuda kernel function to obtain the maximum confidence coefficient and the index value.
5. The method as claimed in claim 1, wherein the determining a coordinate offset in the cuda kernel based on the target coordinate offset value to obtain a target coordinate corresponding to the feature image comprises:
and carrying out scale transformation on the target coordinate offset value in the cuda kernel function to obtain a target coordinate corresponding to the characteristic image.
6. The method of any one of claims 1-5, wherein the cuda kernel functions comprise a preKernel kernel function, a scoreKernel kernel function, and a roiKernel kernel function.
7. A processing device for accelerating Anchor-based data in parallel by using cuda is applied to a GPU (graphics processing Unit) display card, wherein:
the acquisition module is configured to acquire a first output parameter output by the single-target tracking algorithm network model based on the characteristic image, wherein the first output parameter comprises a target coordinate offset value and a confidence coefficient;
a first generation module configured to generate a multi-size preset frame and a Hanning window value in parallel in a cuda kernel, wherein the multi-size preset frame corresponds to the size of the feature image;
the second generation module is configured to perform scale punishment and Hanning window processing on the confidence coefficient in the cuda kernel function to obtain a maximum confidence coefficient and an index value;
and the third generation module is configured to determine coordinate offset in the cuda kernel function based on the target coordinate offset value, so as to obtain a target coordinate corresponding to the feature image.
8. An electronic device, comprising:
a memory for storing executable instructions; and,
a processor for communicating with the memory to execute the executable instructions so as to complete the operations of the processing method for accelerating Anchor-based data in parallel by using cuda according to any one of claims 1-6.
9. A computer-readable storage medium storing computer-readable instructions, wherein the instructions, when executed, perform the operations of any one of claims 1 to 6 using cuda to parallel accelerate the processing method of Anchor-based data.
CN202111252339.7A 2021-10-27 2021-10-27 Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment Active CN113704520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111252339.7A CN113704520B (en) 2021-10-27 2021-10-27 Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111252339.7A CN113704520B (en) 2021-10-27 2021-10-27 Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment

Publications (2)

Publication Number Publication Date
CN113704520A true CN113704520A (en) 2021-11-26
CN113704520B CN113704520B (en) 2022-03-08

Family

ID=78647007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111252339.7A Active CN113704520B (en) 2021-10-27 2021-10-27 Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment

Country Status (1)

Country Link
CN (1) CN113704520B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880509A (en) * 2012-09-17 2013-01-16 北京大学 Compute unified device architecture (CUDA) based grid digital elevation model (DEM) neighborhood analysis system and method
CN107124286A (en) * 2016-02-24 2017-09-01 深圳市知穹科技有限公司 A kind of mass data high speed processing, the system and method for interaction
CN108564213A (en) * 2018-04-10 2018-09-21 中国水利水电科学研究院 Parallel reservoir group flood control optimal scheduling method based on GPU acceleration
WO2021007514A1 (en) * 2019-07-10 2021-01-14 Schlumberger Technology Corporation Active learning for inspection tool
CN112986944A (en) * 2021-03-04 2021-06-18 西安电子科技大学 CUDA heterogeneous parallel acceleration-based radar MTI and MTD implementation method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880509A (en) * 2012-09-17 2013-01-16 北京大学 Compute unified device architecture (CUDA) based grid digital elevation model (DEM) neighborhood analysis system and method
CN107124286A (en) * 2016-02-24 2017-09-01 深圳市知穹科技有限公司 A kind of mass data high speed processing, the system and method for interaction
CN108564213A (en) * 2018-04-10 2018-09-21 中国水利水电科学研究院 Parallel reservoir group flood control optimal scheduling method based on GPU acceleration
WO2021007514A1 (en) * 2019-07-10 2021-01-14 Schlumberger Technology Corporation Active learning for inspection tool
CN112986944A (en) * 2021-03-04 2021-06-18 西安电子科技大学 CUDA heterogeneous parallel acceleration-based radar MTI and MTD implementation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHAN LI, LIANG YUBO: "Research on accuracy improvement of Anchor-based single-stage object detection algorithms", Journal of Beijing Polytechnic College *
ZHANG LELEZHANG: "Converting a PyTorch model from CPU to GPU", Cnblogs *

Also Published As

Publication number Publication date
CN113704520B (en) 2022-03-08

Similar Documents

Publication Publication Date Title
DE112020004625T5 (en) TRANSPOSED CONVOLUTION WITH SYSTOLIC ARRAY
CN109949219B (en) Reconstruction method, device and equipment of super-resolution image
DE112020003128T5 (en) DILATED CONVOLUTION WITH SYSTOLIC ARRAY
CN109583509B (en) Data generation method and device and electronic equipment
CN109086877A (en) A kind of device and method for executing convolutional neural networks forward operation
CN111931901A (en) Neural network construction method and device
CN114491399A (en) Data processing method and device, terminal equipment and computer readable storage medium
CN112508190A (en) Method, device and equipment for processing structured sparse parameters and storage medium
CN111738435A (en) Online sparse training method and system based on mobile equipment
CN111242286A (en) Data format conversion method and device and computer readable storage medium
CN109685208B (en) Method and device for thinning and combing acceleration of data of neural network processor
CN109697083B (en) Fixed-point acceleration method and device for data, electronic equipment and storage medium
CN113449878B (en) Data distributed incremental learning method, system, equipment and storage medium
CN113704520B (en) Method and device for accelerating Anchor-based data processing by using cuda in parallel and electronic equipment
CN113780365A (en) Sample generation method and device
CN116051699B (en) Dynamic capture data processing method, device, equipment and storage medium
CN116503608A (en) Data distillation method based on artificial intelligence and related equipment
CN113112084B (en) Training plane rear body research and development flow optimization method and device
CN111652051B (en) Face detection model generation method, device, equipment and storage medium
CN113840169B (en) Video processing method, device, computing equipment and storage medium
CN114155276A (en) Single-target tracking method and device, electronic equipment and storage medium
CN113724176A (en) Multi-camera motion capture seamless connection method, device, terminal and medium
CN114626284A (en) Model processing method and related device
CN111833395A (en) Direction-finding system single target positioning method and device based on neural network model
CN110428453A (en) Data processing method, device, data processing equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant