CN115546009B - Optimization method, device and equipment of non-maximum suppression algorithm and storage medium - Google Patents

Optimization method, device and equipment of non-maximum suppression algorithm and storage medium

Info

Publication number
CN115546009B
Authority
CN
China
Prior art keywords
candidate frames
gpu
sorting
candidate
gpu thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211508128.XA
Other languages
Chinese (zh)
Other versions
CN115546009A (en)
Inventor
王晓芸
谢琦
刘海峰
李乐乐
王子磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Zhongke Leinao Intelligent Technology Co ltd
Original Assignee
Hefei Zhongke Leinao Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Zhongke Leinao Intelligent Technology Co ltd filed Critical Hefei Zhongke Leinao Intelligent Technology Co ltd
Priority to CN202211508128.XA priority Critical patent/CN115546009B/en
Publication of CN115546009A publication Critical patent/CN115546009A/en
Application granted granted Critical
Publication of CN115546009B publication Critical patent/CN115546009B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/60Memory management

Abstract

The invention discloses a method, a device and equipment for optimizing a non-maximum suppression algorithm, and a storage medium. The method comprises the following steps: sorting N candidate frames stored in the global memory based on GPU threads in M first GPU thread groups, which includes determining the number of sorting rounds and the number of first GPU thread groups used in each round according to N and a first target parallelism, and, in the i-th sorting round, having each GPU thread in each first GPU thread group select 2^i candidate frames from the sorting results of the (i-1)-th round and sort them; counting the sorted candidate frames based on GPU threads in K second GPU thread groups, wherein K is determined according to N and a second target parallelism; grouping the sorted candidate frames according to the counting result; calculating an Intersection over Union (IOU) matrix for each group of candidate frames; and determining target detection frames from the N candidate frames according to the IOU matrices.

Description

Optimization method, device and equipment of non-maximum suppression algorithm and storage medium
Technical Field
The application relates to the technical field of image detection, in particular to an optimization method of a non-maximum suppression algorithm.
Background
The non-maximum suppression algorithm removes redundant non-maximum targets in a local area in order to retain the local maximum target. With the rapid development of deep learning in computer vision tasks (e.g., target detection, target segmentation), it is commonly used to screen candidate frames in deep learning target detection algorithms.
In the related art, the non-maximum suppression algorithm is executed jointly by a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU). Data are frequently copied between the CPU and the GPU during execution, and when the amount of processed data is large, the copy time grows, which reduces the detection efficiency of the target detection frame.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. Therefore, an object of the present invention is to provide an optimization method for the non-maximum suppression algorithm that can improve the detection efficiency of the target detection frame.
The optimization method of the non-maximum suppression algorithm according to the embodiment of the invention comprises the following steps:
sorting the N candidate boxes stored in the global memory based on the GPU threads in the M first GPU thread groups, including:
determining the number of sorting turns and the number of first GPU thread groups used by each sorting turn according to the N and the first target parallelism, wherein M is the sum of the number of the first GPU thread groups used by each sorting turn;
in the i-th sorting round, each GPU thread in each first GPU thread group used selects 2^i candidate frames from the sorting results of the (i-1)-th sorting round and sorts them, wherein i is a positive integer;
counting the sorted candidate frames based on GPU threads in K second GPU thread groups, wherein K is determined according to N and a second target parallelism;
and grouping the sorted candidate frames according to the statistical result, calculating an Intersection over Union (IOU) matrix for each group of candidate frames, and determining a target detection frame from the N candidate frames according to the IOU matrix.
In the above scheme, in the i-th sorting round, candidate frames of the same category are sorted in descending order of confidence, and candidate frames of different categories are sorted in ascending order of category value.
In the foregoing solution, counting the sorted candidate frames based on the GPU threads in the K second GPU thread groups includes:
dividing the counting process into two counting rounds, and setting K-1 groups of shared memory blocks corresponding to K-1 second GPU thread groups used by the first counting round, wherein each group of shared memory blocks comprises a first shared memory block and a second shared memory block;
in the first counting round, candidate frame loading is executed multiple times, and in each loading, a preset thread in the current second GPU thread group stores the first P_count candidate frames of the current candidate frame sequence into the corresponding first shared memory block, candidate frames are loaded from the first shared memory block through GPU threads in the current second GPU thread group for comparison, and the comparison result is stored into the corresponding second shared memory block, wherein P_count is the second target parallelism and the current candidate frame sequence is the sequence formed by the currently remaining candidate frames of the sorted candidate frames;
in the second counting round, the data stored in each second shared memory block are counted by the GPU threads in the remaining second GPU thread group, and the counting result is stored in the global memory.
In the foregoing solution, the comparison result is a location interval of each category candidate frame in the corresponding first shared memory block, and the statistical result is a location interval of each category candidate frame in the sorted candidate frames in the global memory.
In the above scheme, when candidate frames are loaded from the first shared memory block by the GPU threads in the current second GPU thread group for comparison, each GPU thread compares two adjacent candidate frames according to its own index value;
when the two compared candidate frames differ in category, the category of the former of the two candidate frames and the value obtained by adding 1 to the index value of the GPU thread used are stored in the corresponding second shared memory block.
In the above scheme, grouping the sorted candidate frames according to the statistical result includes:
and grouping the sorted candidate frames according to the position interval of each category of candidate frames in the global memory in the sorted candidate frames, wherein the candidate frames in each group have the same category.
In the above scheme, P_sort is the first target parallelism and P_count is the second target parallelism; the number of first GPU thread groups is M = (N + P_sort - 1)/P_sort, and the number of second GPU thread groups is K = (N + P_count - 1)/P_count + 1.
The invention also provides an optimization device of the non-maximum suppression algorithm, which comprises the following steps:
the sorting module is used for sorting the N candidate frames stored in the global memory based on GPU threads in the M first GPU thread groups, wherein the sorting module is specifically used for determining, according to N and the first target parallelism, the number of sorting rounds and the number of first GPU thread groups used in each sorting round, and M is the sum of the numbers of the first GPU thread groups used in all sorting rounds; in the i-th sorting round, each GPU thread in each first GPU thread group used selects 2^i candidate frames from the sorting results of the (i-1)-th sorting round and sorts them, wherein i is a positive integer;
the counting module is used for counting the sorted candidate frames based on GPU threads in K second GPU thread groups, wherein K is determined according to N and a second target parallelism;
and the determining module is used for grouping the sorted candidate frames according to the statistical result, calculating an Intersection over Union (IOU) matrix for each group of candidate frames respectively, and determining a target detection frame from the N candidate frames according to the IOU matrix.
The invention also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the optimization method of the non-maximum suppression algorithm when executing the computer program.
The present application also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the optimization method of the non-maximum suppression algorithm described above.
According to the optimization method of the non-maximum suppression algorithm, the whole algorithm can be executed end to end on the GPU; that is, all operations other than data input and data output run on the GPU. This reduces unnecessary data copying during execution, shortens the screening time of the target detection frame, and thus improves the screening efficiency of the target detection frame.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is an overall block diagram of a non-maxima suppression algorithm in one embodiment;
FIG. 2 is a schematic flow chart diagram of a method for optimizing a non-maxima suppression algorithm in one embodiment;
FIG. 3 is a flow diagram that illustrates sorting 8 candidate boxes in parallel in one embodiment;
FIG. 4 is a flow diagram of a second GPU thread group performing the first counting round on candidate frames in parallel in one embodiment;
fig. 5 is a block diagram of an optimization apparatus for a non-maxima suppression algorithm in one embodiment.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Before describing embodiments of the present invention, the non-maximum suppression algorithm used in image detection in the related art is first described.
The non-maximum suppression algorithm is generally used to screen detection frames in deep learning target detection algorithms. In practical applications, both the training time and the running time of the algorithm need to be shortened, but the strong data dependence within the non-maximum suppression algorithm forces serial execution, which limits the training efficiency and running efficiency of the algorithm.
The non-maxima suppression algorithm may be divided into three execution phases: classification, sorting and screening. Specifically, the execution flow of the non-maximum suppression algorithm is as follows, taking category 0 first:
step 1), taking out all candidate frames belonging to the category 0 from the input candidate frames through classification;
step 2), sorting the candidate frames obtained in the step 1) in a descending order according to the confidence degree through sorting;
step 3), screening the sorting result obtained in the step 2) based on the standard of inhibiting the non-maximum value to obtain a final result;
and 4) respectively executing the steps 1) to 3) for other categories, so that the target detection frame can be obtained through screening.
In the related art, in a development scenario, the calculation of the Intersection over Union (IOU) matrix, the most time-consuming part of the NMS algorithm, is parallelized in a CPU + GPU manner: the IOU matrix calculation runs on the GPU while the rest runs on the CPU. The IOU matrix is used to screen candidate frames; that is, the GPU performs step 3) of the non-maximum suppression algorithm while the CPU performs steps 1) and 2).
In a deployment scenario, several computer vision tasks are usually combined. For example, to complete a digital meter reading task, it may be necessary to first locate the electronic screen using target detection, and then read the content on the electronic screen using an Optical Character Recognition (OCR) algorithm, with possibly complex operations between the two tasks. If the parallelization approach of the development scenario is carried over, only joint CPU-GPU execution is possible, with the core algorithm running on the GPU. In this mode, frequent data copies between the CPU and the GPU are hard to avoid, and when the data volume is large, these copies take a long time.
Reorganizing the execution flow of the non-maximum suppression algorithm shows that the classification step and the sorting step can be merged and executed together. As shown in fig. 1, the overall architecture of the non-maximum suppression algorithm comprises a sorting module, a counting module and a frame screening module, where the sorting module merges the classification module and the sorting module of the related art, i.e., it realizes both the classification and the sorting of the non-maximum suppression algorithm. The counting module prepares input data for the frame screening module. The frame screening module may use a parallelized non-maximum suppression method oriented to the development scenario, for example one implemented on the Compute Unified Device Architecture (CUDA). In fig. 1, the sorting module, the counting module and the frame screening module all run on the GPU.
The following describes details of implementation of the technical solution of the embodiment of the present invention with reference to the drawings.
In one embodiment, as shown in fig. 2, a method for optimizing a non-maximum suppression algorithm is provided, and the non-maximum suppression algorithm may include the steps of:
step S201, rank the N candidate frames stored in the global memory based on the GPU threads of the M first GPU thread groups.
Here, the first target parallelism for sorting is set to P_sort. According to the number N of candidate frames and the first target parallelism P_sort, the number M of first GPU thread groups is determined as M = (N + P_sort - 1)/P_sort, and the number of GPU threads in each first GPU thread group is set equal to P_sort, so that the N candidate frames stored in the global memory can be processed in parallel by the M first GPU thread groups to obtain the sorting result of the N candidate frames.
Step S201 will be described in detail.
Here, multiple sorting rounds need to be performed on the N candidate frames stored in the global memory to obtain their sorting result, where the number T of sorting rounds is determined by the number N of candidate frames, specifically T = log2(N), rounded up. In practical applications, the number of first GPU thread groups used changes between sorting rounds; the number used in each round is determined according to N and the first target parallelism P_sort, and M is the sum of the numbers of first GPU thread groups used over all sorting rounds.
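The sizing above can be computed on the host before any kernel launch. The following is a minimal host-side sketch in CUDA C++, assuming example values for N and for the target parallelisms P_sort and P_count (P_count and the group count K belong to step S202 below); all names and values here are illustrative assumptions, not fixed by the text.

    // Host-side sizing sketch; all concrete values are assumed examples.
    #include <cmath>
    #include <cstdio>

    int main() {
        const int N       = 8;   // number of candidate frames (example)
        const int P_sort  = 2;   // first target parallelism (assumed)
        const int P_count = 8;   // second target parallelism (assumed)

        // T = log2(N), rounded up: each round doubles the sorted run length,
        // so 8 candidate frames need 3 sorting rounds (cf. fig. 3).
        const int T = (int)std::ceil(std::log2((double)N));

        // Ceiling divisions, as given in the text:
        const int M = (N + P_sort - 1) / P_sort;        // first GPU thread groups
        const int K = (N + P_count - 1) / P_count + 1;  // second GPU thread groups,
                                                        // +1 for the second counting round
        std::printf("T=%d M=%d K=%d\n", T, M, K);       // prints: T=3 M=4 K=2
        return 0;
    }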
In each sorting round, the number of candidate frames processed depends on the round: in the i-th sorting round, each GPU thread in each first GPU thread group used processes Stride = 2^i candidate frames. That is, in the first sorting round each GPU thread used processes two candidate frames, in the second round four candidate frames, and so on. When i = 1, the candidate frames input to the first GPU thread groups are taken in their original order. In each sorting round, two candidate frames are treated as a pair, and sorting within the round follows a preselected sorting method, for example merge sort or bubble sort. In practical applications, both the categories and the confidences of the candidate frames are compared within each sorting round.
In one embodiment, a dual condition is set for ordering the candidate frames. Specifically, in the i-th sorting round, each GPU thread in each first GPU thread group used sorts its 2^i candidate frames by category and confidence. First, the categories of the 2^i candidate frames are compared: if the categories differ, the frames are sorted in ascending order of category value, and the comparison ends. If the candidate frames belong to the same category, they are sorted from high to low by confidence. During sorting, if the positions of two candidate frames need to be swapped, the swap is performed in place. Therefore, after T sorting rounds, the N candidate frames are ordered from small to large by category value, and the candidate frames within the same category are ordered from large to small by confidence.
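The dual condition can be captured in a small device-side comparator. The sketch below is an illustration under assumptions rather than the patent's exact kernel: it assumes a Box record holding a category value cls and a confidence conf, and a pairwise in-place exchange as described above.

    struct Box {
        int   cls;   // category value Ci
        float conf;  // confidence score
    };

    // True if box a should precede box b: different categories sort by
    // ascending category value; equal categories sort by descending confidence.
    __device__ bool precedes(const Box& a, const Box& b) {
        if (a.cls != b.cls) return a.cls < b.cls;
        return a.conf > b.conf;
    }

    // Swap in place when a compared pair is out of order, matching the
    // "performed in place" behaviour described above.
    __device__ void order_pair(Box& a, Box& b) {
        if (!precedes(a, b)) {
            Box tmp = a; a = b; b = tmp;
        }
    }

Within the i-th round, each thread would apply this comparator while merging two already-sorted runs of length 2^(i-1) into one run of length 2^i.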
Fig. 3 shows a flow chart of sorting 8 candidate boxes in parallel. The parallel execution of the sorting is described below in conjunction with the concrete example of fig. 3.
In fig. 3, T = 1, T = 2 and T = 3 denote the different sorting rounds; Stride denotes the number of candidate boxes each GPU thread in a first GPU thread group processes in that round; thread index values distinguish the different GPU threads within a first GPU thread group; and data indexes give the order of the input candidate boxes. For example, data index 0 denotes the first candidate box C1|0.2, where C1 is the box's category value and 0.2 is its confidence, i.e., the two quantities that are compared during sorting.
In fig. 3, 8 candidate frames are sorted, which requires 3 sorting rounds. In each round, every GPU thread in each first GPU thread group sorts Stride candidate frames, taking 2 candidate frames at a time for comparison. The category values Ci of the two frames are compared first; if they differ, the frames are ordered by ascending Ci. For example, in the first round, the GPU thread with index 0 of the first GPU thread group takes the first and second candidate frames, whose category values differ (C1 and C2 respectively), so the result places the first frame before the second, written "C1|0.2, C2|0.3", and the comparison ends. If the two frames share the same Ci, they are ordered by descending confidence. For example, in the first round, the GPU thread with index 1 takes the third and fourth candidate frames, which share a category; their confidences are 0.5 and 0.9, so the result is "C1|0.9, C1|0.5", and the comparison ends.
In this embodiment, executing the candidate-box sorting in parallel across multiple first GPU thread groups on the GPU significantly reduces the overall time complexity compared with the sequential CPU execution of the related art. In the conventional non-maximum suppression algorithm, let T(n)_class denote the time complexity of the classification module; then T(n)_class = O(n). Let T(n)_sort denote the time complexity of the sorting module; this depends on the sorting algorithm: merge sort has T(n)_sort = O(n log n), while bubble sort has T(n)_sort = O(n^2). Taking merge sort as an example, the overall time complexity T(c,n)_total of classification plus sorting is:

T(c,n)_total = c × (T(n)_class + T(n)_sort) = O(c × n log n)    (1)
In this embodiment, the classification module and the sorting module are merged into the sorting module of fig. 1; after it processes the candidate frames, all input candidate frames are ordered by ascending category value and, within each category, by descending confidence. Taking merge sort as an example, the overall time complexity of the merged sorting is:

T(c,n)_total = T(n)_sort = O(n log n)    (2)
In the above formulas, n denotes the number of candidate frames to be processed and c denotes the number of categories, with c ≥ 1. Comparing formula (1) with formula (2) shows that this embodiment removes the factor c from the overall time complexity of the non-maximum suppression algorithm, so the execution latency of the algorithm can be significantly reduced.
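As an illustrative check with assumed numbers: for c = 3 categories and n = 8 candidate frames, formula (1) gives on the order of 3 × 8 × log2(8) = 72 comparison steps, while formula (2) gives 8 × log2(8) = 24, i.e., merging classification into the sort removes the c-fold factor.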
Step S202, counting the sorted candidate frames based on GPU threads in the K second GPU thread groups.
The frame screening module of fig. 1 loads candidate frames of a single category as input each time; the counting module bridges the sorting module and the frame screening module, preparing input data for the latter.
Here, a second target parallelism P_count is set. According to the second target parallelism P_count and the number N of candidate boxes, the number K of second GPU thread groups is determined as K = (N + P_count - 1)/P_count + 1. The sorted candidate frames are counted based on the GPU threads in the K second GPU thread groups to obtain the start position and end position of each category of candidate frames within the sorted sequence, thereby preparing input data for the subsequent frame screening module.
Taking fig. 3 as an example, [C1|0.9, C1|0.8, C1|0.5, C1|0.2, C2|0.5, C2|0.3, C3|0.4, C3|0.2] are the sorted candidate frames; the start and end positions of the first category C1 are 0 and 4, and those of the second category C2 are 4 and 6. Without the counting module, i.e., if the output of the sorting module were fed directly to the frame screening module, the frame screening module would have to compute an 8 × 8 IOU matrix, 64 computations in total. With the counting module added, it only needs to compute one 4 × 4 IOU matrix and two 2 × 2 IOU matrices, 24 computations in total, which reduces the amount of data computation and the execution latency of the algorithm.
In one embodiment, the process of counting the sorted candidate frames based on the GPU threads in the K second GPU thread groups is divided into two counting rounds, which are described in detail below.
In the first counting round, K - 1 second GPU thread groups are used, each containing P_count threads. K - 1 groups of shared memory blocks are set for the K - 1 second GPU thread groups, i.e., each second GPU thread group corresponds to one group of shared memory blocks, and each group comprises a first shared memory block and a second shared memory block. Each thread in a second GPU thread group corresponds to one memory address in the first shared memory block and one in the second shared memory block; the first shared memory block stores the input candidate frames, and the second shared memory block stores the output results of the second GPU thread group. In practical applications, the sizes of the first and second shared memory blocks are both P_count. Using shared memory during counting reduces the latency caused by frequently loading data from the global memory.
The currently remaining candidate frames of the sorted candidate frames form the current candidate frame sequence. A preset GPU thread in the current second GPU thread group (the thread with index value 0) stores the first P_count candidate frames of the current candidate frame sequence into the corresponding first shared memory block, each of the P_count loaded candidate frames going to its assigned storage address in that block. For the candidate frame sequence produced by the sorting module of fig. 3, [C1|0.9, C1|0.8, C1|0.5, C1|0.2, C2|0.5, C2|0.3, C3|0.4, C3|0.2], the first P_count candidate frames are stored into the corresponding first shared memory block: the first candidate frame (i.e., C1|0.9) is stored at the memory address corresponding to the first GPU thread (the thread with index value 0) of the second GPU thread group, and so on.
After the preset GPU thread of the second GPU thread group finishes copying the candidate frames, the GPU threads of the group load candidate frames from the first shared memory block for comparison, which reduces the latency caused by repeatedly loading candidate frames from the global memory.
The GPU threads of the second GPU thread group store the comparison results produced by comparing the loaded candidate frames into the second shared memory block. In practical applications, the comparison result is the position interval of each category of candidate frames within the corresponding first shared memory block, from which the start position and end position of each category in that block can be determined.
In one embodiment, when candidate frames are loaded from the first shared memory block by the GPU threads in the current second GPU thread group for comparison, each GPU thread compares two adjacent candidate frames according to its own index value. Specifically, each GPU thread processes two adjacent candidate frames: the one stored at the memory address corresponding to the current GPU thread, and the one stored at the memory address corresponding to the thread whose index is the current thread's index plus 1. For example, the GPU thread with index value 0 in the second GPU thread group takes the candidate frame at the address of index 0 and the candidate frame at the address of index 1 in the first shared memory block, thereby comparing the first and second candidate frames of the current candidate frame sequence. Each GPU thread mainly compares the categories of the two adjacent candidate frames; when it finds that the categories differ, it stores the category of the former candidate frame together with its own index value plus 1 into the corresponding second shared memory block, so that the end position of that category within the first shared memory block can be determined. Suppose the GPU thread with index value 2 compares candidate frames A and B and their categories differ: the thread stores the category C1 of frame A and the value 3 at the memory address corresponding to index 2 in the second shared memory block. That address then holds "C1|3", indicating that the candidate frames of category C1 end at position 3 in the first shared memory block, i.e., the first 4 candidate frames stored there belong to category C1 (position 0 corresponds to the first candidate frame).
In the second counting round, one second GPU thread group is used; it counts the data stored in the second shared memory blocks and stores the statistical result in the global memory. The statistical result is the position interval, in the global memory, of each category of candidate frames within the sorted sequence, so that candidate frames of the same category can be extracted as input for the frame screening module according to the statistical result. It should be understood that the sorted candidate frames form a candidate frame sequence; in the first counting round, each second GPU thread group takes P_count candidate frames from this sequence and compares their categories, so each comparison result essentially covers only the P_count candidate frames input to that group. The second counting round aggregates the comparison results of the first round, so the statistical result gives the end position of each category within the sorted candidate frames.
Fig. 4 shows a flowchart of a second GPU thread group performing the first counting round on candidate frames in parallel. In fig. 4, the first shared memory block stores the candidate frames, the second shared memory block stores the comparison results, and Ci denotes the category of a candidate frame; one of the inputs of the last thread is a null input, which allows the end position of the category-C3 candidate frames among the 8 candidate frames input to the second GPU thread group to be determined. The counting process is described step by step below, followed by a code sketch.
Step 1, the thread with index value 0 in the second GPU thread group loads P_count candidate frames from the global memory into the first shared memory block. During loading, each candidate frame is placed at the memory address corresponding to its data index value, which reflects the frame's order: for example, the candidate frame with data index 0, the first frame of the current candidate frame sequence, is loaded to the address corresponding to thread 0, and so on until the input candidate frames are loaded.
Step 2, each thread in the second GPU thread group takes two candidate frames from the first shared memory block in turn: the thread with index value 0 takes the two candidate frames at positions 0 and 1, the thread with index value 1 takes the two candidate frames at positions 1 and 2, and so on. Each thread compares the categories of its two candidate frames; in the case that the categories differ, the value of the current thread's index + 1 is stored into the second shared memory block entry corresponding to the current thread.
Step 3, the thread with index value 0 in the second GPU thread group writes the data stored in the second shared memory block into the global memory; only entries of the second shared memory block that actually hold data are stored. In practical applications, each second GPU thread group writes to a different location in the global memory: the first group writes into the memory addresses of the interval [0, P_count - 1], the second group into the interval [P_count - 1, 2 × P_count - 1], and so on.
Step 4, executing the second counting round, in which steps 1 to 3 are repeated and the comparison results obtained by each second GPU thread group in the first round are merged to obtain the statistical result.
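The first counting round described in steps 1 to 3 can be sketched as a single CUDA kernel. The sketch below is a simplified illustration under assumptions: one block per second GPU thread group with P_count threads each, the Box record from the earlier sketch, a hypothetical Boundary record for the comparison results, and the one-element overlap between adjacent groups' write intervals omitted for brevity.

    #define P_COUNT 8                          // second target parallelism (assumed)

    struct Box      { int cls; float conf; };  // as in the earlier sketch
    struct Boundary { int cls; int end; };     // (category, end position); cls = -1 means empty

    __global__ void count_round1(const Box* sorted, int n, Boundary* out) {
        __shared__ Box      s_box[P_COUNT];    // first shared memory block
        __shared__ Boundary s_cmp[P_COUNT];    // second shared memory block
        const int base = blockIdx.x * P_COUNT;
        const int tid  = threadIdx.x;

        // Step 1: the preset thread (index 0) copies up to P_count boxes of
        // the current candidate frame sequence into the first shared block.
        if (tid == 0)
            for (int j = 0; j < P_COUNT && base + j < n; ++j)
                s_box[j] = sorted[base + j];
        s_cmp[tid] = Boundary{ -1, 0 };        // initialise as "no boundary here"
        __syncthreads();

        // Step 2: thread tid compares boxes tid and tid + 1; a category change
        // records (category of the former box, tid + 1).
        if (tid + 1 < P_COUNT && base + tid + 1 < n) {
            if (s_box[tid].cls != s_box[tid + 1].cls)
                s_cmp[tid] = Boundary{ s_box[tid].cls, tid + 1 };
        } else if (base + tid < n) {
            // Rightmost valid box faces a null input, closing its category
            // (cf. the last thread in fig. 4).
            s_cmp[tid] = Boundary{ s_box[tid].cls, tid + 1 };
        }
        __syncthreads();

        // Step 3: thread 0 writes this group's results back to its interval
        // of global memory (empty entries included here for simplicity).
        if (tid == 0)
            for (int j = 0; j < P_COUNT; ++j)
                out[base + j] = s_cmp[j];
    }

    // First counting round over K - 1 groups (assumed launch):
    //   count_round1<<<K - 1, P_COUNT>>>(d_sorted, N, d_boundaries);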
Step S203, grouping the sorted candidate frames according to the statistical result, respectively calculating an IOU matrix for each group of candidate frames, and determining a target detection frame from the N candidate frames according to the IOU matrix.
The sorted candidate frames are grouped according to the statistical result, mainly by category, so that the candidate frames of each group share the same category. Each group can then be fed in turn to the frame screening module, the IOU matrix of each group of candidate frames is calculated respectively, and the target detection frames are determined from the N candidate frames according to the IOU matrices.
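The IOU between two candidate frames is the area of their intersection divided by the area of their union. A minimal device-side sketch follows; the corner-format geometry (x1, y1, x2, y2) is an assumption for illustration, since the text does not fix a box layout.

    // Assumed corner-format geometry (top-left and bottom-right corners).
    struct BoxGeom { float x1, y1, x2, y2; };

    // Intersection over Union of two axis-aligned boxes.
    __device__ float iou(const BoxGeom& a, const BoxGeom& b) {
        const float iw = fmaxf(0.0f, fminf(a.x2, b.x2) - fmaxf(a.x1, b.x1));
        const float ih = fmaxf(0.0f, fminf(a.y2, b.y2) - fmaxf(a.y1, b.y1));
        const float inter  = iw * ih;
        const float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
        const float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
        return inter / (area_a + area_b - inter + 1e-6f);  // epsilon avoids /0
    }

    // One same-category group of m boxes yields an m x m IOU matrix,
    // one entry per (row, column) thread pair.
    __global__ void iou_matrix(const BoxGeom* boxes, int m, float* out) {
        const int r = blockIdx.y * blockDim.y + threadIdx.y;
        const int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (r < m && c < m) out[r * m + c] = iou(boxes[r], boxes[c]);
    }

For the example of fig. 3, this kernel would be launched once per group, i.e., for the 4-box C1 group and the two 2-box groups, giving the 16 + 4 + 4 = 24 computations counted above.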
In one embodiment, grouping the sorted candidate boxes according to the statistical result is explained in detail.
Here, the statistical result contains the end positions, in the global memory, of the candidate frames of the different categories. When grouping the sorted candidate frames, the frames with category value C1 may be grouped first: their end position in the global memory is read from the statistical result, and their start position is the end position of the previous category. This yields the position interval of category C1 in the global memory, from which the category-C1 candidate frames are extracted from the sorted sequence; proceeding in the same way for the remaining categories completes the grouping of the sorted candidate frames.
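In code, this grouping reduces to turning the counted end positions into half-open intervals. A small host-side sketch, assuming the statistical result has been compacted into an array of per-category end positions (a hypothetical layout):

    #include <utility>
    #include <vector>

    // ends[j] = exclusive end position of the j-th category present in the
    // sorted sequence, e.g. {4, 6, 8} for the example of fig. 3.
    // Each category's start is the previous category's end.
    std::vector<std::pair<int, int>> group_intervals(const std::vector<int>& ends) {
        std::vector<std::pair<int, int>> groups;
        int start = 0;
        for (int end : ends) {
            groups.push_back({start, end});  // one same-category group
            start = end;                     // next category begins here
        }
        return groups;
    }

For {4, 6, 8} this yields the intervals [0, 4), [4, 6) and [6, 8), i.e., exactly the three groups fed to the frame screening module.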
In the embodiment of the present application, the entire flow of the non-maximum suppression algorithm is executed end to end on the GPU, which avoids the latency caused by data copies between the CPU and the GPU and improves running efficiency. Experimental comparison shows that, relative to the partially parallelized non-maximum suppression algorithm oriented to the development scenario, the GPU end-to-end non-maximum suppression algorithm provided by the invention improves running efficiency by about 1.2 times.
The non-maximum suppression algorithm is rarely used on its own; at present it serves as one module in deployed deep learning target detection pipelines, whose overall processing flow is preprocessing, then network inference, then post-processing (including the non-maximum suppression algorithm and other algorithms), where the preprocessing, the network inference and the post-processing algorithms other than non-maximum suppression all execute on the GPU. Experimental comparison shows that, in actual deployment scenarios in which multiple networks are chained in series, the deeper the chain, the more pronounced the effect of the optimization method of the non-maximum suppression algorithm provided by the invention.
In the above embodiments, the N candidate frames stored in the global memory are sorted based on GPU threads in the M first GPU thread groups, the sorted candidate frames are counted based on GPU threads in the K second GPU thread groups and grouped according to the counting result, an IOU matrix is calculated for each group of candidate frames, and the target detection frames are determined from the N candidate frames according to the IOU matrices. The whole flow of the non-maximum suppression algorithm is thus implemented on the GPU, data copies between the CPU and the GPU are avoided, the running efficiency of the algorithm is improved, and latency is reduced.
In one embodiment, there is provided an optimization apparatus of a non-maximum suppression algorithm, and referring to fig. 5, the optimization apparatus 500 of the non-maximum suppression algorithm may include: a ranking module 501, a statistics module 502, and a determination module 503.
The sorting module 501 is configured to sort the N candidate frames stored in the global memory based on GPU threads in the M first GPU thread groups; specifically, it determines, according to N and the first target parallelism, the number of sorting rounds and the number of first GPU thread groups used in each sorting round, where M is the sum of the numbers of first GPU thread groups used in all sorting rounds; in the i-th sorting round, each GPU thread in each first GPU thread group used selects 2^i candidate frames from the sorting results of the (i-1)-th sorting round and sorts them, where i is a positive integer. The counting module 502 is configured to count the sorted candidate frames based on GPU threads in K second GPU thread groups, where K is determined according to N and the second target parallelism. The determining module 503 is configured to group the sorted candidate frames according to the statistical result, calculate an Intersection over Union (IOU) matrix for each group of candidate frames, and determine the target detection frames from the N candidate frames according to the IOU matrices.
In one embodiment, in the i-th sorting round, the sorting module 501 sorts candidate frames of the same category in descending order of confidence, and candidate frames of different categories in ascending order of category value.
In an embodiment, the statistics module 502 is specifically configured to divide the counting process into two counting rounds and set K - 1 groups of shared memory blocks corresponding to the K - 1 second GPU thread groups used in the first counting round, each group of shared memory blocks comprising a first shared memory block and a second shared memory block; in the first counting round, candidate frames are loaded multiple times, and in each loading a preset thread in the current second GPU thread group stores the first P_count candidate frames of the current candidate frame sequence into the corresponding first shared memory block, the GPU threads in the current second GPU thread group load candidate frames from the first shared memory block for comparison, and the comparison results are stored into the corresponding second shared memory block, where P_count is the second target parallelism and the current candidate frame sequence is the sequence formed by the currently remaining candidate frames of the sorted candidate frames; in the second counting round, the data stored in each second shared memory block are counted by the GPU threads in the remaining second GPU thread group, and the counting result is stored in the global memory.
In an embodiment, the comparison result in the statistical module 502 is a position interval of each category candidate frame in the corresponding first shared memory block, and the statistical result is a position interval of each category candidate frame in the sorted candidate frames in the global memory.
In an embodiment, the statistics module 502 is further configured so that, when candidate frames are loaded from the first shared memory block by the GPU threads in the current second GPU thread group for comparison, each GPU thread compares two adjacent candidate frames according to its own index value; when the two compared candidate frames differ in category, the category of the former of the two candidate frames and the value obtained by adding 1 to the index value of the GPU thread used are stored into the corresponding second shared memory block.
In an embodiment, the determining module 503 is specifically configured to group the sorted candidate frames according to a position interval of each category candidate frame in the global memory in the sorted candidate frames, where the candidate frames in each group have the same category.
In one embodiment, P_sort is the first target parallelism and P_count is the second target parallelism; the number of first GPU thread groups is M = (N + P_sort - 1)/P_sort, and the number of second GPU thread groups is K = (N + P_count - 1)/P_count + 1.
In one embodiment, an electronic device is provided that includes a memory storing a computer program and a processor that, when executing the computer program, implements a method of optimizing a non-maxima suppression algorithm.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when executed by a processor, is adapted to carry out a method of optimization of a non-maxima suppression algorithm.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A method for optimizing a non-maximum suppression algorithm, comprising:
sorting the N candidate boxes stored in the global memory based on the GPU threads in the M first GPU thread groups, including:
determining the number of sorting turns and the number of first GPU thread groups used by each sorting turn according to the N and the first target parallelism, wherein M is the sum of the number of the first GPU thread groups used by each sorting turn;
in the i-th sorting round, each GPU thread in each first GPU thread group used selects 2^i candidate frames from the sorting results of the (i-1)-th sorting round and sorts them, wherein i is a positive integer;
counting the sorted candidate frames based on GPU threads in K second GPU thread groups, wherein K is determined according to N and a second target parallelism;
and grouping the sorted candidate frames according to a statistical result, calculating an Intersection over Union (IOU) matrix for each group of candidate frames, and determining a target detection frame from the N candidate frames according to the IOU matrix.
2. The method of claim 1, wherein in the ith sorting pass, the candidate frames with the same category are sorted in descending order of confidence level, and the candidate frames with different categories are sorted in ascending order of category value.
3. The method of optimizing a non-maxima suppression algorithm of claim 1, wherein the counting the ordered candidate boxes based on GPU threads in the K second groups of GPU threads includes:
dividing the counting process into two counting rounds, and setting K-1 groups of shared memory blocks corresponding to K-1 second GPU thread groups used by the first counting round, wherein each group of shared memory blocks comprises a first shared memory block and a second shared memory block;
in the first counting round, candidate frame loading is executed multiple times, and in each loading, a preset thread in the current second GPU thread group stores the first P_count candidate frames of the current candidate frame sequence into the corresponding first shared memory block, candidate frames are loaded from the first shared memory block through the GPU threads in the current second GPU thread group for comparison, and the comparison result is stored into the corresponding second shared memory block, wherein P_count is the second target parallelism and the current candidate frame sequence is the sequence formed by the currently remaining candidate frames of the sorted candidate frames;
in a second statistical round, the stored data of each second shared memory block is counted by the GPU threads in the remaining second GPU thread group, and a statistical result is stored in the global memory.
4. The method according to claim 3, wherein the comparison result is a position interval of each category candidate frame in the corresponding first shared memory block, and the statistical result is a position interval of each category candidate frame in the sorted candidate frames in the global memory.
5. The method according to claim 3, wherein when candidate frames are loaded from the first shared memory block by GPU threads in the current second GPU thread group for comparison, each GPU thread compares two adjacent candidate frames according to its own index value;
when the two compared candidate frames differ in category, the category of the former of the two candidate frames and the value obtained by adding 1 to the index value of the GPU thread used are stored into the corresponding second shared memory block.
6. The method of optimizing a non-maximum suppression algorithm according to claim 1, wherein said grouping the ranked candidate boxes according to statistics comprises:
and grouping the sorted candidate frames according to the position interval of each type of candidate frame in the global memory in the sorted candidate frames, wherein the candidate frames in each group have the same type.
7. The method of optimizing a non-maxima suppression algorithm of claim 1, wherein P_sort is the first target parallelism and P_count is the second target parallelism; the number of the first GPU thread groups is M = (N + P_sort - 1)/P_sort, and the number of the second GPU thread groups is K = (N + P_count - 1)/P_count + 1.
8. An apparatus for optimizing a non-maximum suppression algorithm, comprising:
the sorting module is used for sorting the N candidate frames stored in the global memory based on GPU threads in the M first GPU thread groups, wherein the sorting module is specifically used for determining, according to N and the first target parallelism, the number of sorting rounds and the number of first GPU thread groups used in each sorting round, and M is the sum of the numbers of the first GPU thread groups used in all sorting rounds; in the i-th sorting round, each GPU thread in each first GPU thread group used selects 2^i candidate frames from the sorting results of the (i-1)-th sorting round and sorts them, wherein i is a positive integer;
the counting module is used for counting the sorted candidate frames based on GPU threads in K second GPU thread groups, wherein K is determined according to N and a second target parallelism;
and the determining module is used for grouping the sorted candidate frames according to the statistical result, calculating an Intersection over Union (IOU) matrix for each group of candidate frames respectively, and determining a target detection frame from the N candidate frames according to the IOU matrix.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program performs the steps of the method for optimizing a non-maximum suppression algorithm according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the optimization method of the non-maxima suppression algorithm according to any one of claims 1 to 7.
CN202211508128.XA 2022-11-29 2022-11-29 Optimization method, device and equipment of non-maximum suppression algorithm and storage medium Active CN115546009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211508128.XA CN115546009B (en) 2022-11-29 2022-11-29 Optimization method, device and equipment of non-maximum suppression algorithm and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211508128.XA CN115546009B (en) 2022-11-29 2022-11-29 Optimization method, device and equipment of non-maximum suppression algorithm and storage medium

Publications (2)

Publication Number Publication Date
CN115546009A (en) 2022-12-30
CN115546009B (en) 2023-02-03

Family

ID=84722460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211508128.XA Active CN115546009B (en) 2022-11-29 2022-11-29 Optimization method, device and equipment of non-maximum suppression algorithm and storage medium

Country Status (1)

Country Link
CN (1) CN115546009B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582226A (en) * 2020-05-19 2020-08-25 中国人民解放军国防科技大学 Candidate frame redundancy removing method in target detection
CN114693943A (en) * 2022-04-07 2022-07-01 国网四川省电力公司营销服务中心 Non-maximum suppression acceleration method, system and equipment for target detection
CN114764611A (en) * 2021-01-14 2022-07-19 辉达公司 Parallel execution of non-maxima suppression

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7089179B2 (en) * 2018-08-30 2022-06-22 Fujitsu Limited Image recognition device, image recognition method and image recognition program

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582226A (en) * 2020-05-19 2020-08-25 中国人民解放军国防科技大学 Candidate frame redundancy removing method in target detection
CN114764611A (en) * 2021-01-14 2022-07-19 辉达公司 Parallel execution of non-maxima suppression
CN114693943A (en) * 2022-04-07 2022-07-01 国网四川省电力公司营销服务中心 Non-maximum suppression acceleration method, system and equipment for target detection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Parallelization of Non-Maximum Suppression; HYEONJIN LEE et al.; IEEE; 2021-12-27; pp. 166579-166586 *
GPU-based fast image copy detection; Xie Hongtao et al.; Journal of Computer-Aided Design & Computer Graphics; 2010-09-30; vol. 22, no. 09; pp. 1483-1490 *

Also Published As

Publication number Publication date
CN115546009A (en) 2022-12-30

Similar Documents

Publication Publication Date Title
EP4036803A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
CN108491817A (en) A kind of event detection model training method, device and event detecting method
CN110598788B (en) Target detection method, target detection device, electronic equipment and storage medium
US11354238B2 (en) Method and device for determining memory size
CN109829371B (en) Face detection method and device
CN111428733A (en) Zero sample target detection method and system based on semantic feature space conversion
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
CN112085644A (en) Multi-column data sorting method and device, readable storage medium and electronic equipment
CN112906865A (en) Neural network architecture searching method and device, electronic equipment and storage medium
CN111814820A (en) Image processing method and device
CN114241388A (en) Video instance segmentation method and segmentation device based on space-time memory information
CN114168318A (en) Training method of storage release model, storage release method and equipment
US9858293B2 (en) Image processing apparatus and image processing method
US11500767B2 (en) Method and device for determining a global memory size of a global memory size for a neural network
CN113918744A (en) Similar image retrieval method, similar image retrieval device, storage medium and computer program product
CN115546009B (en) Optimization method, device and equipment of non-maximum suppression algorithm and storage medium
CN111079930A (en) Method and device for determining quality parameters of data set and electronic equipment
CN111242066A (en) Large-size image target detection method and device and computer readable storage medium
US20220343146A1 (en) Method and system for temporal graph neural network acceleration
CN111783655A (en) Image processing method and device, electronic equipment and storage medium
CN116578425B (en) Load balancing method and system based on rasterization
CN112214627A (en) Search method, readable storage medium and electronic device
CN111767204A (en) Overflow risk detection method, device and equipment
Li et al. Compact twice fusion network for edge detection
CN113014831B (en) Method, device and equipment for scene acquisition of sports video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant