CN114581983A - Detection frame processing method for target detection and related device
- Publication number: CN114581983A
- Application number: CN202210211664.7A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The application discloses a detection frame processing method for target detection, comprising the following steps: inputting an original face video stream, frame by frame, into a target detection model to obtain a plurality of initial detection frames and a confidence score corresponding to each initial detection frame; taking the initial detection frames whose confidence scores are greater than a confidence threshold as detection frames; determining an intersection-over-union (IoU) threshold based on the number of detection frames; and performing non-maximum suppression on the detection frames based on the IoU threshold and the face coordinate information of each detection frame to obtain the target detection frame. Repeated redundant detection frames are eliminated, the detection deviation caused by identical detection frames when an image at the same scale is detected repeatedly is avoided, and the accuracy of target detection is improved. The application also discloses a detection frame processing apparatus for target detection, a server, and a computer-readable storage medium, which share the above beneficial effects.
Description
Technical Field
The present application relates to the field of target recognition technology, and in particular to a detection frame processing method, a detection frame processing apparatus, a server, and a computer-readable storage medium for target detection.
Background
In a face recognition system, when a target data set is detected, the true bounding box of the target person needs to be located as accurately as possible. Generally, when a face image is input into a target detection model, a large number of candidate target frames are output. According to the detection branches of the model on the feature pyramid, anchors (detection frames) of different scales are generated at each pixel position of the original image; the number of target frames is determined by the number of anchors, and multiple overlapping frames of different scales land on the same target. A standard non-maximum suppression (NMS) algorithm, or one of its many variants, is usually adopted to filter out the repeated frames and obtain the true target frame.
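For reference, a minimal sketch of the standard NMS procedure just described (illustrative Python; the [x1, y1, x2, y2] box layout and the threshold value are assumptions, not taken from the patent):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in [x1, y1, x2, y2] form."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def standard_nms(boxes, scores, iou_thresh=0.5):
    """Repeatedly keep the highest-scoring box and drop neighbours above the threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep
```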
In the related art, to improve detection accuracy, it is common to scale an original image into images of different resolutions (multi-scale detection) and input them into the detection model in sequence. However, if the detection source is a video stream and the content of the video frames barely changes over a period of time, feeding the stream into the detection model repeatedly is approximately equivalent to inputting multiple identical detection images at the same scale. When the detection frames are then filtered and merged with the vote-NMS algorithm, the result can differ from the result of sending the original image a single time. That is, when an image at the same scale is detected repeatedly, multiple identical frames appear at the same position; these frames are added to the detection frame set and voted on together with the other detection frames, and the final output differs, causing deviation in the detection result and reducing the accuracy of target detection.
Therefore, how to improve the accuracy of the target detection frame is a key concern for those skilled in the art.
Disclosure of Invention
The application aims to provide a detection frame processing method, a detection frame processing device, a server and a computer readable storage medium for target detection, so as to improve the accuracy of a detection frame in a target detection process.
In order to solve the above technical problem, the present application provides a detection frame processing method for target detection, including:
inputting an original face video stream, frame by frame, into a target detection model to obtain a plurality of initial detection frames and a confidence score corresponding to each initial detection frame;
taking the initial detection frames whose confidence scores are greater than a confidence threshold as detection frames;
determining an intersection-over-union (IoU) threshold based on the number of the plurality of detection frames;
and performing non-maximum suppression on the plurality of detection frames based on the IoU threshold and the face coordinate information of each detection frame to obtain a target detection frame.
Optionally, determining the IoU threshold based on the number of the plurality of detection frames includes:
judging whether the number of the plurality of detection frames is greater than a number threshold;
if so, setting a first threshold as the IoU threshold;
if not, setting a second threshold as the IoU threshold, wherein the first threshold is greater than the second threshold.
Optionally, performing non-maximum suppression on the plurality of detection frames based on the IoU threshold and the face coordinate information of each detection frame to obtain a target detection frame includes:
calculating the IoU between the detection frame with the highest confidence score and each other detection frame to obtain the IoU values corresponding to the plurality of detection frames;
dividing the plurality of detection frames based on their IoU values into the detection frame with the highest confidence score, a plurality of adjacent detection frames whose IoU values are greater than the IoU threshold, and a plurality of remaining detection frames whose IoU values are less than or equal to the IoU threshold;
removing independent detection frames from the plurality of adjacent detection frames based on the face coordinate information of the detection frame with the highest confidence score and of the plurality of adjacent detection frames to obtain the adjacent detection frames after removal;
and performing non-maximum suppression on the detection frame with the highest confidence score and the plurality of adjacent detection frames based on the weight of each detection frame to obtain the target detection frame.
Optionally, dividing the plurality of detection frames based on their IoU values into the detection frame with the highest confidence score, a plurality of adjacent detection frames whose IoU values are greater than the IoU threshold, and a plurality of remaining detection frames whose IoU values are less than or equal to the IoU threshold includes:
taking the detection frames whose IoU values are greater than the IoU threshold, other than the detection frame with the highest confidence score, as the adjacent detection frames, and taking the rest as the remaining detection frames;
if the number of the adjacent detection frames is 1, proceeding directly to the subsequent steps;
if the number of the adjacent detection frames is greater than 1 and the IoU value of each adjacent detection frame is 1, deleting adjacent detection frames until the number is 1;
and when the number of the adjacent detection frames is 1 and the number of the remaining detection frames is 0, copying the adjacent detection frame as a remaining detection frame.
Optionally, removing independent detection frames from the plurality of adjacent detection frames based on the face coordinate information of the detection frame with the highest confidence score and of the plurality of adjacent detection frames to obtain the adjacent detection frames after removal includes:
determining the degree of difference between the face coordinate information of the detection frame with the highest confidence score and the face coordinate information of the plurality of adjacent detection frames;
and eliminating the adjacent detection frames whose degree of difference is greater than a difference threshold to obtain the remaining adjacent detection frames.
Optionally, the method further includes:
determining the degree of coincidence between the face coordinate information of the detection frame with the highest confidence score and the face coordinate information of the plurality of adjacent detection frames;
and reducing the weight of the adjacent detection frames whose degree of coincidence is greater than a coincidence threshold to obtain new detection frame weights.
Optionally, performing non-maximum suppression on the detection frame with the highest confidence score and the plurality of adjacent detection frames based on the weight of each detection frame to obtain the target detection frame includes:
assigning a weight to each adjacent detection frame based on its IoU value, its uncertainty, and the variance between the adjacent detection frames to obtain the weight of each adjacent detection frame;
and performing non-maximum suppression based on the weight of each adjacent detection frame to obtain the target detection frame.
The present application further provides a detection frame processing apparatus for target detection, including:
a video stream detection module, configured to input an original face video stream, frame by frame, into a target detection model to obtain a plurality of initial detection frames and a confidence score corresponding to each initial detection frame;
a detection frame filtering module, configured to take the initial detection frames whose confidence scores are greater than a confidence threshold as detection frames;
an IoU calculation module, configured to determine an IoU threshold based on the number of the plurality of detection frames;
and a non-maximum suppression processing module, configured to perform non-maximum suppression on the plurality of detection frames based on the IoU threshold and the face coordinate information of each detection frame to obtain a target detection frame.
The present application further provides a server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the detection frame processing method as described above when executing the computer program.
The present application also provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the detection frame processing method as described above.
The application provides a detection frame processing method for target detection, including: inputting an original face video stream, frame by frame, into a target detection model to obtain a plurality of initial detection frames and a confidence score corresponding to each initial detection frame; taking the initial detection frames whose confidence scores are greater than a confidence threshold as detection frames; determining an intersection-over-union (IoU) threshold based on the number of the plurality of detection frames; and performing non-maximum suppression on the plurality of detection frames based on the IoU threshold and the face coordinate information of each detection frame to obtain a target detection frame.
The IoU threshold is dynamically adjusted according to the number of initial detection frames, and non-maximum suppression is then performed based on the dynamically adjusted IoU threshold and the face coordinate information. Redundant detection frames are thereby eliminated, the detection deviation caused by identical detection frames when an image at the same scale is detected repeatedly is avoided, and the accuracy of target detection is improved.
The present application further provides a detection frame processing apparatus for target detection, a server, and a computer-readable storage medium, which have the above beneficial effects, and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or the prior-art description are briefly introduced below. Obviously, the drawings described below show only embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a detection frame processing method for target detection according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a detection frame processing apparatus for target detection according to an embodiment of the present disclosure.
Detailed Description
The core of the application is to provide a detection frame processing method, a detection frame processing device, a server and a computer readable storage medium for target detection, so as to improve the accuracy of a detection frame in a target detection process.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the related art, to improve detection accuracy, it is common to scale an original image into images of different resolutions (multi-scale detection) and input them into the detection model in sequence. However, if the detection source is a video stream and the content of the video frames barely changes over a period of time, feeding the stream into the detection model repeatedly is approximately equivalent to inputting multiple identical detection images at the same scale. When the detection frames are then filtered and merged with the vote-NMS algorithm, the result can differ from the result of sending the original image a single time. That is, when an image at the same scale is detected repeatedly, multiple identical frames appear at the same position; these frames are added to the detection frame set and voted on together with the other detection frames, and the final output differs, causing deviation in the detection result and reducing the accuracy of target detection.
Therefore, the present application provides a detection frame processing method for target detection that dynamically adjusts the IoU threshold according to the number of initial detection frames and then performs non-maximum suppression based on the dynamically adjusted IoU threshold and the face coordinate information, thereby eliminating repeated redundant detection frames, avoiding the detection deviation caused by identical detection frames when an image at the same scale is detected repeatedly, and improving the accuracy of target detection.
The following describes a detection frame processing method for target detection according to an embodiment.
Referring to fig. 1, fig. 1 is a flowchart of a detection frame processing method for target detection according to an embodiment of the present disclosure.
In this embodiment, the method may include:
s101, inputting an original face video stream into a target detection model by taking a frame as a unit to obtain a plurality of initial detection frames and a confidence score corresponding to each initial detection frame;
the method comprises the steps of inputting an original human face video stream into a target detection model by taking a frame as a unit to obtain a plurality of initial detection frames and a confidence score corresponding to each initial detection frame. That is, the original face video stream is intercepted according to frames and sequentially input into a target detection model, and RPNs (regional pro-social networks, Network generation by region selection) are used to generate anchors of different scales and confidence scores corresponding to the anchors. If the contents of a plurality of video frames are not changed basically in a period of time, the method is approximately equivalent to that the model input is a plurality of detection images with the same contents and the same scaling scale.
S102, taking the initial detection frames whose confidence scores are greater than the confidence threshold as detection frames;
On the basis of S101, this step takes the initial detection frames whose confidence scores are greater than the confidence threshold as detection frames. That is, the detection frames are screened, and those that do not meet the requirement are removed. Specifically, the anchors are sorted by confidence score from high to low, and the anchors whose confidence scores exceed the threshold are selected for the subsequent merging steps, for example as sketched below.
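A minimal sketch of this screening step (the confidence threshold value is an illustrative assumption):

```python
import numpy as np

def filter_by_confidence(boxes, scores, conf_thresh=0.5):
    """Keep only anchors whose confidence exceeds the threshold, sorted high to low."""
    keep = np.flatnonzero(scores > conf_thresh)
    order = keep[np.argsort(scores[keep])[::-1]]
    return boxes[order], scores[order]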
S103, determining an IoU threshold based on the number of the plurality of detection frames;
On the basis of S102, this step determines the IoU threshold based on the number of the plurality of detection frames. That is, the IoU threshold (the NMS threshold) in this embodiment is not fixed; it is dynamically adjusted according to the number of detection frames so as to improve detection accuracy.
Further, the step may include:
step 1, judging whether the number of the plurality of detection frames is greater than a number threshold;
step 2, if so, setting a first threshold as the IoU threshold;
step 3, if not, setting a second threshold as the IoU threshold, wherein the first threshold is greater than the second threshold.
This step mainly illustrates how the IoU threshold is determined from the number of detection frames, realizing a dynamically adjusted IoU (NMS) threshold. Existing NMS algorithms usually use a fixed constant: if the threshold is too low, detections are easily missed when the objects in the image are dense; if it is too high, redundant detection frames easily remain when the objects are sparse. Dynamically selecting the NMS threshold according to the number of detection frames therefore improves detection precision, as in the sketch below.
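A minimal sketch of this selection (the count threshold and the two IoU values are illustrative assumptions; the patent does not give specific numbers):

```python
def dynamic_iou_threshold(num_boxes, count_thresh=50, high=0.6, low=0.3):
    """Pick the NMS threshold from the number of candidate frames:
    dense scenes keep more neighbours (fewer missed detections),
    sparse scenes filter duplicates more aggressively."""
    return high if num_boxes > count_thresh else low
```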
S104, performing non-maximum suppression on the plurality of detection frames based on the IoU threshold and the face coordinate information of each detection frame to obtain the target detection frame.
On the basis of S103, this step performs non-maximum suppression on the plurality of detection frames based on the IoU threshold and the face coordinate information of each detection frame to obtain the target detection frame.
Whether the relation between an adjacent frame and the frame with the highest confidence score is redundant or independent is judged according to whether the face coordinate information corresponding to each detection frame is relatively independent. When the frame with the highest confidence score is compared with an adjacent frame, if the face coordinates of the two frames are relatively independent, the adjacent frame can be excluded from the subsequent frame-merging process even if the IoU of the two detection frames is high. If, instead, the face coordinate information of the two frames coincides to a high degree, the weight of the adjacent frame is reduced while it is still sent into the subsequent process to participate in the detection frame merging. That is, independent and redundant detection frames can be handled effectively based on the face coordinate information, which improves the precision of detection frame processing.
The face coordinate information may be the coordinate points of the facial features (landmarks), or more fine-grained coordinate point information.
Further, the step may include:
step 1, calculating the IoU between the detection frame with the highest confidence score and each other detection frame to obtain the IoU values corresponding to the plurality of detection frames;
step 2, dividing the plurality of detection frames based on their IoU values into the detection frame with the highest confidence score, a plurality of adjacent detection frames whose IoU values are greater than the IoU threshold, and a plurality of remaining detection frames whose IoU values are less than or equal to the IoU threshold;
step 3, removing independent detection frames from the plurality of adjacent detection frames based on the face coordinate information of the detection frame with the highest confidence score and of the plurality of adjacent detection frames to obtain the adjacent detection frames after removal;
step 4, performing non-maximum suppression on the detection frame with the highest confidence score and the plurality of adjacent detection frames based on the weight of each detection frame to obtain the target detection frame.
This alternative mainly illustrates how the non-maximum suppression processing is performed. The face coordinate information serves as auxiliary information for judging whether two adjacent detection frames correspond to the same detection object: if the face coordinates are relatively independent, the adjacent frame is taken to surround a different target object.
Further, step 2 in the above alternative may include:
step 201, taking the detection frames whose IoU values are greater than the IoU threshold, other than the detection frame with the highest confidence score, as the adjacent detection frames, and taking the rest as the remaining detection frames;
step 202, if the number of the adjacent detection frames is 1, proceeding directly to the subsequent steps;
step 203, if the number of the adjacent detection frames is greater than 1 and the IoU value of each adjacent detection frame is 1, deleting adjacent detection frames until the number is 1;
step 204, when the number of the adjacent detection frames is 1 and the number of the remaining detection frames is 0, copying the adjacent detection frame as a remaining detection frame.
This alternative mainly illustrates how the detection frames are divided. The detection frames are partitioned into the detection frame with the highest confidence score, the adjacent detection frames whose IoU values are greater than the IoU threshold, and the remaining detection frames whose IoU values are less than or equal to the IoU threshold, while at least one adjacent detection frame and one remaining detection frame are kept, as in the sketch below.
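A sketch of this division under one reading of steps 201-204 (the exact order of the duplicate handling is inferred from the text, not specified by it):

```python
import numpy as np

def partition_boxes(ious, iou_thresh):
    """Split candidate indices into neighbours (IoU above the threshold) and
    remaining frames; ious[i] is the IoU of frame i with the best frame,
    with the best frame itself already excluded."""
    neighbors = np.flatnonzero(ious > iou_thresh)
    remaining = np.flatnonzero(ious <= iou_thresh)
    # Several exact copies of the best frame (IoU == 1) collapse to one, so
    # repeated detections of an identical image cannot skew the later vote.
    if neighbors.size > 1 and np.all(ious[neighbors] == 1.0):
        neighbors = neighbors[:1]
    # If no frame falls below the threshold, duplicate the neighbour so the
    # weighting step still receives a non-empty remaining group.
    if neighbors.size == 1 and remaining.size == 0:
        remaining = neighbors.copy()
    return neighbors, remaining
```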
Further, step 3 in the above alternative may include:
step 301, determining the degree of difference between the face coordinate information of the detection frame with the highest confidence score and the face coordinate information of the plurality of adjacent detection frames;
step 302, eliminating the adjacent detection frames whose degree of difference is greater than a difference threshold to obtain the remaining adjacent detection frames.
This alternative mainly explains how detection frames are eliminated based on the face coordinate information. Merging mutually independent detection frames is avoided, noise in the detection frame processing is reduced, and the accuracy of target detection is improved. A sketch of such a landmark-based independence check follows.
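A sketch of the check, assuming five facial keypoints per frame and a scale-normalised mean keypoint distance as the "degree of difference" (the patent does not fix a specific metric or threshold):

```python
import numpy as np

def landmark_difference(lmk_a, lmk_b):
    """Mean distance between corresponding keypoints, normalised by face size.
    lmk_a, lmk_b: (5, 2) arrays of landmarks (eyes, nose tip, mouth corners)."""
    scale = np.linalg.norm(lmk_a.max(axis=0) - lmk_a.min(axis=0)) + 1e-6
    return float(np.mean(np.linalg.norm(lmk_a - lmk_b, axis=1)) / scale)

def is_independent(lmk_best, lmk_neighbor, diff_thresh=0.5):
    """Above the threshold, the two frames are taken to cover different faces,
    so the neighbour is excluded from the merge even if its IoU is high."""
    return landmark_difference(lmk_best, lmk_neighbor) > diff_thresh
```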
Further, the method may also include:
step 303, determining the degree of coincidence between the face coordinate information of the detection frame with the highest confidence score and the face coordinate information of the plurality of adjacent detection frames;
step 304, reducing the weight of the adjacent detection frames whose degree of coincidence is greater than a coincidence threshold to obtain new detection frame weights.
Building on the above alternative, this alternative mainly explains that the weight of highly coincident detection frames is reduced before they participate in the merge, for example as follows.
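A one-line sketch of the weight reduction (the coincidence threshold and the decay factor are illustrative assumptions):

```python
def reduce_weight(weight, coincidence, overlap_thresh=0.9, decay=0.5):
    """A neighbour whose landmarks nearly coincide with the best frame is very
    likely a duplicate, so its voting weight is decayed before merging."""
    return weight * decay if coincidence > overlap_thresh else weight
```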
Further, step 4 in the above alternative may include:
step 401, assigning a weight to each adjacent detection frame based on its IoU value, its uncertainty, and the variance between the adjacent detection frames to obtain the weight of each adjacent detection frame (see the sketch below);
step 402, performing non-maximum suppression based on the weight of each adjacent detection frame to obtain the target detection frame.
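One plausible reading of this weighting, sketched below: each neighbour's deviation from the group mean stands in for its "variance/uncertainty", so closer (higher-IoU), more consistent neighbours receive a larger share of the vote. The exact combination is not specified by the patent.

```python
import numpy as np

def vote_weights(ious, coords):
    """ious: (n,) IoU of each kept neighbour with the best frame;
    coords: (n, 4) neighbour boxes. Returns normalised voting weights."""
    deviation = np.linalg.norm(coords - coords.mean(axis=0), axis=1)
    w = ious / (1.0 + deviation)  # high IoU up-weights, high spread down-weights
    return w / w.sum()
```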
In summary, in this embodiment the IoU threshold is dynamically adjusted according to the number of initial detection frames, and non-maximum suppression is then performed based on the dynamically adjusted IoU threshold and the face coordinate information, thereby eliminating redundant detection frames, avoiding the detection deviation caused by identical detection frames when an image at the same scale is detected repeatedly, and improving the accuracy of target detection.
The following further describes a method for processing a detection frame for target detection according to another specific embodiment.
In this embodiment, the dynamically adjusted NMS threshold is applied first: a threshold on the number of anchors is preset, and when the number of anchors detected in the image exceeds it, the image targets are dense, so a high NMS threshold is selected to reduce the number of adjacent anchors filtered out and thereby reduce missed detections. If the number of detected anchors is below the threshold, a low NMS threshold is selected to filter out as many repeated detection frames as possible, keeping only the detection frame with the highest confidence at each position.
Then, whether an adjacent frame is a redundant duplicate of the frame with the highest confidence score or an independent detection frame is judged by whether the facial landmarks (the coordinate points of the facial features) corresponding to each target detection frame are relatively independent. When the frame with the highest confidence score is compared with an adjacent frame, if the facial features and contours described by the two sets of landmarks are relatively independent, the adjacent frame is regarded as an independent frame whose detection object differs from that of the highest-scoring frame. If the landmark coordinates of the two frames coincide to a high degree, the adjacent frame is regarded as redundant: its weight is reduced, and it is sent into the subsequent process to participate in the detection frame merging.
This resolves the inconsistency between the detection results obtained by vote-NMS filtering after repeated detection of the same picture and those obtained after a single detection. When the detection frames are merged, a check is performed first: if there are multiple adjacent anchors whose IoU with the selected highest-confidence anchor is 1, that is, whose coordinates coincide with it completely, the redundant fully coincident adjacent anchors are deleted and the number of adjacent anchors is reset to one, so that they do not distort the subsequent merge.
Specifically, the embodiment comprises the following steps:
Step 1: split the original face video stream into frames and input them into the target detection model in sequence, using a Region Proposal Network (RPN) to generate anchors of different scales and the confidence score corresponding to each anchor. If the content of the video frames barely changes over a period of time, this is approximately equivalent to the model receiving multiple detection images with identical content at the same scale.
Step 2: sort the anchors by confidence score from high to low, screen out the anchors whose confidence scores exceed a threshold, and pass the screened anchors on for merging.
Step 3: obtain the total number of anchors, with an anchor-count threshold preset. If the total number of anchors exceeds the threshold, the targets are densely distributed and the NMS threshold is raised to the high threshold A; conversely, when the targets are sparsely distributed, the NMS threshold is lowered to the low threshold B.
Step 4: find the anchor with the highest confidence score, calculate its IoU (intersection-over-union) with each of the other anchors, screen out the adjacent anchors whose IoU exceeds the preset NMS threshold, and record their indices (idx).
Step 5: separate the high-IoU adjacent anchors screened out in step 4 from the remaining anchors, dividing them into two groups.
Step 6: if there is only one adjacent anchor, proceed directly to the subsequent voting-weight assignment step with the adjacent-anchor group and the remaining-anchor group.
Step 7: if there are multiple adjacent anchors whose IoU with the selected highest-confidence anchor is 1, that is, whose coordinates coincide with it completely, delete the redundant fully coincident adjacent anchors and reset the number of adjacent anchors to one.
Step 8: under the condition of step 6 or step 7, if the number of anchors whose IoU is below the threshold is zero, copy the adjacent anchor as a remaining anchor so that both participate in the subsequent weight-assignment step.
Step 9: remove independent frames according to the facial landmark information, comparing in turn whether the landmarks of the highest-confidence anchor and of each adjacent anchor are relatively independent. Independent cases include, but are not limited to: the facial features of the two faces lie far apart; the contour sizes of the two faces differ greatly (one face in the foreground, the other in the distance); or the angles of the two faces differ greatly (one frontal, one in profile). In such cases, the adjacent anchor is stripped from the highest-confidence frame and extracted as a detection frame for the next round of the overall iteration, and it does not participate in the vote-NMS merge.
Step 10: if the landmarks of the two detection frames coincide to a very high degree (the opposite of the cases in the previous step), the adjacent anchor is most likely a redundant frame; assign it a lower weight and send it to the next step to participate in the detection frame merging calculation.
Step 11: let the frame with the highest confidence score and all adjacent anchors retained in the previous step participate in the vote-NMS together. Anchors that are closer (higher IoU) and less uncertain are assigned higher weights: calculate the variance of each adjacent anchor, and assign lower weights to adjacent anchors with high variance and to anchors with a smaller IoU with the selected anchor, forming the final detection set together.
Step 12: take a weighted average of all coordinates in the detection set from the previous step and compute the new coordinates as the first final target frame, then clear the anchors and confidence scores of that detection set.
Step 13: after one round, find the anchor with the highest confidence score among the remaining detection frames and repeat steps 3-12 to form the target frame of the next face, until almost no overlapping frames remain and the merging of all anchors is complete. A sketch of this outer loop follows.
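Tying these steps together, a sketch of the outer loop, reusing the iou, dynamic_iou_threshold, is_independent, and vote_weights sketches above; the grouping and weighting details are this sketch's assumptions, an illustrative reconstruction rather than the patent's reference implementation:

```python
import numpy as np

def vote_nms(boxes, scores, landmarks, count_thresh=50):
    """boxes: (n, 4); scores: (n,); landmarks: (n, 5, 2).
    Returns the merged target frames, one per detected face."""
    thresh = dynamic_iou_threshold(len(boxes), count_thresh)       # step 3
    results = []
    while len(boxes) > 0:
        best = int(np.argmax(scores))                              # step 4
        ious = iou(boxes[best], boxes)
        ious[best] = 0.0                                           # exclude self
        neighbors = [i for i in np.flatnonzero(ious > thresh)      # steps 5-8
                     if not is_independent(landmarks[best], landmarks[i])]  # step 9
        group = np.array([best] + neighbors)
        g_ious = ious[group]
        g_ious[0] = 1.0                    # the best frame votes with itself
        w = vote_weights(g_ious, boxes[group])                     # steps 10-11
        results.append((boxes[group] * w[:, None]).sum(axis=0))    # step 12
        keep = np.ones(len(boxes), dtype=bool)                     # step 13
        keep[group] = False
        boxes, scores, landmarks = boxes[keep], scores[keep], landmarks[keep]
    return np.array(results)
```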
In this way, the embodiment dynamically adjusts the IoU threshold according to the number of initial detection frames and then performs non-maximum suppression based on the dynamically adjusted IoU threshold and the face coordinate information, thereby eliminating redundant detection frames, avoiding the detection deviation caused by identical detection frames when an image at the same scale is detected repeatedly, and improving the accuracy of target detection.
In the following, the detection frame processing apparatus for target detection provided in the embodiments of the present application is introduced; the apparatus described below and the method described above may be referred to in correspondence with each other.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a detection frame processing apparatus for target detection according to an embodiment of the present disclosure.
In this embodiment, the apparatus may include:
the video stream detection module 100 is configured to input an original face video stream into a target detection model by taking a frame as a unit, so as to obtain a plurality of initial detection frames and a confidence score corresponding to each initial detection frame;
a detection frame filtering module 200, configured to use an initial detection frame with a confidence score greater than a confidence threshold as a detection frame;
a cross-over ratio calculation module 300 for determining a cross-over ratio threshold based on the number of the plurality of detection boxes;
and a non-maximum suppression processing module 400, configured to perform non-maximum suppression processing on the multiple detection frames based on the intersection ratio threshold and the face coordinate information of each detection frame, so as to obtain a target detection frame.
An embodiment of the present application further provides a server, including:
a memory for storing a computer program;
a processor for implementing the steps of the detection frame processing method according to the above embodiment when executing the computer program.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the detection frame processing method according to the above embodiment are implemented.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The detailed description has been given above on a detection frame processing method, a detection frame processing apparatus, a server, and a computer-readable storage medium for object detection provided by the present application. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
Claims (10)
1. A detection frame processing method for target detection, characterized by comprising the following steps:
inputting an original face video stream, frame by frame, into a target detection model to obtain a plurality of initial detection frames and a confidence score corresponding to each initial detection frame;
taking the initial detection frames whose confidence scores are greater than a confidence threshold as detection frames;
determining an intersection-over-union (IoU) threshold based on the number of the plurality of detection frames;
and performing non-maximum suppression on the plurality of detection frames based on the IoU threshold and the face coordinate information of each detection frame to obtain a target detection frame.
2. The detection frame processing method according to claim 1, wherein determining the IoU threshold based on the number of the plurality of detection frames comprises:
judging whether the number of the plurality of detection frames is greater than a number threshold;
if so, setting a first threshold as the IoU threshold;
if not, setting a second threshold as the IoU threshold, wherein the first threshold is greater than the second threshold.
3. The detection frame processing method according to claim 1, wherein performing non-maximum suppression on the plurality of detection frames based on the IoU threshold and the face coordinate information of each detection frame to obtain a target detection frame comprises:
calculating the IoU between the detection frame with the highest confidence score and each other detection frame to obtain the IoU values corresponding to the plurality of detection frames;
dividing the plurality of detection frames based on their IoU values into the detection frame with the highest confidence score, a plurality of adjacent detection frames whose IoU values are greater than the IoU threshold, and a plurality of remaining detection frames whose IoU values are less than or equal to the IoU threshold;
removing independent detection frames from the plurality of adjacent detection frames based on the face coordinate information of the detection frame with the highest confidence score and of the plurality of adjacent detection frames to obtain the adjacent detection frames after removal;
and performing non-maximum suppression on the detection frame with the highest confidence score and the plurality of adjacent detection frames based on the weight of each detection frame to obtain the target detection frame.
4. The detection frame processing method according to claim 3, wherein dividing the plurality of detection frames based on their IoU values into the detection frame with the highest confidence score, a plurality of adjacent detection frames whose IoU values are greater than the IoU threshold, and a plurality of remaining detection frames whose IoU values are less than or equal to the IoU threshold comprises:
taking the detection frames whose IoU values are greater than the IoU threshold, other than the detection frame with the highest confidence score, as the adjacent detection frames, and taking the rest as the remaining detection frames;
if the number of the adjacent detection frames is 1, proceeding directly to the subsequent steps;
if the number of the adjacent detection frames is greater than 1 and the IoU value of each adjacent detection frame is 1, deleting adjacent detection frames until the number is 1;
and when the number of the adjacent detection frames is 1 and the number of the remaining detection frames is 0, copying the adjacent detection frame as a remaining detection frame.
5. The detection frame processing method according to claim 3, wherein removing independent detection frames from the plurality of adjacent detection frames based on the face coordinate information of the detection frame with the highest confidence score and of the plurality of adjacent detection frames to obtain the adjacent detection frames after removal comprises:
determining the degree of difference between the face coordinate information of the detection frame with the highest confidence score and the face coordinate information of the plurality of adjacent detection frames;
and eliminating the adjacent detection frames whose degree of difference is greater than a difference threshold to obtain the remaining adjacent detection frames.
6. The detection frame processing method according to claim 5, further comprising:
determining the degree of coincidence between the face coordinate information of the detection frame with the highest confidence score and the face coordinate information of the plurality of adjacent detection frames;
and reducing the weight of the adjacent detection frames whose degree of coincidence is greater than a coincidence threshold to obtain new detection frame weights.
7. The detection frame processing method according to claim 3, wherein performing non-maximum suppression on the detection frame with the highest confidence score and the plurality of adjacent detection frames based on the weight of each detection frame to obtain the target detection frame comprises:
assigning a weight to each adjacent detection frame based on its IoU value, its uncertainty, and the variance between the adjacent detection frames to obtain the weight of each adjacent detection frame;
and performing non-maximum suppression based on the weight of each adjacent detection frame to obtain the target detection frame.
8. A detection frame processing apparatus for target detection, characterized by comprising:
a video stream detection module, configured to input an original face video stream, frame by frame, into a target detection model to obtain a plurality of initial detection frames and a confidence score corresponding to each initial detection frame;
a detection frame filtering module, configured to take the initial detection frames whose confidence scores are greater than a confidence threshold as detection frames;
an IoU calculation module, configured to determine an IoU threshold based on the number of the plurality of detection frames;
and a non-maximum suppression processing module, configured to perform non-maximum suppression on the plurality of detection frames based on the IoU threshold and the face coordinate information of each detection frame to obtain a target detection frame.
9. A server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the detection frame processing method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon which, when executed by a processor, implements the steps of the detection frame processing method according to any one of claims 1 to 7.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210211664.7A | 2022-03-04 | 2022-03-04 | Detection frame processing method for target detection and related device |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN114581983A (en) | 2022-06-03 |
Family ID: 81773439

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210211664.7A | Detection frame processing method for target detection and related device | 2022-03-04 | 2022-03-04 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114581983A (en) |
Cited By (6)

| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN115082661A | 2022-07-11 | 2022-09-20 | Method for reducing assembly difficulty of sensor |
| CN115082661B | 2022-07-11 | 2024-05-10 | Sensor assembly difficulty reducing method |
| CN116486312A | 2023-06-21 | 2023-07-25 | Video image processing method and device, electronic equipment and storage medium |
| CN116486312B | 2023-06-21 | 2023-09-08 | Video image processing method and device, electronic equipment and storage medium |
| CN116596990A | 2023-07-13 | 2023-08-15 | Target detection method, device, equipment and storage medium |
| CN116596990B | 2023-07-13 | 2023-09-29 | Target detection method, device, equipment and storage medium |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |