CN110532985B - Target detection method, device and system

Target detection method, device and system

Info

Publication number
CN110532985B
Authority
CN
China
Prior art keywords
target object
whole-body detection frame
detection frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910830745.3A
Other languages
Chinese (zh)
Other versions
CN110532985A (en)
Inventor
刘竞爽 (Liu Jingshuang)
王建雄 (Wang Jianxiong)
鲍一平 (Bao Yiping)
俞刚 (Yu Gang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd
Priority to CN201910830745.3A
Publication of CN110532985A
Application granted
Publication of CN110532985B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V20/00 Scenes; Scene-specific elements
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The invention provides a target detection method, device, and system in the technical field of artificial intelligence. The method comprises: obtaining an image to be detected containing a target object; performing target detection on the image to be detected through at least two preset anchor point frames to generate at least two initial detection results of the target object, where each anchor point frame generates a corresponding initial detection result and the at least two anchor point frames comprise a first anchor point frame set based on local features of the category to which the target object belongs and a second anchor point frame set based on whole-body features of that category; and fusing the at least two initial detection results to obtain a final detection result of the target object. The method can effectively improve the accuracy of the target detection result.

Description

Target detection method, device and system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a target detection method, a target detection device and a target detection system.
Background
Target detection is an important branch of the field of computer vision; its main purpose is to detect objects of preset categories (such as pedestrians, vehicles, and cats) in an image, for example, detecting the position of a pedestrian in an image based on the pedestrian's whole-body features.
However, the inventors have found through research that in existing target detection technology, a target may be difficult to detect when the target to be detected is occluded, resulting in low accuracy of the detection result. For ease of understanding, take pedestrian detection as an example: part of a pedestrian's body may be blocked by other objects (such as tables and chairs, shopping carts, and the like), and because the features of that part of the body are missing, pedestrian detection technology has difficulty detecting the pedestrian, which affects the accuracy of the pedestrian detection result.
Disclosure of Invention
In view of the above, the present invention provides a method, an apparatus and a system for detecting a target, which can effectively improve the accuracy of a target detection result.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:
in a first aspect, an embodiment of the present invention provides a target detection method, including: acquiring an image to be detected containing a target object; performing target detection on the image to be detected through at least two preset anchor point frames to generate at least two initial detection results of the target object; each anchor point frame correspondingly generates an initial detection result, and the at least two anchor point frames comprise a first anchor point frame set based on the local characteristics of the category to which the target object belongs and a second anchor point frame set based on the whole-body characteristics of the category to which the target object belongs; and fusing the at least two initial detection results to obtain a final detection result of the target object.
Further, the step of performing target detection on the image to be detected through at least two preset anchor boxes to generate at least two initial detection results of the target object includes: performing target detection on the image to be detected based on the first anchor point frame to generate a first initial detection result of the target object; the first initial detection result comprises a local detection frame of the target object and a first whole-body detection frame of the target object; the local detection frame is positioned in the first whole-body detection frame; performing target detection on the image to be detected based on the second anchor point frame to generate a second initial detection result of the target object; the second initial detection result comprises a second whole-body detection frame of the target object.
Further, the step of fusing the at least two initial detection results to obtain a final detection result of the target object includes: fusing the first whole-body detection frame and the second whole-body detection frame of the target object to obtain a final whole-body detection frame of the target object.
Further, the step of fusing the first whole-body detection frame and the second whole-body detection frame of the target object to obtain a final whole-body detection frame of the target object includes: merging the first whole-body detection frame and the second whole-body detection frame of the target object based on an intersection-over-union algorithm to obtain the final whole-body detection frame of the target object.
Further, the step of merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection-over-union algorithm to obtain a final whole-body detection frame of the target object includes: setting a cutting line based on the position of the first whole-body detection frame of the target object, wherein the cutting line cuts the first whole-body detection frame of the target object into a first upper region and a first lower region, the area of the first upper region is not greater than that of the first lower region, the first upper region includes the local detection frame, and the cutting line cuts the second whole-body detection frame of the target object into a second upper region and a second lower region; calculating an intersection ratio of the first upper region and the second upper region; and merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection ratio to obtain the final whole-body detection frame of the target object.
Further, the number of the target objects is one; the step of merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection ratio to obtain a final whole-body detection frame of the target object includes: if the intersection ratio is greater than a preset intersection ratio threshold, deleting the first whole-body detection frame and taking the second whole-body detection frame as the final whole-body detection frame of the target object.
Further, the number of the target objects is multiple; the step of calculating the intersection ratio of the first upper region and the second upper region includes: calculating the intersection ratio between each first upper region of the plurality of target objects and each second upper region of the plurality of target objects; the step of merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection ratio to obtain a final whole-body detection frame of the target object includes: determining a first upper region and a second upper region belonging to the same target object based on the calculated intersection ratios and a preset intersection ratio threshold, wherein the intersection ratio of a first upper region and a second upper region belonging to the same target object is greater than the intersection ratio of a first upper region and a second upper region belonging to different target objects and is also greater than the intersection ratio threshold; and, for the same target object, deleting the first whole-body detection frame to which its first upper region belongs and taking the second whole-body detection frame to which its second upper region belongs as the final whole-body detection frame of the target object.
Further, the number of the target objects is at least one, and there are a plurality of first whole-body detection frames and a plurality of second whole-body detection frames; before the step of merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection-over-union algorithm, the method further includes: performing confidence-threshold filtering on the plurality of first whole-body detection frames and the plurality of second whole-body detection frames to obtain filtered first whole-body detection frames and filtered second whole-body detection frames; and processing the filtered first whole-body detection frames and the filtered second whole-body detection frames with a non-maximum suppression algorithm to obtain the first whole-body detection frame and second whole-body detection frame corresponding to each target object.
Further, the step of performing confidence-threshold filtering on the plurality of first whole-body detection frames and the plurality of second whole-body detection frames to obtain filtered first whole-body detection frames and filtered second whole-body detection frames includes: judging, for each first whole-body detection frame, whether its confidence is lower than a preset first confidence threshold, the first confidence threshold being set based on the local features of the category to which the target object belongs; judging, for each second whole-body detection frame, whether its confidence is lower than a preset second confidence threshold, the second confidence threshold being set based on the whole-body features of the category to which the target object belongs; and filtering out the first whole-body detection frames below the first confidence threshold and the second whole-body detection frames below the second confidence threshold to obtain the filtered first whole-body detection frames and the filtered second whole-body detection frames.
In a second aspect, an embodiment of the present invention further provides an object detection apparatus, including: the image acquisition module is used for acquiring an image to be detected containing a target object; the target detection module is used for carrying out target detection on the image to be detected through at least two preset anchor point frames to generate at least two initial detection results of the target object; each anchor point frame correspondingly generates an initial detection result, and the at least two anchor point frames comprise a first anchor point frame set based on the local characteristics of the category to which the target object belongs and a second anchor point frame set based on the whole body characteristics of the category to which the target object belongs; and the fusion module is used for fusing the at least two initial detection results to obtain a final detection result of the target object.
In a third aspect, an embodiment of the present invention provides a target detection system, where the system includes: the device comprises an image acquisition device, a processor and a storage device; the image acquisition device is used for acquiring an image to be detected; the storage means has stored thereon a computer program which, when executed by the processor, performs the method of any of the first aspects.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the method according to any one of the above first aspects.
The embodiments of the invention provide a target detection method, device, and system, which perform target detection on an acquired image to be detected through at least two preset anchor point frames to generate at least two initial detection results of the target object, and then fuse the at least two initial detection results to obtain a final detection result of the target object. Each anchor point frame generates a corresponding initial detection result, and the at least two anchor point frames comprise a first anchor point frame set based on the local features of the category to which the target object belongs and a second anchor point frame set based on the whole-body features of that category. In this way, target detection can be performed on the image to be detected based on at least two anchor point frames set on different features (local features and whole-body features) of the category to which the target object belongs to obtain different initial detection results, and the different initial detection results are fused into a final detection result, effectively mitigating the impact on the detection result of feature loss caused by partial occlusion of the target object and improving the accuracy of the target detection result.
Additional features and advantages of embodiments of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of embodiments of the invention as set forth above.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
Fig. 2 is a flow chart of a target detection method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a detection network based on a dual anchor frame according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of pedestrian detection frames according to an embodiment of the present invention;
Fig. 5 is a block diagram of a target detection apparatus according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, not all, embodiments of the present invention. In view of the problem of low accuracy of a target detection result caused by the fact that a target to be detected is blocked in the existing target detection technology, embodiments of the present invention provide a target detection method, apparatus, and system to improve the problem. The following describes embodiments of the present invention in detail.
Example one:
First, an exemplary electronic device 100 for implementing the target detection method, apparatus, and system according to embodiments of the present invention is described with reference to fig. 1.
As shown in fig. 1, an electronic device 100 includes one or more processors 102, one or more memory devices 104, an input device 106, an output device 108, and an image capture device 110, which are interconnected via a bus system 112 and/or other type of connection mechanism (not shown). It should be noted that the components and configuration of the electronic device 100 shown in FIG. 1 are exemplary only, and not limiting, and that the electronic device may have other components and configurations as desired.
The processor 102 may be implemented in at least one hardware form of a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA). The processor 102 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), another processing unit with data processing capability and/or instruction execution capability, or a combination thereof, and may control other components in the electronic device 100 to perform desired functions.
The storage 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, and flash memory. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the client functionality and/or other desired functionality in the embodiments of the invention described below. Various applications and data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
The image capture device 110 may take images (e.g., photographs, videos, etc.) desired by the user and store the taken images in the storage device 104 for use by other components.
Exemplary electronic devices for implementing the object detection method, apparatus and system according to embodiments of the present invention may be implemented as smart terminals such as smart phones, tablet computers, smart cameras, and the like.
Example two:
Referring to the flowchart of the target detection method shown in fig. 2, the method may be executed by the electronic device provided in the foregoing embodiment and mainly includes the following steps S202 to S206:
step S202, an image to be detected including the target object is acquired. In practical application, an original image containing a target object and shot by an image acquisition device such as a camera can be used as an image to be detected, and an image containing the target object and downloaded by a network, locally stored or manually uploaded can be used as an image to be detected. The image to be detected may include at least one target object, which may be a person or a vehicle, etc.
Step S204, performing target detection on the image to be detected through at least two preset anchor point frames to generate at least two initial detection results of the target object. The anchor box may be specifically understood as a bounding box represented by two parameters, namely, area (scale) and aspect ratio (aspect), and the parameters of the anchor box may be set according to actual detection requirements, and anchor boxes with different parameters may be regarded as different types of anchor boxes. The target detection can be carried out on the image to be detected based on at least two anchor boxes through a detection network, and each anchor box correspondingly generates an initial detection result; the initial detection result may include a position of a target object included in the image to be detected. The at least two anchor frames include a first anchor frame set based on the local feature of the category to which the target object belongs and a second anchor frame set based on the global feature of the category to which the target object belongs.
For ease of understanding, descriptions are given taking the category of the target object as a pedestrian and as a vehicle, respectively. If the category of the target object is pedestrian, the local feature of the pedestrian can be a face feature, and the first anchor point frame set based on the face feature can be an anchor point frame set using certain preset face key points as reference points; the whole-body feature of the pedestrian covers the pedestrian from head to foot, and the second anchor point frame set based on the whole-body feature of the pedestrian can be an anchor point frame set using certain preset whole-body key points as reference points. If the category of the target object is vehicle, the local feature of the vehicle can be a license plate feature, and the first anchor point frame set based on the license plate feature can be an anchor point frame set using the license plate center point and license plate corner points as reference points; the whole-body feature of the vehicle covers the entire vehicle, and the second anchor point frame set based on the whole-body feature of the vehicle can be an anchor point frame set using preset key points of the main parts of the vehicle as reference points.
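For illustration only, the following Python sketch shows how anchor point frames parameterized by area (scale) and aspect ratio, as described above, could be generated around a given center point; the function name and the concrete scale and aspect-ratio values are assumptions for illustration, not parameters disclosed by the patent.

```python
import math

def make_anchors(cx, cy, scales, aspect_ratios):
    """Generate anchor boxes as (x1, y1, x2, y2) tuples centered at (cx, cy).

    Each anchor is parameterized by an area `s` (scale) and an aspect
    ratio `r` = width / height, so width = sqrt(s * r) and
    height = sqrt(s / r), which gives width * height = s.
    """
    anchors = []
    for s in scales:
        for r in aspect_ratios:
            w = math.sqrt(s * r)
            h = math.sqrt(s / r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Hypothetical settings: near-square anchors for the face (local feature),
# tall narrow anchors for the whole pedestrian (whole-body feature).
first_kind = make_anchors(64, 64, scales=[32 * 32], aspect_ratios=[1.0])
second_kind = make_anchors(64, 128, scales=[64 * 160], aspect_ratios=[0.41])
```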
Step S206, fusing at least two initial detection results to obtain a final detection result of the target object.
The initial detection results may include a first initial detection result generated from the first kind of anchor point frame and a second initial detection result generated from the second kind of anchor point frame. In a real scene, part of a target object's body is often blocked: a person in the back row of a group photo may expose only the face while the body is blocked, and part of a vehicle body may be blocked by trees, billboards, and the like on the road. In addition, since the two kinds of anchor point frames are set based on different features, the accuracy of the two initial detection results differs. Referring to the pedestrian example above, the first initial detection result may detect the face more accurately, while the second initial detection result may detect the pedestrian's whole body more accurately; however, once part of the pedestrian's body is blocked, the second initial detection result may fail to detect the pedestrian at all. Therefore, in this embodiment, the first initial detection result and the second initial detection result are fused, so that a more accurate final detection result of the target object can be obtained.
The target detection method provided by the embodiment of the invention performs target detection on the image to be detected based on at least two anchor point frames set on different features (local features and whole-body features) of the category to which the target object belongs to obtain different initial detection results, and fuses the different initial detection results to obtain the final detection result, effectively mitigating the impact on the detection result of feature loss caused by partial occlusion of the target object and improving the accuracy of the target detection result.
In practical applications, this embodiment provides a detection network (also referred to as a detection model) based on a dual anchor frame, as shown in fig. 3, which implements the above step of performing target detection on the image to be detected through at least two preset anchor point frames to generate at least two initial detection results of the target object. This detection network takes a pedestrian as the target object. As shown in fig. 3, the network includes a backbone network and a plurality of network branches connected to it. The backbone network can be implemented with a Feature Pyramid Network (FPN); the number of network branches corresponds to the number of feature-map scales output by the feature pyramid. The input of the backbone network is the image to be detected, and its output is the feature maps of the image to be detected. Each network branch comprises a face regression main branch, a pedestrian regression auxiliary branch, and a pedestrian regression main branch. The input of the face regression main branch is a feature map of the image to be detected and the first anchor point frame, and its output is a face detection frame. The input of the pedestrian regression auxiliary branch is a feature map of the image to be detected and the first anchor point frame, and its output is a first pedestrian detection frame. It can be understood that the face detection frame and the first pedestrian detection frame are both obtained by regression from the first anchor point frame, so the two have a corresponding (binding) relationship; in fact, the first pedestrian detection frame contains the face detection frame. The input of the pedestrian regression main branch is a feature map of the image to be detected and the second anchor point frame, and its output is a second pedestrian detection frame. In practical applications, even if the image to be detected contains only one target object, there may be multiple first pedestrian detection frames and multiple second pedestrian detection frames, such as multiple pedestrian detection frames with different confidences generated for the same target object.
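As a rough illustration of the branch structure just described, the following is a minimal PyTorch-style sketch of one network branch with three sibling regression heads; the channel counts, the 1x1 convolution heads, and the output encoding (4 box offsets plus 1 confidence per anchor) are all assumptions for illustration, not the patented architecture.

```python
import torch
import torch.nn as nn

class DualAnchorBranch(nn.Module):
    """One network branch: three sibling regression heads that share a
    backbone feature map. Shapes here are illustrative assumptions."""

    def __init__(self, in_channels=256, num_anchors=1):
        super().__init__()
        out = num_anchors * 5  # 4 box offsets + 1 confidence per anchor
        # face box, regressed from the first kind of anchor point frame
        self.face_main = nn.Conv2d(in_channels, out, kernel_size=1)
        # first pedestrian box, also regressed from the first kind of anchor
        self.pedestrian_aux = nn.Conv2d(in_channels, out, kernel_size=1)
        # second pedestrian box, regressed from the second kind of anchor
        self.pedestrian_main = nn.Conv2d(in_channels, out, kernel_size=1)

    def forward(self, feature_map):
        return (self.face_main(feature_map),
                self.pedestrian_aux(feature_map),
                self.pedestrian_main(feature_map))

# One such branch would be attached to each scale of the FPN backbone output.
branch = DualAnchorBranch()
feature_map = torch.randn(1, 256, 32, 32)  # dummy FPN feature map
face, first_pedestrian, second_pedestrian = branch(feature_map)
```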
In combination with the above detection network based on the dual anchor frame, the manner of performing target detection on an image to be detected to generate an initial detection result of a target object may refer to the following steps (1) and (2):
(1) Performing target detection on an image to be detected based on the first anchor point frame to generate a first initial detection result of a target object; the first initial detection result comprises a local detection frame of the target object and a first whole-body detection frame of the target object; the local detection frame is positioned in the first whole-body detection frame.
The face regression main branch performs target detection on the feature map of the image to be detected based on the first anchor point frame to obtain the local detection frame of the target object, which corresponds to the face detection frame in fig. 3. The pedestrian regression auxiliary branch performs target detection on the feature map of the image to be detected based on the first anchor point frame to obtain the first whole-body detection frame of the target object, which corresponds to the first pedestrian detection frame in fig. 3. In practical application, the pedestrian label of the first pedestrian detection frame can be set according to the face detection frame. It can be understood that if a face is detected, a pedestrian corresponding to the face necessarily exists; but if a pedestrian exists, the pedestrian's face does not necessarily exist in the image (for example, only a shadow of the pedestrian is detected), and false detections may also occur, such as identifying a mannequin as a pedestrian. A pedestrian label is therefore set for the first pedestrian detection frame according to the following three cases:
Case one: if a first pedestrian detection frame exists and corresponds to a face detection frame, the pedestrian label of the first pedestrian detection frame can be set to 1, indicating that there is a pedestrian. Case two: if a first pedestrian detection frame exists without a corresponding face detection frame, its pedestrian label can be set to -1, indicating that the pedestrian is ignored. Case three: if the first pedestrian detection frame contains no pedestrian, its pedestrian label can be set to 0, indicating that there is no pedestrian. In practical applications, processing may be performed only on first pedestrian detection frames with a pedestrian label of 1.
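The three cases can be summarized in a small decision function; below is a sketch under the assumption that labels are encoded exactly as the integers 1, -1, and 0 above (the function name and its boolean inputs are hypothetical).

```python
def pedestrian_label(contains_pedestrian, has_matching_face_box):
    """Encode the three labeling cases for a first pedestrian detection frame."""
    if contains_pedestrian and has_matching_face_box:
        return 1    # case one: pedestrian present, bound to a face box
    if contains_pedestrian:
        return -1   # case two: pedestrian box without a face box -> ignored
    return 0        # case three: the frame contains no pedestrian

assert pedestrian_label(True, True) == 1
assert pedestrian_label(True, False) == -1
assert pedestrian_label(False, False) == 0
```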
(2) Performing target detection on the image to be detected based on the second anchor point frame to generate a second initial detection result of the target object; the second initial detection result includes a second whole-body detection frame of the target object.
The pedestrian regression main branch performs target detection on the feature map of the image to be detected based on the second anchor point frame to obtain the second whole-body detection frame of the target object, which corresponds to the second pedestrian detection frame in fig. 3.
After obtaining the first initial detection result and the second initial detection result in the above manner, the step of fusing the at least two initial detection results to obtain the final detection result of the target object may include: fusing the first whole-body detection frame and the second whole-body detection frame of the target object to obtain the final whole-body detection frame of the target object.
In addition, consider that when the image to be detected includes at least one target object and a plurality of first whole-body detection frames and a plurality of second whole-body detection frames are generated, fusing them directly is difficult. To address this, in this embodiment the first whole-body detection frames and second whole-body detection frames may first be filtered and processed according to the following steps 1 and 2, and then fused.
Step 1, performing confidence threshold filtering on the plurality of first whole-body detection frames and the plurality of second whole-body detection frames to obtain filtered first whole-body detection frames and filtered second whole-body detection frames.
In target detection, multiple detection frames with slightly offset positions are generated for each target object, and each detection frame has a confidence score; redundant detection frames can be removed based on the confidence scores, keeping the frames with high scores. In a specific implementation, it is first judged, for each first whole-body detection frame, whether its confidence is lower than a preset first confidence threshold; the first confidence threshold is a confidence threshold set based on the local features of the category to which the target object belongs.
Secondly, respectively judging whether the confidence of each second whole body detection frame is lower than a preset second confidence threshold; wherein the second confidence threshold is a confidence threshold set based on the whole-body feature of the category to which the target object belongs.
The first confidence threshold is set mainly based on local features and the second confidence threshold mainly based on whole-body features, so the two settings screen the whole-body detection frames from different angles. In practical application, the first and second confidence thresholds may differ, for example, the first confidence threshold set to 0.3 and the second confidence threshold set to 0.6.
Finally, the first whole-body detection frames below the first confidence threshold and the second whole-body detection frames below the second confidence threshold are filtered out to obtain the filtered first whole-body detection frames and the filtered second whole-body detection frames.
In practical application, the first whole-body detection frame may be directly filtered based on the first confidence threshold, or the local detection frame lower than the first confidence threshold may be filtered to obtain the filtered local detection frame; and then taking the first whole-body detection frame corresponding to the filtered local detection frame as the filtered first whole-body detection frame.
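A minimal sketch of the confidence-threshold filtering in step 1, using the example thresholds of 0.3 and 0.6 mentioned above; the box representation as (x1, y1, x2, y2, confidence) tuples and the sample values are assumptions for illustration.

```python
# Hypothetical box representation: (x1, y1, x2, y2, confidence).
FIRST_CONF_THRESHOLD = 0.3   # example value for the local-feature branch
SECOND_CONF_THRESHOLD = 0.6  # example value for the whole-body branch

def filter_by_confidence(boxes, threshold):
    """Keep only detection frames whose confidence is not below the threshold."""
    return [box for box in boxes if box[4] >= threshold]

first_body_boxes = [(10, 10, 60, 160, 0.25), (12, 8, 58, 150, 0.80)]
second_body_boxes = [(11, 9, 59, 158, 0.55), (13, 11, 61, 162, 0.90)]
kept_first = filter_by_confidence(first_body_boxes, FIRST_CONF_THRESHOLD)    # 1 box left
kept_second = filter_by_confidence(second_body_boxes, SECOND_CONF_THRESHOLD) # 1 box left
```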
Step 2, processing the filtered first whole-body detection frames and the filtered second whole-body detection frames with a Non-Maximum Suppression (NMS) algorithm to obtain the first whole-body detection frame and the second whole-body detection frame corresponding to each target object.
The non-maximum suppression algorithm suppresses elements that are not maxima; its purpose here is to remove redundant detection frames and keep the best ones. For example, for the filtered first whole-body detection frames, the first whole-body detection frame A with the highest confidence may be selected first; then the overlap between frame A and each other first whole-body detection frame is calculated, and the first whole-body detection frames whose overlap with frame A exceeds a preset overlap threshold are filtered out, finally yielding one first whole-body detection frame for each target object.
As for the processing method of the filtered second whole body detection frames, one second whole body detection frame corresponding to each target object can be obtained by referring to the processing method of the filtered first whole body detection frame, and a description thereof is not repeated.
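For reference, a minimal greedy non-maximum suppression sketch over the same (x1, y1, x2, y2, confidence) tuples; the overlap threshold value is an assumption, and measuring overlap with intersection-over-union is one common choice rather than a requirement of the patent.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, overlap_threshold=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2, confidence) boxes."""
    remaining = sorted(boxes, key=lambda box: box[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-confidence frame is kept
        kept.append(best)
        # drop frames whose overlap with the kept frame exceeds the threshold
        remaining = [box for box in remaining
                     if iou(best[:4], box[:4]) <= overlap_threshold]
    return kept
```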
Further, in a specific implementation manner adopted for fusing the first whole body detection frame and the second whole body detection frame of the target object, the first whole body detection frame and the second whole body detection frame of the target object may be merged based on an Intersection-over-Union (IoU) algorithm to obtain a final whole body detection frame of the target object. The merging processing mode based on the intersection ratio algorithm can refer to the following steps (1) to (3):
(1) Setting a cutting line based on the position of a first whole-body detection frame of the target object; wherein the cutting line cuts the first whole-body detection frame of the target object into a first upper region and a first lower region; the area of the first upper region is not larger than that of the first lower region, and the first upper region comprises the local detection frame; the cutting line cuts the second whole-body detection frame of the target object into a second upper region and a second lower region.
The first whole-body detection frame is obtained by regression from the first anchor point frame, which is set based on local features, so the part of the frame corresponding to the local features regresses more accurately; the cutting line can therefore be set based on that part of the frame. Referring to the schematic diagram of pedestrian detection frames shown in fig. 4, the solid-line frame is a first pedestrian detection frame generated from the first kind of anchor point frame set based on face features, and the dotted-line frame is a second pedestrian detection frame generated from the second kind of anchor point frame set based on whole-body features. The upper half of the first pedestrian detection frame regresses the face accurately, while its lower half may be too long or too short, for example failing to cover the part of the pedestrian below the knee. To make the cutting result more accurate, the cutting line may be set in the upper half of the first pedestrian detection frame, for example at the horizontal center line of the frame (half its height). The cutting line cuts the first whole-body detection frame of the target object into a first upper region and a first lower region, and cuts the second whole-body detection frame of the target object into a second upper region and a second lower region.
(2) An intersection ratio of the first upper region and the second upper region is calculated.
In this embodiment, if the number of target objects is one, the intersection ratio of the first upper region and the second upper region may be calculated directly. If there are multiple target objects, the intersection ratio between each first upper region and each second upper region may be calculated pairwise. The intersection ratio is the ratio of the intersection to the union of the areas of a first upper region and a second upper region. For ease of understanding, an example: suppose the first upper regions include a1, b1, and c1, and the second upper regions include a2, b2, and d2; then the following intersection ratios are calculated respectively: IoU(a1, a2), IoU(a1, b2), IoU(a1, d2), IoU(b1, a2), IoU(b1, b2), IoU(b1, d2), IoU(c1, a2), IoU(c1, b2), and IoU(c1, d2).
The intersection ratio of the first upper region and the second upper region can represent the degree of overlap between the first whole-body detection frame and the second whole-body detection frame. Because the two whole-body detection frames are divided by the same cutting line and the intersection ratio is calculated only over the upper regions, the result remains accurate while the amount of computation is significantly reduced, improving processing efficiency.
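Putting steps (1) and (2) together, the sketch below cuts both whole-body frames with a single horizontal line derived from the first frame and compares only the upper regions; it reuses the iou() helper from the NMS sketch above, and placing the cut at half the first frame's height is an assumption about the "center line" mentioned in the description.

```python
def upper_region(box, cut_y):
    """Part of a (x1, y1, x2, y2) box above the horizontal cutting line."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2, min(y2, cut_y))

first_box = (10, 0, 60, 160)   # first whole-body detection frame
second_box = (8, 2, 62, 170)   # second whole-body detection frame

# Cutting line derived from the first frame; half its height is one
# reading of the "center line" mentioned above.
cut_y = (first_box[1] + first_box[3]) / 2

# Intersection ratio of the two upper regions only (iou() as in the NMS sketch).
overlap = iou(upper_region(first_box, cut_y), upper_region(second_box, cut_y))
```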
(3) Merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection ratio to obtain the final whole-body detection frame of the target object.
Considering that the first anchor point frame is set based on local features, the position regression of the resulting local detection frame is more accurate, but the position regression of the resulting first whole-body detection frame is slightly less accurate; the second anchor point frame is set based on whole-body features, so the position regression of the resulting second whole-body detection frame is more accurate. Therefore, when merging the first and second whole-body detection frames, the less accurate first whole-body detection frame can be replaced with the more accurate second whole-body detection frame. The merging procedure is described below for the scenario with one target object and the scenario with multiple target objects.
For the scenario with one target object: if the intersection ratio is greater than the preset intersection ratio threshold, the first whole-body detection frame is deleted and the second whole-body detection frame is taken as the final whole-body detection frame of the target object.
For the scenario with multiple target objects: the merging of detection frames can include the following steps a and b:
step a, determining a first upper area and a second upper area belonging to the same target object based on the calculated intersection ratio and a preset intersection ratio threshold. The intersection ratio of the first upper area and the second upper area belonging to the same target object is greater than the intersection ratio of the first upper area and the second upper area belonging to different target objects, and the intersection ratio of the first upper area and the second upper area belonging to the same target object is greater than the intersection ratio threshold.
In practical application, first upper regions and second upper regions whose intersection ratio is greater than a preset intersection ratio threshold (such as 0.3) are determined as candidate first upper regions and candidate second upper regions. The intersection ratios between the candidate first upper regions and the candidate second upper regions are then calculated, and based on these, an optimal candidate second upper region is determined for each candidate first upper region, namely the one with the highest overlap with that candidate first upper region. For ease of understanding: if the intersection ratio of the second upper region b2 with the first upper region a1 is the largest, then b2 is the optimal candidate second upper region of a1; similarly, suppose the second upper region a2 is the optimal candidate second upper region of the first upper region b1. An optimal candidate first upper region is likewise determined for each candidate second upper region, for example: the intersection ratio of the first upper region a1 with the second upper region b2 is the largest, so a1 is the optimal candidate first upper region of b2; but the optimal candidate first upper region of the second upper region a2 is calculated to be c1, not b1. The first upper region a1 and the second upper region b2 are thus optimal matches for each other and are determined to belong to the same target object, whereas b1 and a2 are not mutual optimal matches.
Step b, for the same target object, deleting the first whole-body detection frame to which its first upper region belongs, and taking the second whole-body detection frame to which its second upper region belongs as the final whole-body detection frame of the target object.
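The matching-and-merging logic of steps a and b might look like the following sketch, which pairs first and second whole-body frames whose upper regions are mutual optimal matches above the threshold and keeps the second frame of each matched pair; it reuses iou() and upper_region() from the sketches above, works on plain (x1, y1, x2, y2) tuples, and assumes both input lists are non-empty.

```python
def merge_detections(first_boxes, second_boxes, cut_y, iou_threshold=0.3):
    """Pair first/second whole-body frames whose upper regions are mutual
    optimal matches above the threshold; the second frame of each matched
    pair replaces the first, and unmatched frames from both sets are kept."""
    def best(query_box, candidates):
        q = upper_region(query_box, cut_y)
        scores = [iou(q, upper_region(c, cut_y)) for c in candidates]
        j = max(range(len(scores)), key=scores.__getitem__)
        return j, scores[j]

    merged, matched_second = [], set()
    for i, fb in enumerate(first_boxes):
        j, score = best(fb, second_boxes)
        k, _ = best(second_boxes[j], first_boxes)
        if score > iou_threshold and k == i:   # mutual optimal match
            merged.append(second_boxes[j])     # second frame replaces first
            matched_second.add(j)
        else:
            merged.append(fb)                  # keep the unmatched first frame
    # second frames not matched to any first frame are also kept
    merged.extend(sb for j, sb in enumerate(second_boxes)
                  if j not in matched_second)
    return merged
```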
It can be understood that, in addition to the detection frames that belong to the same target object and can be merged, the first whole-body detection frames generated from the first kind of anchor point frame (together with the local detection frames) and the second whole-body detection frames generated from the second kind of anchor point frame may also include whole-body detection frames of other target objects. In practical application, the merged whole-body detection frames of matched target objects and the unmatched first and second whole-body detection frames of the remaining target objects can together serve as the detection result, which alleviates the problem of a low recall rate for occluded target objects and effectively improves detection accuracy. Here, the recall rate can be understood as the ratio of the number of targets in the detection result to the number of targets actually contained in the image to be detected, and measures how completely the target detection method recalls the targets to be detected. For example, suppose an image to be detected contains 10 pedestrians, 2 of whom have their lower bodies blocked with only the upper bodies exposed. An existing target detection method that can generally detect only the whole body of a pedestrian would detect only the 8 unblocked pedestrians and miss the 2 blocked ones, giving a recall rate of 80%. The target detection method provided by this embodiment can detect both the 8 unblocked pedestrians and the 2 pedestrians with blocked bodies, giving a recall rate of 100% and effectively improving the accuracy and reliability of target detection.
In summary, in this embodiment, target detection is performed on the image to be detected based on at least two anchor point frames set on different features (local features and whole-body features) of the category to which the target object belongs to obtain different initial detection results, and the different initial detection results are fused into a final detection result, effectively mitigating the impact on the detection result of feature loss caused by partial occlusion of the target object and improving the accuracy of the target detection result.
Example three:
Corresponding to the target detection method provided in Example two, an embodiment of the present invention provides a target detection apparatus. Referring to the structural block diagram of the target detection apparatus shown in fig. 5, the apparatus includes:
an image obtaining module 502, configured to obtain an image to be detected including a target object;
the target detection module 504 is configured to perform target detection on an image to be detected through at least two preset anchor boxes, and generate at least two initial detection results of a target object; each anchor point frame correspondingly generates an initial detection result, and the at least two anchor point frames comprise a first anchor point frame set based on the local characteristics of the category to which the target object belongs and a second anchor point frame set based on the whole body characteristics of the category to which the target object belongs;
and a fusion module 506, configured to fuse the at least two initial detection results to obtain a final detection result of the target object.
The target detection device provided by the embodiment of the invention performs target detection on the acquired image to be detected through at least two preset anchor point frames to generate at least two initial detection results of the target object, and then fuses the at least two initial detection results to obtain the final detection result of the target object. Each anchor point frame generates a corresponding initial detection result, and the at least two anchor point frames comprise a first anchor point frame set based on the local features of the category to which the target object belongs and a second anchor point frame set based on the whole-body features of that category. In this way, target detection can be performed on the image to be detected based on at least two anchor point frames set on different features (local features and whole-body features) of the category to which the target object belongs to obtain different initial detection results, and the different initial detection results are fused into a final detection result, effectively mitigating the impact on the detection result of feature loss caused by partial occlusion of the target object and improving the accuracy of the target detection result.
In one embodiment, the object detection module 504 is further configured to: perform target detection on the image to be detected based on the first anchor point frame to generate a first initial detection result of the target object, the first initial detection result comprising a local detection frame of the target object and a first whole-body detection frame of the target object, with the local detection frame positioned within the first whole-body detection frame; and perform target detection on the image to be detected based on the second anchor point frame to generate a second initial detection result of the target object, the second initial detection result comprising a second whole-body detection frame of the target object.
In one embodiment, the fusion module 506 is further configured to: fuse the first whole-body detection frame and the second whole-body detection frame of the target object to obtain a final whole-body detection frame of the target object.
In one embodiment, the fusion module 506 is further configured to: merge the first whole-body detection frame and the second whole-body detection frame of the target object based on an intersection-over-union algorithm to obtain the final whole-body detection frame of the target object.
In an embodiment, the fusion module 506 specifically includes: a cutting line setting unit for setting a cutting line based on a position of a first whole-body detection frame of the target object; the cutting line cuts the first whole body detection frame of the target object into a first upper region and a first lower region; the area of the first upper area is not larger than that of the first lower area, and the first upper area comprises a local detection frame; the cutting line cuts the second whole-body detection frame of the target object into a second upper region and a second lower region; an intersection ratio calculation unit for calculating an intersection ratio of the first upper region and the second upper region; and the merging processing unit is used for merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection ratio to obtain a final whole-body detection frame of the target object.
In one embodiment, the number of target objects is one; the fusion module 506 is further configured to: if the intersection ratio is greater than a preset intersection ratio threshold, delete the first whole-body detection frame and take the second whole-body detection frame as the final whole-body detection frame of the target object.
In one embodiment, the number of target objects is plural; the intersection ratio calculation unit is further configured to: calculating intersection ratios between the first upper regions of the plurality of target objects and the second upper regions of the plurality of target objects; the merge processing unit is further configured to: determining a first upper area and a second upper area belonging to the same target object based on the calculated intersection ratio and a preset intersection ratio threshold; the intersection ratio of the first upper area and the second upper area belonging to the same target object is greater than the intersection ratio of the first upper area and the second upper area belonging to different target objects, and the intersection ratio of the first upper area and the second upper area belonging to the same target object is greater than an intersection ratio threshold; for the same target object, a first whole body detection frame to which a first upper region of the target object belongs is deleted, and a second whole body detection frame to which a second upper region of the target object belongs is taken as a final whole body detection frame of the target object.
In one embodiment, the number of target objects is at least one, and there are a plurality of first whole-body detection frames and a plurality of second whole-body detection frames; the target detection device further comprises a filtering module (not shown in the figure) configured to: perform confidence-threshold filtering on the plurality of first whole-body detection frames and the plurality of second whole-body detection frames to obtain filtered first whole-body detection frames and filtered second whole-body detection frames; and process the filtered first whole-body detection frames and the filtered second whole-body detection frames with a non-maximum suppression algorithm to obtain the first whole-body detection frame and second whole-body detection frame corresponding to each target object.
In one embodiment, the filtering module is further configured to: judge, for each first whole-body detection frame, whether its confidence is lower than a preset first confidence threshold, the first confidence threshold being set based on the local features of the category to which the target object belongs; judge, for each second whole-body detection frame, whether its confidence is lower than a preset second confidence threshold, the second confidence threshold being set based on the whole-body features of the category to which the target object belongs; and filter out the first whole-body detection frames below the first confidence threshold and the second whole-body detection frames below the second confidence threshold to obtain the filtered first whole-body detection frames and the filtered second whole-body detection frames.
The device provided in this embodiment has the same implementation principle and technical effects as the foregoing method embodiment; for brevity, for any part of this device embodiment not mentioned here, reference may be made to the corresponding content in the foregoing method embodiment.
Example four:
Corresponding to the target detection method provided in Example two, an embodiment of the present invention provides a target detection system. The system includes: an image acquisition device, a processor, and a storage device; the image acquisition device is used for acquiring an image to be detected; and the storage device stores a computer program which, when executed by the processor, performs the target detection method of Example two above.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, for the specific working process of the system described above, reference may be made to the corresponding process in the foregoing method embodiment; details are not repeated here.
Further, the present embodiment also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the target detection method provided by the foregoing method embodiments.
The computer program product of the target detection method, device, and system provided in the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the method described in the foregoing method embodiments, and for the specific implementation, reference may be made to the method embodiments, which are not repeated here.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "coupled" are to be construed broadly, e.g., as a fixed connection, a removable connection, or an integral connection; as a mechanical connection or an electrical connection; and as a direct connection, an indirect connection through an intermediate medium, or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art on a case-by-case basis.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product; the software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In the description of the present invention, it should be noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," "outer," etc. indicate orientations or positional relationships based on those shown in the drawings, are used only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present invention, used to illustrate the technical solutions of the present invention and not to limit them; the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications, changes, or equivalent substitutions of some features may still be made to the described embodiments within the scope of the disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention and should be covered by the protection scope thereof. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A method of target detection, comprising:
acquiring an image to be detected containing a target object;
performing target detection on the image to be detected through at least two preset anchor frames to generate at least two initial detection results of the target object; wherein each anchor frame correspondingly generates one initial detection result, the at least two anchor frames comprise a first anchor frame set based on the local features of the category to which the target object belongs and a second anchor frame set based on the whole-body features of the category to which the target object belongs, a first initial detection result corresponding to the first anchor frame comprises a local detection frame of the target object and a first whole-body detection frame of the target object, and a second initial detection result corresponding to the second anchor frame comprises a second whole-body detection frame of the target object;
and fusing the first whole-body detection frame and the second whole-body detection frame of the target object to obtain a final whole-body detection frame of the target object.
2. The method according to claim 1, wherein the step of performing target detection on the image to be detected through at least two preset anchor frames to generate at least two initial detection results of the target object comprises:
performing target detection on the image to be detected based on the first anchor frame to generate the first initial detection result of the target object; wherein the local detection frame is positioned within the first whole-body detection frame; and
performing target detection on the image to be detected based on the second anchor frame to generate the second initial detection result of the target object.
3. The method according to claim 1, wherein the step of fusing the first whole-body detection frame and the second whole-body detection frame of the target object to obtain the final whole-body detection frame of the target object comprises:
merging the first whole-body detection frame and the second whole-body detection frame of the target object based on an intersection ratio algorithm to obtain the final whole-body detection frame of the target object.
4. The method according to claim 3, wherein the step of merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection ratio algorithm to obtain the final whole-body detection frame of the target object comprises:
setting a cutting line based on the position of the first whole-body detection frame of the target object; wherein the cutting line cuts the first whole-body detection frame of the target object into a first upper region and a first lower region, the area of the first upper region is not greater than that of the first lower region, and the first upper region includes the local detection frame; and the cutting line cuts the second whole-body detection frame of the target object into a second upper region and a second lower region;
calculating an intersection ratio of the first upper region and the second upper region;
and merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection ratio to obtain a final whole-body detection frame of the target object.
5. The method of claim 4, wherein the number of target objects is one;
the step of merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection ratio to obtain the final whole-body detection frame of the target object comprises:
if the intersection ratio is greater than a preset intersection ratio threshold, deleting the first whole-body detection frame, and taking the second whole-body detection frame as the final whole-body detection frame of the target object.
6. The method of claim 4, wherein the number of target objects is plural;
the step of calculating the intersection ratio of the first upper region and the second upper region includes:
calculating pairwise intersection ratios between the first upper regions of the plurality of target objects and the second upper regions of the plurality of target objects;
the step of merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection ratio to obtain the final whole-body detection frame of the target object comprises:
determining a first upper region and a second upper region belonging to the same target object based on the calculated intersection ratios and a preset intersection ratio threshold; wherein the intersection ratio of the first upper region and the second upper region belonging to the same target object is greater than the intersection ratio of a first upper region and a second upper region belonging to different target objects, and is greater than the intersection ratio threshold; and
deleting, for the same target object, the first whole-body detection frame to which the first upper region of the target object belongs, and taking the second whole-body detection frame to which the second upper region of the target object belongs as the final whole-body detection frame of the target object.
7. The method of claim 3, wherein the number of target objects is at least one, and the numbers of the first whole-body detection frames and the second whole-body detection frames are both plural;
before the step of merging the first whole-body detection frame and the second whole-body detection frame of the target object based on the intersection ratio algorithm, the method further comprises:
performing confidence threshold filtering on the plurality of first whole-body detection frames and the plurality of second whole-body detection frames to obtain filtered first whole-body detection frames and filtered second whole-body detection frames;
and processing the filtered first whole-body detection frame and the filtered second whole-body detection frame by adopting a non-maximum suppression algorithm to obtain a first whole-body detection frame and a second whole-body detection frame corresponding to each target object.
8. The method of claim 7, wherein the step of performing confidence threshold filtering on the plurality of first whole body detection boxes and the plurality of second whole body detection boxes to obtain filtered first whole body detection boxes and filtered second whole body detection boxes comprises:
judging, for each first whole-body detection frame, whether its confidence is lower than a preset first confidence threshold; wherein the first confidence threshold is a confidence threshold set based on the local features of the category to which the target object belongs;
judging, for each second whole-body detection frame, whether its confidence is lower than a preset second confidence threshold; wherein the second confidence threshold is a confidence threshold set based on the whole-body features of the category to which the target object belongs; and
filtering out the first whole-body detection frames whose confidence is lower than the first confidence threshold and the second whole-body detection frames whose confidence is lower than the second confidence threshold to obtain the filtered first whole-body detection frames and the filtered second whole-body detection frames.
9. An object detection device, comprising:
the image acquisition module is used for acquiring an image to be detected containing a target object;
the target detection module is used for performing target detection on the image to be detected through at least two preset anchor frames to generate at least two initial detection results of the target object; wherein each anchor frame correspondingly generates one initial detection result, the at least two anchor frames comprise a first anchor frame set based on the local features of the category to which the target object belongs and a second anchor frame set based on the whole-body features of the category to which the target object belongs, a first initial detection result corresponding to the first anchor frame comprises a local detection frame of the target object and a first whole-body detection frame of the target object, and a second initial detection result corresponding to the second anchor frame comprises a second whole-body detection frame of the target object;
and the fusion module is used for fusing the first whole-body detection frame and the second whole-body detection frame of the target object to obtain a final whole-body detection frame of the target object.
10. An object detection system, characterized in that the system comprises: the system comprises an image acquisition device, a processor and a storage device;
the image acquisition device is used for acquiring an image to be detected;
the storage device has stored thereon a computer program which, when executed by the processor, performs the method of any one of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of the preceding claims 1 to 8.
CN201910830745.3A 2019-09-02 2019-09-02 Target detection method, device and system Active CN110532985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910830745.3A CN110532985B (en) 2019-09-02 2019-09-02 Target detection method, device and system

Publications (2)

Publication Number Publication Date
CN110532985A (en) 2019-12-03
CN110532985B (en) 2022-07-22

Family

ID=68666621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910830745.3A Active CN110532985B (en) 2019-09-02 2019-09-02 Target detection method, device and system

Country Status (1)

Country Link
CN (1) CN110532985B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275002A (en) * 2020-02-18 2020-06-12 上海商汤临港智能科技有限公司 Image processing method and device and electronic equipment
CN111353473B (en) * 2020-03-30 2023-04-14 浙江大华技术股份有限公司 Face detection method and device, electronic equipment and storage medium
CN111507327B (en) * 2020-04-07 2023-04-14 浙江大华技术股份有限公司 Target detection method and device
CN111507286B (en) * 2020-04-22 2023-05-02 北京爱笔科技有限公司 Dummy detection method and device
CN111582177A (en) * 2020-05-09 2020-08-25 北京爱笔科技有限公司 Image detection method and related device
CN111597959B (en) * 2020-05-12 2023-09-26 盛景智能科技(嘉兴)有限公司 Behavior detection method and device and electronic equipment
CN111860493B (en) * 2020-06-12 2024-02-09 北京图森智途科技有限公司 Target detection method and device based on point cloud data
CN111814612A (en) * 2020-06-24 2020-10-23 浙江大华技术股份有限公司 Target face detection method and related device thereof
CN112287808B (en) * 2020-10-27 2021-08-10 江苏云从曦和人工智能有限公司 Motion trajectory analysis warning method, device, system and storage medium
CN112257692B (en) * 2020-12-22 2021-03-12 湖北亿咖通科技有限公司 Pedestrian target detection method, electronic device and storage medium
CN112949526B (en) * 2021-03-12 2024-03-29 深圳海翼智新科技有限公司 Face detection method and device
CN113537064A (en) * 2021-07-16 2021-10-22 河南牧原智能科技有限公司 Weak pig automatic detection marking method and system
CN113743261A (en) * 2021-08-23 2021-12-03 河南牧原智能科技有限公司 Pig body trauma detection method and device and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520229A (en) * 2018-04-04 2018-09-11 北京旷视科技有限公司 Image detecting method, device, electronic equipment and computer-readable medium
CN108694401A (en) * 2018-05-09 2018-10-23 北京旷视科技有限公司 Object detection method, apparatus and system
CN108898047A (en) * 2018-04-27 2018-11-27 中国科学院自动化研究所 The pedestrian detection method and system of perception are blocked based on piecemeal
CN109145898A (en) * 2018-07-26 2019-01-04 清华大学深圳研究生院 A kind of object detecting method based on convolutional neural networks and iterator mechanism
CN109766796A (en) * 2018-12-20 2019-05-17 西华大学 A kind of depth pedestrian detection method towards dense population
CN110059547A (en) * 2019-03-08 2019-07-26 北京旷视科技有限公司 Object detection method and device
CN110096933A (en) * 2018-01-30 2019-08-06 华为技术有限公司 The method, apparatus and system of target detection

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10210418B2 (en) * 2016-07-25 2019-02-19 Mitsubishi Electric Research Laboratories, Inc. Object detection system and object detection method
CN108985135A (en) * 2017-06-02 2018-12-11 腾讯科技(深圳)有限公司 A kind of human-face detector training method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant