WO2020181872A1 - Object detection method and apparatus, and electronic device - Google Patents

Object detection method and apparatus, and electronic device

Info

Publication number
WO2020181872A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
group
box
detection
preselection
Prior art date
Application number
PCT/CN2019/126435
Other languages
French (fr)
Chinese (zh)
Inventor
李作新
俞刚
袁野
Original Assignee
北京旷视科技有限公司 (Beijing Megvii Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 北京旷视科技有限公司 (Beijing Megvii Technology Co., Ltd.)
Publication of WO2020181872A1 publication Critical patent/WO2020181872A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition

Definitions

  • This application relates to the technical field of image processing, and in particular to an object detection method and apparatus, and an electronic device.
  • Object detection is one of the classic problems in computer vision. Its task is to mark the position of each object in an image with a bounding box and to give the object's category. From the traditional framework of hand-crafted features plus shallow classifiers to end-to-end detection frameworks based on deep learning, object detection has become increasingly mature. At present, when multiple objects, especially similar objects, appear densely and occlude one another, existing object detection algorithms only consider detection at the category level, so the prior art cannot perform accurate object detection under occlusion. When objects occlude each other, prior-art methods often fail to effectively distinguish the occluding object from the occluded object, resulting in missed detection of the occluded object.
  • This application aims to provide an object detection method and apparatus, and an electronic device, to alleviate the technical problem in the prior art that similar objects are prone to missed detection when objects are detected under dense occlusion.
  • An embodiment of the present application provides an object detection method, including: acquiring a to-be-processed image containing one or more detection objects; performing object detection on the to-be-processed image to obtain at least one preselection box, wherein the preselection box includes a visible frame and/or a complete frame, the complete frame is a bounding frame of a detection object as a whole, and the visible frame is a bounding frame of the visible area of each detection object in the image to be processed; determining, through an association modeling model, the group to which each preselection box in the at least one preselection box belongs, to obtain at least one preselection box group, where preselection boxes in the same preselection box group belong to the same detection object; performing deduplication processing on each preselection box group to obtain preselection box groups after the deduplication processing; and determining the target detection frame of each detection object based on the preselection box groups after the deduplication processing.
  • Determining the group to which each preselection box in the at least one preselection box belongs through the association modeling model to obtain at least one preselection box group includes: obtaining the attribute feature vector of each preselection box in the at least one preselection box through the instance attribute feature projection network of the association modeling model; and determining, through the clustering module of the association modeling model, the group to which each preselection box in the at least one preselection box belongs based on the attribute feature vector of each preselection box, to obtain the at least one preselection box group.
  • The instance attribute feature projection network is obtained through training with an Lpull loss function and an Lpush loss function, where the Lpull loss function shortens the distance between the attribute feature vectors of preselection boxes belonging to the same detection object, and the Lpush loss function widens the distance between the attribute feature vectors of preselection boxes belonging to different detection objects.
  • Determining the group to which each preselection box in the at least one preselection box belongs based on the attribute feature vector of each preselection box to obtain the at least one preselection box group includes: calculating the vector distance value between any two of the attribute feature vectors to obtain multiple vector distance values; adding the two preselection boxes whose vector distance value is less than a preset threshold to the same group, and regarding each other preselection box that is not added to any group as a group on its own; and clustering the at least one obtained group with a clustering algorithm to obtain the at least one preselection box group.
  • Each preselection box group includes a visible frame group and a complete frame group. Performing deduplication processing on each preselection box group to obtain the preselection box groups after the deduplication processing includes: performing deduplication processing on the visible frame group to obtain the visible frame group after the deduplication processing. Determining the target detection frame of each detection object based on the preselection box groups after the deduplication processing includes: determining the target detection frame of each detection object based on the visible frame group after the deduplication processing and the complete frame group.
  • Performing deduplication processing on the visible frame group in the at least one preselection box group to obtain the visible frame group after the deduplication processing includes: performing deduplication processing on the visible frame group in the at least one preselection box group by using a non-maximum suppression algorithm to obtain the visible frame group after the deduplication processing.
  • Determining the target detection frame of each detection object based on the visible frame group after the deduplication processing and the complete frame group includes: performing local feature alignment processing on each visible frame in the visible frame group after the deduplication processing, and performing local feature alignment processing on each complete frame in the complete frame group; inputting the visible frames after the feature alignment processing and the complete frames after the feature alignment processing into a target detection model for detection processing, to obtain the position coordinates and classification probability values of the visible frames after the feature alignment processing and the position coordinates and classification probability values of the complete frames after the feature alignment processing; and determining the target detection frame of each detection object based on target position coordinates and target classification probability values,
  • where the target position coordinates include the position coordinates of the visible frames after the feature alignment processing and/or the position coordinates of the complete frames after the feature alignment processing,
  • and the target classification probability values include the classification probability values of the visible frames after the feature alignment processing and/or the classification probability values of the complete frames after the feature alignment processing.
  • Determining the target detection frame of each detection object based on the target position coordinates and the target classification probability values includes: using each target classification probability value as the weight of the corresponding target position coordinates; and calculating a weighted average of the target position coordinates of each detection object to obtain the target detection frame of the detection object, where the target detection frame includes a target visible frame and/or a target complete frame.
  • Performing object detection on the to-be-processed image to obtain at least one preselection box includes: inputting the to-be-processed image into a feature pyramid network for processing to obtain a feature pyramid; and processing the feature pyramid by using a region proposal network (RPN) model to obtain the at least one preselection box, where each preselection box in the at least one preselection box carries an attribute label, the attribute label is configured to determine the type of each preselection box, and the type includes a complete frame and a visible frame.
  • An embodiment of the present application also provides an object detection device, including: an image acquisition unit configured to acquire a to-be-processed image containing one or more detection objects; a preselection box acquisition unit configured to perform object detection on the image to be processed to obtain at least one preselection box, wherein the preselection box includes a visible frame and/or a complete frame, the complete frame is a bounding frame of a detection object as a whole, and the visible frame is a bounding frame of the visible area of each detection object in the image to be processed; a grouping unit configured to determine, through the association modeling model, the group to which each preselection box in the at least one preselection box belongs, to obtain at least one preselection box group, where preselection boxes in the same preselection box group belong to the same detection object; a deduplication unit configured to perform deduplication processing on each preselection box group to obtain preselection box groups after the deduplication processing; and a determining unit configured to determine the target detection frame of each detection object based on the preselection box groups after the deduplication processing.
  • An embodiment of the present application also provides an electronic device, including a memory and a processor, where the memory stores a computer program that can run on the processor, and the processor implements the steps of the above method when executing the computer program.
  • The embodiments of the present application also provide a computer-readable medium having non-volatile program code executable by a processor, where the program code causes the processor to execute the foregoing method.
  • In the embodiments of the present application, an image to be processed containing one or more detection objects is first acquired; then, object detection is performed on the image to be processed to obtain at least one preselection box; next, the group to which each preselection box in the at least one preselection box belongs is determined, so that at least one preselection box group is obtained. Since the preselection boxes in the same preselection box group belong to the same detection object, preselection boxes belonging to different detection objects are kept apart by the preselection box groups, which prevents the preselection boxes of an occluded object from being removed as redundant preselection boxes of the occluding object during deduplication. This alleviates the technical problem in the prior art that similar objects are prone to missed detection when objects are detected under dense occlusion, realizes the detection of one or more detection objects in the image to be processed, and effectively avoids missed detection of the detection objects.
  • The association modeling model is realized by a neural network. After at least one preselection box is input into the association modeling model, the feature information of the image inside each preselection box and the position information of the preselection box are fully utilized to group the preselection boxes, which can effectively distinguish the preselection boxes of different detection objects. In particular, in dense occlusion scenes, when the complete frames of an occluding object and an occluded object overlap to a high degree, preselection boxes that are adjacent in position and similar in size but belong to different detection objects can be accurately grouped.
  • Fig. 1 is a schematic diagram of an electronic device according to an embodiment of the present application.
  • Fig. 2 is a flowchart of an object detection method according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of visible frames and complete frames of densely occluded similar objects according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a correspondence between preselection boxes and detection objects according to an embodiment of the present application.
  • Fig. 5 is a schematic diagram of an object detection device according to an embodiment of the present application.
  • an example electronic device 100 configured to implement the object detection method of an embodiment of the present application is described with reference to FIG. 1.
  • The electronic device 100 includes one or more processors 102, one or more storage devices 104, an input device 106, an output device 108, and a camera 110. These components are interconnected through a bus system 112 and/or other forms of connection mechanisms (not shown). It should be noted that the components and structure of the electronic device 100 shown in FIG. 1 are only exemplary and not restrictive, and the electronic device may also have other components and structures as required.
  • The processor 102 may be implemented in at least one hardware form among a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic array (PLA), and an application specific integrated circuit (ASIC). The processor 102 may be a central processing unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control other components in the electronic device 100 to perform desired functions.
  • the storage device 104 may include one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include random access memory (RAM) and/or cache memory (cache), for example.
  • the non-volatile memory may include, for example, read only memory (ROM), hard disk, flash memory, and the like.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to implement the client functions (implemented by the processor) in the embodiments of the present application described below and/or other desired functions.
  • Various application programs and various data, such as data used and/or generated by the application programs, can also be stored in the computer-readable storage medium.
  • the input device 106 may be a device used by a user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, and a touch screen.
  • the output device 108 may output various information (for example, images or sounds) to the outside (for example, a user), and may include one or more of a display, a speaker, and the like.
  • the camera 110 is configured to obtain a to-be-processed image, wherein the to-be-processed image obtained by the camera is processed by the object detection method to obtain a target detection frame of the detected object.
  • The camera can capture images desired by the user (such as photos and videos), and then the images are processed by the object detection method to obtain the target detection frame of the detection object; the camera may also store the captured images in the storage device 104 for use by other components.
  • the example electronic device configured to implement the object detection method according to the embodiment of the present application may be implemented on a mobile terminal such as a smart phone and a tablet computer.
  • An embodiment of an object detection method is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system, such as with a set of computer-executable instructions, and although a logical sequence is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.
  • Fig. 2 is a flowchart of an object detection method according to an embodiment of the present application. As shown in Fig. 2, the method includes the following steps:
  • Step S202 Obtain a to-be-processed image containing one or more detection objects.
  • the image to be processed may include multiple types of detection objects, for example, including humans and non-humans.
  • Non-human objects include dynamic objects and static objects: dynamic objects may be animal objects, and static objects may be stationary objects other than humans and animals.
  • In each image to be processed there can be multiple categories of objects, and there can be one or more objects of each category; for example, an image may contain 2 people and 3 dogs.
  • The various objects in the image to be processed may be displayed independently of each other, or some objects may be occluded by other objects and thus not fully displayed.
  • the detection object may be one or more types of objects in the object detection step to be performed in the image to be processed.
  • the user can determine the category of the detection object according to actual needs, which is not specifically limited in this embodiment.
  • The image to be processed may be an image captured by the camera of the electronic device in Embodiment 1, or may be an image pre-stored in the memory of the electronic device; this embodiment imposes no specific restriction on this.
  • Step S204 Perform object detection on the image to be processed to obtain at least one preselected box, where the preselected box includes a visible frame and/or a complete frame, and the complete frame is an enclosing frame for an entire detection object.
  • the visible frame is a bounding frame of the visible area of each detection object in the image to be processed.
  • object detection can be performed on the image to be processed through the pre-selection box detection network.
  • The process of performing object detection on the image to be processed may be: performing object detection on an unoccluded detection object in the image to be processed and outputting a complete frame; for an occluded detection object in the image to be processed, performing object detection and outputting a complete frame and a visible frame at the same time.
  • Multiple visible frames or multiple complete frames may be generated for the same detection object, and different visible frames or different complete frames may have different scales relative to the image to be processed.
  • Step S206 Determine the group to which each preselection box belongs in the at least one preselection box through the association modeling model to obtain at least one preselection box group; the preselection boxes in the same preselection box group belong to the same detection object.
  • Generally, a plurality of preselection boxes are generated for different detection objects, where the preselection boxes include visible frames and/or complete frames. The preselection boxes contained in the detection result are redundant and need to be deduplicated. Before deduplication, the group to which each preselection box belongs needs to be determined; each group corresponds to one preselection box group, so at least one preselection box group can be obtained, and preselection boxes belonging to different detection objects can thus be distinguished by the preselection box groups.
  • The association modeling model is a model that can capture the relevance of its input data, and it can be realized by a neural network. After at least one preselection box is input into the association modeling model, the model groups the preselection boxes effectively based on the feature information of the image inside each preselection box combined with the position information of the preselection box.
  • The preselection boxes belonging to the same detection object can be grouped into one preselection box group. Since the preselection box group of the same detection object may include both visible frames and complete frames, the preselection box group can also include a visible frame group and a complete frame group at the same time.
  • FIG. 4 shows a schematic diagram of the correspondence between preselection boxes and detection objects.
  • The detection objects include an occluding object P and an occluded object Q that is occluded by the occluding object P. The preselection boxes include the seventh box 7 through the twelfth box 12. The seventh box 7, the eighth box 8 and the ninth box 9 all belong to the occluding object P in the figure, and the tenth box 10, the eleventh box 11 and the twelfth box 12 all belong to the occluded object Q in the figure. The seventh box 7, the eighth box 8 and the ninth box 9 form one preselection box group, and the tenth box 10, the eleventh box 11 and the twelfth box 12 form another preselection box group.
  • The preselection boxes in each preselection box group can then be deduplicated separately, which prevents confusion between the boxes of different objects and prevents a preselection box of the occluded object Q (for example, the tenth box 10) from being removed as a redundant preselection box of the occluding object P during deduplication, greatly reducing the probability of missed detection of occluded objects.
  • Step S208 Perform deduplication processing on each preselection box group to obtain the preselection box group after deduplication processing.
  • The preselection box group of each detection object is deduplicated separately, so the preselection boxes of different detection objects are not confused with each other. Specifically, this avoids removing a preselection box of the occluded object as a redundant preselection box of the occluding object during deduplication, thereby avoiding missed detection of the occluded object.
  • Step S210 Determine the target detection frame of each detection object based on the preselected frame group after the de-duplication processing.
  • The target detection frame of each detection object may be determined based on the preselection box group after the deduplication processing. If the detection object is not occluded in the image to be processed, its target detection frame includes the target complete frame; if the detection object is occluded in the image to be processed, its target detection frame includes the target complete frame and the target visible frame.
  • The target complete frame can be used to obtain the position information of a detection object and the image feature information of detection objects that are not occluded; the target visible frame can be used to obtain the image feature information of occluded objects.
  • The embodiments of the present application can thus obtain two types of target detection frames, and can therefore obtain more comprehensive and more accurate detection object information for subsequent image processing such as recognition and verification.
  • step S202 to step S210 may be executed by the processor in the electronic device in the foregoing embodiment 1.
  • Any processor capable of executing steps S202 to S210 described above can be applied in the embodiments of the present application; there is no specific limitation on this.
  • In the embodiments of the present application, an image to be processed containing one or more detection objects is first acquired; then, object detection is performed on the image to be processed to obtain at least one preselection box; next, the group to which each preselection box in the at least one preselection box belongs is determined, so that at least one preselection box group is obtained. Since the preselection boxes in the same preselection box group belong to the same detection object, preselection boxes belonging to different detection objects are kept apart by the preselection box groups, which prevents the preselection boxes of an occluded object from being removed as redundant preselection boxes of the occluding object during deduplication. This alleviates the technical problem in the prior art that similar objects are prone to missed detection when objects are detected under dense occlusion, realizes the detection of one or more detection objects in the image to be processed, and effectively avoids missed detection of the detection objects.
  • The association modeling model is implemented by a neural network. After at least one preselection box is input into the association modeling model, the feature information of the image in each preselection box and the position information of the preselection box are effectively used to group the preselection boxes, which can effectively distinguish the preselection boxes of different detection objects. In particular, in dense occlusion scenes, when the complete frames of the occluding object and the occluded object have a high degree of overlap, preselection boxes that are similar in location and size but belong to different detection objects are accurately grouped.
  • the image to be processed containing one or more detection objects is first acquired. After that, object detection can be performed on the image to be processed to obtain at least one preselection box.
  • In step S204, performing object detection on the image to be processed to obtain at least one preselection box includes the following steps:
  • Step S2041 input the to-be-processed image into a feature pyramid network for processing to obtain a feature pyramid;
  • Step S2042 Process the feature pyramid by using a regional candidate network RPN (Region Proposal Networks) model to obtain the at least one preselection box, wherein each preselection box in the at least one preselection box carries an attribute label, and The attribute label is configured to determine the type of each pre-selected box, the type includes a complete box and a visible box.
  • the feature pyramid network is configured to generate a feature pyramid.
  • Basic network models such as the VGG (Visual Geometry Group) 16 model, ResNet, or FPN (Feature Pyramid Networks) can be selected as the feature pyramid network.
  • the image to be processed may be input into the feature pyramid network for processing to obtain a feature pyramid.
  • Before the region proposal network RPN (Region Proposal Networks) model is used to process the feature pyramid, the basic network model (for example, FPN) and the RPN model are trained together on a preset training set.
  • the preset training set includes multiple training samples, and each training sample includes: a training image and its corresponding image label.
  • the image label is configured to mark the type of the preselection box in the training image, and the type includes a complete box or a visible box.
  • multiple training samples can be used to train the RPN model, so that the RPN model can recognize and identify the preselection box type in the image.
  • The trained region proposal network RPN model can then be used to process the feature pyramid to obtain at least one preselection box, and each preselection box carries an attribute label configured to characterize whether the preselection box is a visible frame or a complete frame.
  • The attribute label may be expressed as "1" or "2"; for example, "1" indicates that the preselection box is a visible frame, and "2" indicates that the preselection box is a complete frame.
  • other data that can be recognized by the machine can also be selected as the attribute tag, which is not specifically limited in this embodiment.
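  • As an illustration only (not code from the application), the RPN output described above can be represented as an array in which each row holds a preselection box and its attribute label; the array layout and the sample values below are assumptions. A minimal Python sketch:

```python
import numpy as np

# Hypothetical RPN output: one row per preselection box,
# [x1, y1, x2, y2, attribute_label], where per the text the attribute
# label "1" marks a visible frame and "2" marks a complete frame.
preselection_boxes = np.array([
    [10.0, 20.0, 110.0, 220.0, 2],   # complete frame of an occluding object
    [12.0, 18.0, 108.0, 224.0, 2],   # another complete frame, same object
    [60.0, 20.0, 110.0, 220.0, 1],   # visible frame of an occluded object
])

# Split the preselection boxes by attribute label.
visible_frames = preselection_boxes[preselection_boxes[:, 4] == 1][:, :4]
complete_frames = preselection_boxes[preselection_boxes[:, 4] == 2][:, :4]
```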
  • the group to which each preselection box belongs in the at least one preselection box can be determined, and at least one preselection box group can be obtained.
  • In step S206, determining the group to which each preselection box in the at least one preselection box belongs through the association modeling model and obtaining at least one preselection box group includes the following steps:
  • Step S11 Obtain the attribute feature vector of each preselection box in the at least one preselection box through the instance attribute feature projection network of the association modeling model;
  • Step S12 through the clustering module of the association modeling model, determine the group to which each preselection box belongs in the at least one preselection box based on the attribute feature vector of each preselection box to obtain the at least one preselection box group.
  • the association modeling model may be an associate embedding model.
  • the instance attribute feature projection network in the association modeling model can be an embedding encoding (also called embedded coding) network.
  • At least one preselection box is input into the embedding encoding network of the association modeling model, which returns a corresponding attribute feature vector for each preselection box, so that each preselection box corresponds to one attribute feature vector. Then, the preselection boxes of the same detection object are grouped into the same group by the clustering module according to the attribute feature vectors, and different groups correspond to different detection objects.
  • Before the associate embedding model is used to determine the group to which each preselection box belongs, the embedding encoding network in the associate embedding model needs to be trained, in order to determine what kind of attribute feature vector the embedding encoding network outputs. In the training process, the constraint condition on the attribute feature vectors is the distance between them, which can be a Euclidean distance or a cosine distance. A first constraint condition can be used to shorten the distance between the attribute feature vectors of preselection boxes belonging to the same detection object, so that preselection boxes belonging to the same detection object can be added to the same group through their attribute feature vectors; a second constraint condition extends the distance between the attribute feature vectors of preselection boxes belonging to different detection objects, so that preselection boxes belonging to different detection objects are added to different groups through their attribute feature vectors.
  • the first constraint condition may be the Lpull loss function
  • the second constraint condition may be the Lpush loss function.
  • The Lpull loss function can be used first to train the embedding encoding network, and the Lpush loss function can then be used to train it to extend the distances; it is also possible to train the embedding encoding network with the Lpull loss function and the Lpush loss function at the same time.
  • Consistent with the definitions given for its variables, the Lpull loss function can be written as

$$L_{pull} = \frac{1}{M}\sum_{m}\frac{1}{C_m}\sum_{e_k,\,e_j \in m}\left\|e_k - e_j\right\|^2,$$

where M is the number of attribute feature vectors, $e_k$ and $e_j$ both represent arbitrary attribute feature vectors, and $C_m$ represents the number of attribute feature vectors corresponding to the corresponding detection object m; and the Lpush loss function can be written as

$$L_{push} = \frac{1}{M}\sum_{e_k \in m,\; e_j \notin m}\max\left(0,\,\Delta - \left\|e_k - e_j\right\|\right),$$

where M is the number of attribute feature vectors, $e_k$ and $e_j$ both represent arbitrary attribute feature vectors, and $\Delta$ represents a preset distance value.
  • After the embedding encoding network training is completed and the preselection boxes are obtained through the region proposal network RPN model, the embedding encoding network is used to obtain the attribute feature vector of each preselection box, that is, the embedding value.
  • the embedding value can be an N-dimensional vector, and an N-dimensional vector is obtained for each preselection box.
  • The N-dimensional vector can be expressed as: [a_1, a_2, ..., a_N].
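  • The application describes the losses only in terms of M, e_k, e_j, C_m and Δ, so the following PyTorch code is a minimal sketch of a common associative-embedding variant of the two losses; the mean-based formulation, the function name and the tensor shapes are illustrative assumptions, not the application's own implementation.

```python
import torch

def pull_push_losses(embeddings: torch.Tensor,
                     group_ids: torch.Tensor,
                     delta: float = 1.0):
    """embeddings: (M, N) attribute feature vectors (embedding values);
    group_ids: (M,) index of the detection object each preselection box
    belongs to. Returns (l_pull, l_push)."""
    ids = group_ids.unique()
    # Mean embedding per detection object.
    means = torch.stack([embeddings[group_ids == i].mean(dim=0) for i in ids])

    # Lpull: pull each box's vector toward the mean of its own object,
    # shortening distances within the same detection object.
    l_pull = embeddings.new_zeros(())
    for n, i in enumerate(ids):
        members = embeddings[group_ids == i]
        l_pull = l_pull + ((members - means[n]) ** 2).sum(dim=1).mean()
    l_pull = l_pull / len(ids)

    # Lpush: push the means of different objects at least delta apart,
    # widening distances between different detection objects.
    if len(ids) > 1:
        dists = torch.cdist(means, means)        # pairwise mean distances
        off_diag = ~torch.eye(len(ids), dtype=torch.bool)
        l_push = torch.clamp(delta - dists[off_diag], min=0).mean()
    else:
        l_push = embeddings.new_zeros(())
    return l_pull, l_push
```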
  • The purpose of obtaining the attribute feature vector is to distinguish different object instances (that is, detection objects) in the preselection boxes. The feature vector therefore needs instance-level distinguishing ability, able to distinguish each individual detection object, not merely category-level distinguishing ability (distinguishing the types of detection objects). This places certain requirements on the choice of feature extraction network, and the attribute feature vector (embedding value) obtained by the instance attribute feature projection network has good instance-level distinguishing ability.
  • The attribute feature vector (embedding encoding) is generated by directly optimizing the association relationships of the actual preselection boxes using the association modeling model with grouping relationships (associate embedding), and it is optimized directly for the preselection box grouping task, so a direct and good performance improvement can be obtained.
  • the instance attribute feature projection network is implemented by a neural network, which can be integrated with the detection network of the preselection box (such as the feature pyramid network and the regional candidate network RPN), and the two share the basic features of the network, reducing the amount of calculation.
  • the detection network training process of the preselection box can be directly combined with the instance attribute feature projection network to realize the joint training of the two overall networks without adding other external information, and the training process is relatively simple.
  • the Euclidean distance between two N-dimensional vectors can be judged by setting a preset threshold. For example, for the preset threshold x, if the Euclidean distance between the N-dimensional vectors of two different preselection boxes is less than x, the distance between the two preselection boxes is considered to be small, and they are considered to belong to the same group.
  • the detection object to which each preselection box belongs can be accurately determined, thereby further reducing the probability of missed detection of the detection object.
  • Determining, through the clustering module of the association modeling model, the group to which each preselection box in the at least one preselection box belongs based on the attribute feature vector of each preselection box to obtain the at least one preselection box group can be realized by the following implementation:
  • Step S1 Calculate the vector distance value between any two of the attribute feature vectors to obtain multiple vector distance values
  • Step S2 adding two preselected boxes that are less than a preset threshold among the plurality of vector distance values to the same group, and each other preselected box that is not added to the group is separately regarded as a group;
  • Step S3 clustering the at least one obtained group by a clustering algorithm, to obtain the at least one preselected box group.
  • the above-mentioned embedding encoding network is used to regress all preselection boxes to obtain the attribute feature vector, and the vector distance value between any two attribute feature vectors is calculated respectively.
  • The vector distance value can be calculated by a distance calculation method such as the Euclidean distance.
  • The size of the preset threshold can be determined according to actual needs or experience, which is not specifically limited in this embodiment. If a vector distance value is less than the preset threshold, it can be determined to be a target vector distance value, and the two preselection boxes corresponding to the target vector distance value are considered to correspond to the same detection object; therefore, the two preselection boxes corresponding to the target vector distance value are added to the same group. Each preselection box whose attribute feature vector is not less than the preset threshold away from all other attribute feature vectors is regarded as a group on its own. Thus, at least one group can be obtained.
  • the at least one obtained group is clustered and grouped by a clustering algorithm.
  • The clustering algorithm can be a commonly used algorithm, for example, a K-means clustering algorithm or a mean shift clustering algorithm.
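  • Steps S1 to S3 can be sketched in Python as follows; the union-find transitive grouping here stands in for the final clustering pass (which, as noted above, may be K-means or mean shift), and all names are hypothetical.

```python
import numpy as np

def group_preselection_boxes(embeddings: np.ndarray, threshold: float):
    """embeddings: (M, N) attribute feature vectors, one per preselection
    box. Returns a group id for every box."""
    m = len(embeddings)
    parent = list(range(m))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step S1: vector distance value between any two attribute feature vectors.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Step S2: boxes whose distance is below the preset threshold join the
    # same group; boxes joined with nobody remain singleton groups.
    for i in range(m):
        for j in range(i + 1, m):
            if dist[i, j] < threshold:
                parent[find(i)] = find(j)

    # Step S3 (simplified): relabel the resulting groups consecutively.
    roots = sorted({find(i) for i in range(m)})
    relabel = {r: g for g, r in enumerate(roots)}
    return [relabel[find(i)] for i in range(m)]
```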
  • For example, suppose eight preselection boxes f1 to f8 are obtained, and the embedding encoding algorithm is used to regress an attribute feature vector, that is, an embedding value, for each of them. If the vector distances among f1, f2 and f3 are all less than the preset threshold, they are added to one group, and likewise f4, f5 and f8; the preselection boxes f6 and f7, whose distances to all other boxes are not less than the threshold, are each regarded as a group on their own. The grouping result then includes four groups: one group includes preselection boxes f1, f2 and f3; one group includes preselection boxes f4, f5 and f8; one group includes preselection box f6; and one group includes preselection box f7. Cluster grouping the four obtained groups yields four preselection box groups.
  • As another example, suppose the image to be processed contains detection objects A, B and C, and the embedding encoding algorithm regresses the attribute feature vectors of preselection boxes f1 to f4, that is, the embedding values a1 to a4. If the vector distance between a1 and a4 is less than the preset threshold, the boxes corresponding to vector a1 and vector a4 are considered to belong to the same detection object among A, B and C. If the vector distances between a1 and a2, between a1 and a3, and between a2 and a3 are not less than the preset threshold, the vectors a1, a2 and a3 are considered to belong to different detection objects. If, furthermore, the vector distances between a2 and a4 and between a3 and a4 are not less than the preset threshold, then the vector a2 belongs to one of the detection objects A, B or C, and the vector a3 belongs to a different one, both also different from the detection object corresponding to vector a1 and vector a4. That is, the grouping result obtained may be: vector a1 and vector a4 belong to A, vector a2 belongs to B, and vector a3 belongs to C.
  • After the group to which each preselection box in the at least one preselection box belongs is determined and at least one preselection box group is obtained, each preselection box group can be deduplicated to obtain the preselection box groups after the deduplication processing, and the target detection frame of each detection object is determined based on the preselection box groups after the deduplication processing.
  • each preselection box group may include a visible box group and a complete box group.
  • In step S208, performing deduplication processing on each preselection box group to obtain the preselection box groups after the deduplication processing includes: performing deduplication processing on the visible frame group in the at least one preselection box group to obtain the visible frame group after the deduplication processing.
  • The visible frame group after the deduplication processing may include a single visible frame or several visible frames.
  • Step S210 determining the target detection frame of each detection object based on the preselected frame group after the deduplication processing includes: determining the target detection frame of each detection object based on the visible frame group after the deduplication processing and the complete frame group .
  • an image to be processed containing one or more detection objects is acquired; then, object detection is performed on the image to be processed to obtain at least one preselected box; then, each of the at least one preselected box is determined At least one preselection box group is obtained from the group to which each preselection box belongs; next, the visible box group in at least one preselection box group is deduplicated to obtain the visible box group after the deduplication processing; finally, based on the deduplication processing The visible frame group and the complete frame group determine the target detection frame of each detection object.
  • the detection objects identified in the embodiments of the present application may be densely present in the image to be processed, resulting in a higher degree of coincidence of the complete frames of the detection objects.
  • The visible frame group after deduplication and the complete frame group without deduplication can be input into an R-CNN model for object detection, and the target detection frame of each detection object can then be obtained. When object detection is performed again, for an occluded object, only the visible frame group or only the complete frame group may be used as the input of the R-CNN model to improve detection efficiency, or the visible frame group and the complete frame group may be used together as the input of the R-CNN model to improve detection accuracy; this embodiment imposes no specific restriction on this.
  • The step of performing deduplication processing on the visible frame group in the at least one preselection box group to obtain the visible frame group after the deduplication processing includes: performing deduplication processing on the visible frame group in the at least one preselection box group by using a non-maximum suppression algorithm to obtain the visible frame group after the deduplication processing.
  • In this embodiment, a non-maximum suppression (NMS) algorithm is used to remove redundant preselection boxes from the preselection box group; by setting the threshold value in the NMS algorithm, the visible frame group in the preselection box group is deduplicated. After the preselection box group of each detection object is obtained, since the complete frames in a complete frame group have a high degree of overlap with one another, the complete frames may be left without deduplication; the NMS algorithm is therefore used to deduplicate only the visible frame group, obtaining the visible frame group after the deduplication processing. That is, in this embodiment, after the preselection box group of a detection object is obtained, if the preselection box group includes a visible frame group and a complete frame group, only the visible frame group of the detection object is deduplicated.
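  • A standard non-maximum suppression routine of the kind that could implement this step is sketched below in plain NumPy; in this embodiment it would be applied only to the visible frames of one preselection box group, and the IoU threshold value is illustrative.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    """boxes: (K, 4) visible frames of one group as [x1, y1, x2, y2];
    scores: (K,) confidence of each frame. Returns indices of kept frames."""
    order = scores.argsort()[::-1]          # highest-scoring frame first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept frame with the remaining frames.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        # Drop frames that overlap the kept frame too much (redundant frames).
        order = order[1:][iou <= iou_thresh]
    return keep
```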
  • In FIG. 3, the first frame 1 and the third frame 3 are the complete frames of the occluding object P and the occluded object Q, respectively. In the prior art, the nms algorithm deduplicates the preselection boxes of all detection objects of the same category together, and cannot distinguish and recognize instances (different detection objects) well. The dashed second frame 2 is the visible frame of the occluded object Q. It can be seen that the overlap between the second frame 2, which encloses the visible part of the occluded object Q, and the first frame 1 of the occluding object P is significantly smaller than the overlap between the third frame 3 and the first frame 1. Therefore, the occluding object P and the occluded object Q can be distinguished by the second frame 2, and the second frame 2 as a visible frame is bound with the third frame 3 as a complete frame into a preselection box group, which avoids the third frame 3 being removed as a redundant frame of the occluding object P during deduplication.
  • the calculation process can be simplified, the calculation speed and calculation accuracy of the R-CNN model can be improved, and a more accurate target detection frame can be obtained.
  • the step of determining the target detection frame of each detection object based on the visible frame group after the deduplication processing and the complete frame group includes:
  • Step S21 performing local feature alignment processing on each visible frame in the visible frame group after the deduplication processing; and performing local feature alignment processing on each complete frame in the complete frame group;
  • Step S22 Input the visible frames after the feature alignment processing and the complete frames after the feature alignment processing into the target detection model for detection processing, to obtain the position coordinates and classification probability values of the visible frames after the feature alignment processing and the position coordinates and classification probability values of the complete frames after the feature alignment processing;
  • Step S23 Determine the target detection frame of each detection object based on the target position coordinates and the target classification probability value, wherein the target position coordinates include: the visible frame position coordinates after the feature alignment processing and/or the feature alignment processing After the position coordinates of the complete frame, the target classification probability value includes: the classification probability value of the visible frame after the feature alignment processing and/or the classification probability value of the complete frame after the feature alignment processing.
  • first, local feature alignment processing is performed on each visible frame in the visible frame group and each complete frame in the complete frame group.
  • the purpose of the local feature alignment processing is to adjust each visible frame in the visible frame group and each complete frame in the complete frame group to the same size.
  • the above-mentioned target detection model can be an R-CNN model.
  • After the local feature alignment processing, the visible frames and complete frames after the alignment processing can be used to determine the target detection frame of the corresponding detection object. The visible frames after the alignment processing and/or the complete frames after the alignment processing are used as the input of the target detection model (such as the R-CNN model). After detection, the visible frames or complete frames belonging to each detection object can be separately fused according to their target position coordinates and target classification probability values; the fused visible frame or the fused complete frame is the target detection frame of the corresponding detection object.
  • For a detection object that is not occluded, its target detection frame is its final complete frame, where the final complete frame is a detection frame obtained by fusing one or more complete frames. For a detection object that is occluded, its target detection frame consists of its final complete frame and its final visible frame, where the final visible frame is a detection frame obtained by fusing one or more visible frames; that is, the complete frames and the visible frames are fused respectively to obtain the final complete frame and the final visible frame.
  • Only the visible frames after the feature alignment processing may be used as the input of the target detection model, or only the complete frames after the feature alignment processing may be used; alternatively, both the visible frames and the complete frames after the feature alignment processing are used as the input of the target detection model, which is not specifically limited in this embodiment.
  • step S23 determining the target detection frame of each detection object based on the target position coordinates and the target classification probability value includes the following steps:
  • Step S231 Use the target classification probability value as the weight of the corresponding target position coordinate
  • Step S232 Calculate a weighted average of the target position coordinates of each detection object according to the target classification probability value to obtain the target detection frame of the detection object; the target detection frame includes the final visible frame and/or the final complete frame.
  • the target position coordinates of the visible frame indicate the corresponding position information of the visible frame in the image to be processed, and the target classification probability value of the visible frame indicates the evaluation of the detection processing result of the visible frame.
  • the target position coordinates of the complete frame indicate the corresponding position information of the complete frame in the image to be processed, and the target classification probability value of the complete frame indicates the evaluation of the detection processing result of the complete frame.
  • The higher the target classification probability value, the better the detection processing result of the visible frame or complete frame, so it should be given a higher weight. The target classification probability values can therefore be used as the weights to calculate a weighted average of the target position coordinates, from which the target detection frame of the object is obtained. The target detection frame obtained by this weighted averaging combines the detection processing evaluation results of every visible frame or complete frame, and its position is closer to the actual position of the detection object.
  • the target detection frame is an accurate visible frame or an accurate complete frame of the final detected object.
  • the precise visible frame is the smallest bounding frame that can accurately describe the maximum visible area of the occluded detection object.
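  • Steps S231 and S232 amount to a probability-weighted average of box coordinates. A minimal NumPy sketch, with illustrative values:

```python
import numpy as np

def fuse_boxes(boxes: np.ndarray, probs: np.ndarray) -> np.ndarray:
    """boxes: (K, 4) target position coordinates of one detection object's
    visible (or complete) frames; probs: (K,) target classification
    probability values used as the weights."""
    weights = probs / probs.sum()
    return (boxes * weights[:, None]).sum(axis=0)   # weighted-average frame

# e.g. two complete frames of the same object fused into the final frame:
boxes = np.array([[10.0, 20.0, 110.0, 220.0],
                  [12.0, 18.0, 108.0, 224.0]])
probs = np.array([0.9, 0.6])
final_complete_frame = fuse_boxes(boxes, probs)
```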
  • performing local feature alignment processing on each visible frame in the visible frame group after the de-duplication processing includes the following steps:
  • Step S31 selecting a first target feature map in the feature pyramid
  • Step S32 Perform feature cropping on the first target feature map in the feature pyramid based on each visible frame in the visible frame group after the deduplication processing to obtain a first cropping result; Step S33 Perform local feature alignment processing on the first cropping result.
  • The first target feature map refers to the feature map in the feature pyramid corresponding to the visible frames in the visible frame group. The feature pyramid contains feature maps of different scales, obtained by scaling the image to be processed in different proportions through the pyramid network. During cropping, a visible frame can be scaled according to the scaling ratio of the first target feature map relative to the image to be processed, the position of the scaled visible frame in the first target feature map is determined, and the feature and its position information at that position in the first target feature map are obtained as the first cropping result.
  • In a specific implementation, the ROI Align module in Mask R-CNN can be used to crop the features corresponding to the visible frames, and the R-CNN model can then be used to perform further local feature alignment processing on the first cropping result.
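  • As a concrete illustration, torchvision's roi_align operator (the ROI Align of Mask R-CNN) can perform this cropping and alignment; the feature-map size, the 1/8 scaling ratio, and the 7x7 output size below are assumptions for the example.

```python
import torch
from torchvision.ops import roi_align

# One level of the feature pyramid: (batch, channels, H, W); assumed here
# to be 1/8 the resolution of the image to be processed.
feature_map = torch.randn(1, 256, 100, 152)

# Visible frames in image coordinates, each prefixed with its batch index:
# [batch_idx, x1, y1, x2, y2].
visible_frames = torch.tensor([[0.0, 60.0, 20.0, 110.0, 220.0]])

# Crop each frame from the feature map and align it to one fixed size,
# which is the purpose of the local feature alignment step.
aligned = roi_align(feature_map, visible_frames, output_size=(7, 7),
                    spatial_scale=1.0 / 8, aligned=True)
print(aligned.shape)  # torch.Size([1, 256, 7, 7])
```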
  • performing local feature alignment processing on each complete frame in the complete frame group includes the following steps:
  • Step S41 selecting a second target feature map in the feature pyramid
  • Step S42 performing feature cropping on the second target feature map in the feature pyramid based on each complete frame in the complete frame group to obtain a second cropping result
  • Step S43 Perform local feature alignment processing on the second cropping result.
  • The second target feature map refers to the feature map in the feature pyramid corresponding to the complete frames in the complete frame group. The feature pyramid contains feature maps of different scales, obtained by scaling the image to be processed in different proportions. During cropping, a complete frame is scaled according to the scaling ratio of the second target feature map relative to the image to be processed, the position of the scaled complete frame in the second target feature map is determined, and the feature and its position information at that position in the second target feature map are obtained as the second cropping result.
  • In a specific implementation, the ROI Align module in Mask R-CNN can be used to crop the features corresponding to the complete frames, and the R-CNN model can then be used to perform further local feature alignment processing on the second cropping result.
  • In summary, the method provided by the embodiments of this application can distinguish and recognize detection objects well. The visible frame and the complete frame are both used as regression targets in the RPN stage, and a hidden variable (the embedding value) is regressed to distinguish instances, so that not only the preselection boxes of objects of different categories but also the preselection boxes of different detection objects are distinguished. R-CNN is then used to regress the deduplicated results, and the regression results of each detection object are fused to obtain the final detection result, thereby realizing the recognition of occluded objects under dense occlusion and avoiding missed detection of occluded objects.
  • the embodiment of the present application also provides an object detection device, which is mainly configured to implement the object detection method provided in the above-mentioned content of the embodiment of the present application.
  • the object detection device provided by the embodiment of the present application will be specifically introduced below.
  • Fig. 5 is a schematic diagram of an object detection device according to an embodiment of the present application.
  • As shown in FIG. 5, the object detection device mainly includes an image acquisition unit 10, a preselection box acquisition unit 20, a grouping unit 30, a deduplication unit 40, and a determining unit 50, where:
  • the image acquisition unit 10 is configured to acquire a to-be-processed image containing one or more detection objects;
  • the pre-selected frame obtaining unit 20 is configured to perform object detection on the image to be processed to obtain at least one pre-selected frame, wherein the pre-selected frame includes a visible frame and/or a complete frame, and the complete frame is an overall detection object A bounding frame of, where the visible frame is a bounding frame of the visible area of each detection object in the image to be processed;
  • the grouping unit 30 is configured to determine the group to which each preselection box in the at least one preselection box belongs through the association modeling model to obtain at least one preselection box group; the preselection boxes in the same preselection box group belong to the same detection object;
  • the deduplication unit 40 is configured to perform deduplication processing on each preselection box group to obtain the preselection box group after the deduplication processing;
  • the determining unit 50 is configured to determine the target detection frame of each detection object based on the preselected frame group after the deduplication processing.
  • In the embodiment of the present application, the image to be processed containing one or more detection objects is first acquired, and then object detection is performed on the image to be processed to obtain at least one preselection box. Next, the group to which each preselection box in the at least one preselection box belongs is determined to obtain at least one preselection box group; redundant preselection boxes are removed by deduplicating each preselection box group to obtain the preselection box groups after deduplication, so that the target detection frame of each detection object is determined based on the preselection box groups after the deduplication processing, thereby realizing the detection of one or more detection objects in the image to be processed and effectively avoiding missed detection of the detection objects.
  • each of the preselection box groups includes a visible frame group and a complete frame group; the deduplication unit 40 is further configured to perform deduplication processing on the visible frame group in the at least one preselection box group to obtain the visible frame group after the deduplication processing; determining the target detection frame of each detection object based on the preselection box groups after the deduplication processing includes: determining the target detection frame of each detection object based on the visible frame group after the deduplication processing and the complete frame group.
  • the preselection frame acquisition unit 20 is further configured to: input the image to be processed into a feature pyramid network for processing to obtain a feature pyramid; and use the region proposal network RPN model to process the feature pyramid to obtain the at least one preselection box, wherein each preselection box in the at least one preselection box carries an attribute label configured to determine the type of each preselection box, and the type includes a complete frame and a visible frame.
  • the grouping unit 30 determining, through the relevance modeling model, the group to which each preselection box in the at least one preselection box belongs to obtain at least one preselection box group includes: obtaining the attribute feature vector of each preselection box in the at least one preselection box through the instance attribute feature projection network of the relevance modeling model; and determining, through the clustering module of the relevance modeling model, the group to which each preselection box in the at least one preselection box belongs based on the attribute feature vector of each preselection box, to obtain the at least one preselection box group.
  • the instance attribute feature projection network is obtained through training with the Lpull loss function and the Lpush loss function, wherein the Lpull loss function pulls together the attribute feature vectors of preselection boxes belonging to the same detection object, and the Lpush loss function pushes apart the attribute feature vectors of preselection boxes belonging to different detection objects.
  • the grouping unit 30 calculates, through the clustering module of the relevance modeling model, the vector distance value between any two of the attribute feature vectors to obtain multiple vector distance values; any two preselection boxes whose vector distance value is smaller than a preset threshold are added to the same group, and every other preselection box not added to a group is separately regarded as a group; the resulting groups are then clustered by a clustering algorithm to obtain the at least one preselection box group.
  • the deduplication unit 40 is further configured to perform deduplication processing on the visible frame group in the at least one preselection box group by using a non-maximum suppression algorithm, to obtain the visible frame group after the deduplication processing.
  • the determining unit 50 is further configured to: perform local feature alignment processing on each visible frame in the visible frame group after the deduplication processing, and perform local feature alignment processing on each complete frame in the complete frame group; input the visible frames after the feature alignment processing and the complete frames after the feature alignment processing into the target detection model for detection processing, to obtain the position coordinates and classification probability values of the visible frames after the feature alignment processing and the position coordinates and classification probability values of the complete frames after the feature alignment processing; and determine the target detection frame of each detection object based on target position coordinates and target classification probability values, wherein the target position coordinates include the position coordinates of the visible frames after the feature alignment processing and/or the position coordinates of the complete frames after the feature alignment processing, and the target classification probability values include the classification probability values of the visible frames after the feature alignment processing and/or the classification probability values of the complete frames after the feature alignment processing.
  • the determining unit 50 is further configured to: use the target classification probability value as the weight of the corresponding target position coordinates, and calculate a weighted average of the target position coordinates of each detection object according to the target classification probability values to obtain the target detection frame of the detection object (a minimal sketch of this weighted fusion is given after this list); the target detection frame includes the final visible frame and/or the final complete frame.
  • the feature pyramid includes a plurality of feature maps, and the determining unit 50 is further configured to: select a first target feature map in the feature pyramid; perform feature cropping on the first target feature map in the feature pyramid based on each visible frame in the visible frame group after the deduplication processing to obtain a first cropping result; and perform local feature alignment processing on the first cropping result.
  • the feature pyramid includes a plurality of feature maps, and the determining unit 50 being further configured to perform local feature alignment processing on each complete frame in the complete frame group includes: selecting a second target feature map in the feature pyramid; performing feature cropping on the second target feature map in the feature pyramid based on each complete frame in the complete frame group to obtain a second cropping result; and performing local feature alignment processing on the second cropping result.
  • the terms "installed", "connected", and "coupled" should be interpreted broadly; for example, a connection may be fixed, detachable, or integral; it may be a mechanical connection or an electrical connection; and it may be a direct connection, an indirect connection through an intermediate medium, or internal communication between two components.
  • the computer program product of the object detection method includes a computer-readable storage medium storing non-volatile program code executable by a processor, and the instructions included in the program code can be configured to execute the method described in the foregoing method embodiments.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some communication interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the function is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a non-volatile computer-readable storage medium executable by a processor; the stored software product can then instruct a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include: a USB flash disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.
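To make the weighted merging performed by the determining unit 50 concrete, the following is a minimal sketch of the fusion step, assuming each box is an (x1, y1, x2, y2) row carrying the classification probability value output by the target detection model; the function and variable names are illustrative, not part of the original filing.

```python
import numpy as np

def fuse_boxes(boxes: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Fuse the regressed boxes of one detection object into its target
    detection frame: each classification probability value is used as the
    weight of the corresponding position coordinates, and the weighted
    average of the coordinates is returned.

    boxes:  (K, 4) array of [x1, y1, x2, y2] rows for one object.
    scores: (K,) array of classification probability values.
    """
    weights = scores / scores.sum()                 # normalize the weights
    return (boxes * weights[:, None]).sum(axis=0)   # weighted average per coordinate

# e.g. the final visible frame of one detection object:
# final_visible = fuse_boxes(visible_boxes, visible_scores)
```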

Abstract

The present application relates to the technical field of image recognition and provided thereby are an object detection method and apparatus, and an electronic device. The method comprises: obtaining an image to be processed that comprises one or more detection objects; performing object detection on the image to be processed to obtain at least one pre-selection box, the pre-selection box comprising a visible box and/or a complete box, the complete box being a bounding box for an entire detection object, and the visible box being a bounding box of each detection object in a visible region in the image to be processed; by means of a correlation modeling model, determining a group to which each pre-selection box among the at least one pre-selection box belongs so as to obtain at least one pre-selection box group, pre-selection boxes in the same pre-selection box group belonging to the same detection object; performing de-duplication processing on each pre-selection box group to obtain a pre-selection box group after de-duplication processing; and determining a target detection box of each detection object on the basis of the pre-selection box group after de-duplication processing. In the present application, the missed detection of a detection object can be effectively avoided.

Description

Object detection method, device and electronic equipment
Cross-reference to related applications
This application claims the priority of the Chinese patent application filed with the Chinese Patent Office on March 12, 2019, with application number CN201910186133.5 and titled "Object detection method, device and electronic equipment", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the technical field of image processing, and in particular to an object detection method, device and electronic equipment.
Background
Object detection is one of the classic problems in computer vision: its task is to mark the position of each object in an image with a bounding box and to give the object category. From the traditional framework of hand-designed features plus a shallow classifier to end-to-end detection frameworks based on deep learning, object detection has become increasingly mature. At present, when multiple objects, especially objects of the same category, appear densely and occlude one another, existing object detection algorithms only consider object detection at the category level, so the prior art cannot perform accurate object detection under occlusion. When objects occlude each other, prior-art methods often fail to effectively distinguish the occluded object from the occluding object, resulting in missed detection of the occluded object.
Summary of the invention
In view of this, the purpose of this application is to provide an object detection method, device, and electronic equipment, so as to alleviate the technical problem in the prior art that objects of the same category are prone to missed detection when object detection is performed under dense occlusion.
In a first aspect, an embodiment of the present application provides an object detection method, including: acquiring an image to be processed containing one or more detection objects; performing object detection on the image to be processed to obtain at least one preselection box, where the preselection box includes a visible frame and/or a complete frame, the complete frame is a bounding frame of a detection object as a whole, and the visible frame is a bounding frame of the visible area of each detection object in the image to be processed; determining, through an association modeling model, the group to which each preselection box in the at least one preselection box belongs, to obtain at least one preselection box group, where preselection boxes in the same preselection box group belong to the same detection object; performing deduplication processing on each preselection box group to obtain the preselection box groups after the deduplication processing; and determining the target detection frame of each detection object based on the preselection box groups after the deduplication processing.
Further, determining, through the association modeling model, the group to which each preselection box in the at least one preselection box belongs to obtain at least one preselection box group includes: obtaining the attribute feature vector of each preselection box in the at least one preselection box through the instance attribute feature projection network of the association modeling model; and determining, through the clustering module of the association modeling model, the group to which each preselection box in the at least one preselection box belongs based on the attribute feature vector of each preselection box, to obtain the at least one preselection box group.
Further, the instance attribute feature projection network is obtained through training with an Lpull loss function and an Lpush loss function, where the Lpull loss function pulls together the attribute feature vectors of preselection boxes belonging to the same detection object, and the Lpush loss function pushes apart the attribute feature vectors of preselection boxes belonging to different detection objects.
Further, determining, through the clustering module of the association modeling model, the group to which each preselection box in the at least one preselection box belongs based on the attribute feature vector of each preselection box to obtain the at least one preselection box group includes: calculating the vector distance value between any two of the attribute feature vectors to obtain multiple vector distance values; adding any two preselection boxes whose vector distance value is smaller than a preset threshold to the same group, with every other preselection box not added to a group separately regarded as a group; and clustering the resulting groups by a clustering algorithm to obtain the at least one preselection box group.
Further, each of the preselection box groups includes a visible frame group and a complete frame group; performing deduplication processing on each preselection box group to obtain the preselection box groups after the deduplication processing includes: performing deduplication processing on the visible frame group in the at least one preselection box group to obtain the visible frame group after the deduplication processing; and determining the target detection frame of each detection object based on the preselection box groups after the deduplication processing includes: determining the target detection frame of each detection object based on the visible frame group after the deduplication processing and the complete frame group.
Further, performing deduplication processing on the visible frame group in the at least one preselection box group to obtain the visible frame group after the deduplication processing includes: performing deduplication processing on the visible frame group in the at least one preselection box group by using a non-maximum suppression algorithm to obtain the visible frame group after the deduplication processing.
Further, determining the target detection frame of each detection object based on the visible frame group after the deduplication processing and the complete frame group includes: performing local feature alignment processing on each visible frame in the visible frame group after the deduplication processing, and performing local feature alignment processing on each complete frame in the complete frame group; inputting the visible frames after the feature alignment processing and the complete frames after the feature alignment processing into a target detection model for detection processing, to obtain the position coordinates and classification probability values of the visible frames after the feature alignment processing and the position coordinates and classification probability values of the complete frames after the feature alignment processing; and determining the target detection frame of each detection object based on target position coordinates and target classification probability values, where the target position coordinates include the position coordinates of the visible frames after the feature alignment processing and/or the position coordinates of the complete frames after the feature alignment processing, and the target classification probability values include the classification probability values of the visible frames after the feature alignment processing and/or the classification probability values of the complete frames after the feature alignment processing.
Further, determining the target detection frame of each detection object based on the target position coordinates and the target classification probability values includes: using the target classification probability value as the weight of the corresponding target position coordinates; and calculating a weighted average of the target position coordinates of each detection object according to the target classification probability values to obtain the target detection frame of the detection object, where the target detection frame includes a target visible frame and/or a target complete frame.
Further, performing object detection on the image to be processed to obtain at least one preselection box includes: inputting the image to be processed into a feature pyramid network for processing to obtain a feature pyramid; and processing the feature pyramid by using a region proposal network (RPN) model to obtain the at least one preselection box, where each preselection box in the at least one preselection box carries an attribute label, the attribute label is configured to determine the type of each preselection box, and the type includes a complete frame and a visible frame.
In a second aspect, an embodiment of the present application also provides an object detection device, including: an image acquisition unit configured to acquire an image to be processed containing one or more detection objects; a preselection box acquisition unit configured to perform object detection on the image to be processed to obtain at least one preselection box, where the preselection box includes a visible frame and/or a complete frame, the complete frame is a bounding frame of a detection object as a whole, and the visible frame is a bounding frame of the visible area of each detection object in the image to be processed; a grouping unit configured to determine, through an association modeling model, the group to which each preselection box in the at least one preselection box belongs, to obtain at least one preselection box group, where preselection boxes in the same preselection box group belong to the same detection object; a deduplication unit configured to perform deduplication processing on each preselection box group to obtain the preselection box groups after the deduplication processing; and a determining unit configured to determine the target detection frame of each detection object based on the preselection box groups after the deduplication processing.
In a third aspect, an embodiment of the present application also provides an electronic device, including a memory and a processor, where the memory stores a computer program that can run on the processor, and the processor implements the steps of the above method when executing the computer program.
In a fourth aspect, an embodiment of the present application also provides a computer-readable medium having non-volatile program code executable by a processor, where the program code causes the processor to execute the above method.
In the embodiment of the present application, an image to be processed containing one or more detection objects is first acquired; object detection is then performed on the image to be processed to obtain at least one preselection box; next, the group to which each preselection box in the at least one preselection box belongs is determined to obtain at least one preselection box group. Since preselection boxes in the same preselection box group belong to the same detection object, preselection boxes belonging to different detection objects are separated by the preselection box groups, which prevents the preselection box of an occluded object from being removed as a redundant preselection box of the occluding object during deduplication. This alleviates the technical problem in the prior art that objects of the same category are prone to missed detection when object detection is performed under dense occlusion, realizes the detection of one or more detection objects in the image to be processed, and effectively avoids missed detection of the detection objects.
At the same time, the at least one preselection box group is determined through an association modeling model implemented by a neural network. After the at least one preselection box is input into the association modeling model, the feature information of the image inside each preselection box and the position information of the preselection box are fully utilized to group the preselection boxes, so that preselection boxes of different detection objects can be effectively distinguished. In particular, in densely occluded scenes where the complete frames of the occluding object and the occluded object overlap heavily, preselection boxes that are adjacent in position and similar in size but belong to different detection objects can still be grouped accurately.
Other features and advantages of the present disclosure will be described in the following specification; alternatively, some features and advantages can be inferred from the specification or determined without doubt, or can be learned by implementing the above-mentioned technology of the present disclosure.
To make the above objects, features, and advantages of the present disclosure more comprehensible, preferred embodiments are described in detail below in conjunction with the accompanying drawings.
Description of the drawings
In order to explain the specific embodiments of this application or the technical solutions in the prior art more clearly, the drawings needed in the description of the specific embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the application, and for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Fig. 1 is a schematic diagram of an electronic device according to an embodiment of the present application;
Fig. 2 is a flowchart of an object detection method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of visible frames and complete frames of densely occluded objects of the same category according to an embodiment of the present application;
Fig. 4 is a schematic diagram of the correspondence between preselection boxes and detection objects according to an embodiment of the present application;
Fig. 5 is a schematic diagram of an object detection device according to an embodiment of the present application.
Detailed description
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are part of the embodiments of the present application rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
Embodiment 1:
First, an example electronic device 100 configured to implement the object detection method of an embodiment of the present application is described with reference to Fig. 1.
As shown in Fig. 1, the electronic device 100 includes one or more processors 102, one or more storage devices 104, an input device 106, an output device 108, and a camera 110; these components are interconnected through a bus system 112 and/or other forms of connection mechanisms (not shown). It should be noted that the components and structure of the electronic device 100 shown in Fig. 1 are only exemplary rather than restrictive, and the electronic device may have other components and structures as required.
The processor 102 may be implemented in at least one hardware form among a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic array (PLA), and an application-specific integrated circuit (ASIC). The processor 102 may be a central processing unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control other components in the electronic device 100 to perform desired functions.
The storage device 104 may include one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disks, and flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to implement the client functions (implemented by the processor) in the embodiments of the present application described below and/or other desired functions. Various application programs and various data, such as data used and/or generated by the application programs, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, and a touch screen.
The output device 108 may output various information (for example, images or sounds) to the outside (for example, a user), and may include one or more of a display, a speaker, and the like.
The camera 110 is configured to acquire an image to be processed; the image acquired by the camera is processed by the object detection method to obtain the target detection frames of the detection objects. For example, the camera can capture an image desired by the user (such as a photo or a video), and the image is then processed by the object detection method to obtain the target detection frames of the detection objects; the camera may also store the captured image in the storage device 104 for use by other components.
Exemplarily, the example electronic device configured to implement the object detection method according to the embodiment of the present application may be implemented as a mobile terminal such as a smart phone or a tablet computer.
Embodiment 2:
According to the embodiments of the present application, an embodiment of an object detection method is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one here.
Fig. 2 is a flowchart of an object detection method according to an embodiment of the present application. As shown in Fig. 2, the method includes the following steps:
Step S202: acquire an image to be processed containing one or more detection objects.
In the embodiment of the present application, the image to be processed may include multiple categories of detection objects, for example, humans and non-humans, where non-humans include dynamic objects and static objects: a dynamic object may be an animal, and a static object may be any object at rest other than humans and animals.
Each image to be processed may contain multiple categories of objects, and there may be one or more objects of each category; for example, an image may contain 2 people and 3 dogs. The objects in the image to be processed may be displayed independently of one another, or some of them may be occluded by other objects and therefore not fully displayed.
It should be noted that the detection objects may be objects of one or more categories in the image to be processed on which the object detection steps are to be performed. The user can determine the categories of the detection objects according to actual needs, which is not specifically limited in this embodiment.
It should be further noted that, in this embodiment, the image to be processed may be an image captured by the camera of the electronic device in Embodiment 1, or an image pre-stored in the storage device of the electronic device, which is not specifically limited in this embodiment.
Step S204: perform object detection on the image to be processed to obtain at least one preselection box, where the preselection box includes a visible frame and/or a complete frame, the complete frame is a bounding frame of a detection object as a whole, and the visible frame is a bounding frame of the visible area of each detection object in the image to be processed.
In the embodiment of the present application, after the image to be processed is acquired, object detection can be performed on it through a preselection box detection network. Object detection on the image to be processed may consist of detecting the unoccluded detection objects and outputting complete frames; it may also consist of detecting the occluded objects in the image and outputting both complete frames and visible frames.
Multiple visible frames or multiple complete frames may be generated for the same detection object, and different visible frames or different complete frames may be scaled by different ratios relative to the image to be processed.
Step S206: determine, through the association modeling model, the group to which each preselection box in the at least one preselection box belongs, to obtain at least one preselection box group; preselection boxes in the same preselection box group belong to the same detection object.
In the embodiment of the present application, after object detection, multiple preselection boxes are generated for the different detection objects, where the preselection boxes include visible frames and/or complete frames. Usually, the preselection boxes contained in the detection result are redundant and require deduplication. To prevent the preselection box of an occluded object from being removed as a redundant preselection box of the occluding object during deduplication, the group to which each preselection box belongs needs to be determined; one preselection box group is obtained per group, so at least one preselection box group can be obtained, and preselection boxes belonging to different detection objects are thereby separated. The association modeling model is a model capable of obtaining the association relationships of the input data and can be implemented by a neural network; after the at least one preselection box is input into the association modeling model, the model effectively groups the preselection boxes based on the feature information of the image inside each preselection box combined with the position information of the preselection box.
By grouping the at least one preselection box in the above manner, preselection boxes belonging to the same detection object can be combined into one preselection box group; since the preselection box group of one detection object may include both visible frames and complete frames, the preselection box group of that detection object may also include a visible frame group and a complete frame group at the same time.
It should be noted that Fig. 4 is a schematic diagram of the correspondence between preselection boxes and detection objects. In the figure, the detection objects include an occluding object P and an occluded object Q occluded by P, and the preselection boxes include box 7 to box 12. Box 7, box 8, and box 9 all belong to the occluding object P, while box 10, box 11, and box 12 all belong to the occluded object Q. Box 7, box 8, and box 9 form one preselection box group, and box 10, box 11, and box 12 form another preselection box group.
After the group to which each of boxes 7 to 12 belongs is determined and the preselection box groups are obtained, deduplication can be performed separately on the preselection boxes in each group. This prevents confusion between the frames of different objects when the preselection boxes of different objects overlap heavily, and prevents a preselection box of the occluded object Q (for example, box 10) from being removed as a redundant preselection box of the occluding object P during deduplication, which greatly reduces the probability of missed detection of occluded objects.
Step S208: perform deduplication processing on each preselection box group to obtain the preselection box groups after the deduplication processing.
In the embodiment of the present application, after the object to which each preselection box belongs is determined, deduplication is performed separately on the preselection box group of each detection object. Deduplicating group by group prevents the preselection boxes of different detection objects from being confused with one another; specifically, it prevents the preselection box of an occluded object from being removed as a redundant preselection box of the occluding object during deduplication, thereby avoiding missed detection of occluded objects.
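Since the embodiment names non-maximum suppression as one concrete deduplication algorithm, the per-group deduplication can be sketched as follows; the data layout and names are assumptions for illustration:

```python
import torch
from torchvision.ops import nms  # standard IoU-based non-maximum suppression

def deduplicate_groups(groups, iou_threshold: float = 0.5):
    """Run NMS independently inside each preselection box group, so a box of an
    occluded object can never be suppressed by a box of the occluding object.

    groups: list of (boxes, scores) pairs, one pair per detection object, with
            boxes as float tensors of shape (K, 4) and scores of shape (K,).
    """
    kept = []
    for boxes, scores in groups:
        keep = nms(boxes, scores, iou_threshold)  # indices of retained boxes
        kept.append((boxes[keep], scores[keep]))
    return kept
```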
Step S210: determine the target detection frame of each detection object based on the preselection box groups after the deduplication processing.
In the embodiment of the present application, after the preselection box groups after the deduplication processing are obtained, the target detection frame of each detection object can be determined based on them. If a detection object is not occluded in the image to be processed, its target detection frame includes a target complete frame; if a detection object is occluded, its target detection frame includes both a target complete frame and a target visible frame. The target complete frame can be configured to obtain the position information of the detection object, as well as the image feature information of detection objects other than the occluded object; the target visible frame can be configured to obtain the image feature information of the occluded object. Since the embodiment of the present application can obtain these two types of target detection frames, more comprehensive and accurate information about the detection objects can be obtained for subsequent image processing such as recognition and verification.
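Putting steps S202 to S210 together, the overall flow could be sketched as follows. Everything here is a hypothetical outline: backbone, rpn, embed_net, and rcnn_head stand in for the trained networks of this embodiment, and group_by_embedding, deduplicate_groups, and fuse_boxes refer to the per-step sketches given elsewhere in this description.

```python
def detect(image, backbone, rpn, embed_net, rcnn_head,
           dist_threshold: float = 0.5, iou_threshold: float = 0.5):
    """Hypothetical glue code for steps S202 to S210 of Fig. 2."""
    pyramid = backbone(image)                # S204: build the feature pyramid
    boxes, labels, scores = rpn(pyramid)     # preselection boxes with visible/complete labels
    embeddings = embed_net(pyramid, boxes)   # S206: one embedding vector per box
    groups = group_by_embedding(embeddings, dist_threshold)
    deduped = deduplicate_groups(            # S208: deduplicate inside each group
        [(boxes[g], scores[g]) for g in groups], iou_threshold)
    # S210: regress the kept boxes with the R-CNN head, then fuse each object's
    # boxes into its final target detection frame (see fuse_boxes).
    return [rcnn_head(pyramid, kept_boxes) for kept_boxes, _ in deduped]
```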
In the embodiment of the present application, the above steps S202 to S210 may be executed by the processor of the electronic device in Embodiment 1.
It should be noted that any processor capable of executing the above steps S202 to S210 can be applied in the embodiments of the present application, and this is not specifically limited.
In the embodiment of the present application, an image to be processed containing one or more detection objects is first acquired; object detection is then performed on the image to be processed to obtain at least one preselection box; next, the group to which each preselection box in the at least one preselection box belongs is determined to obtain at least one preselection box group. Since preselection boxes in the same preselection box group belong to the same detection object, preselection boxes belonging to different detection objects are separated by the preselection box groups, which prevents the preselection box of an occluded object from being removed as a redundant preselection box of the occluding object during deduplication. This alleviates the technical problem in the prior art that objects of the same category are prone to missed detection under dense occlusion, realizes the detection of one or more detection objects in the image to be processed, and effectively avoids missed detection of the detection objects.
In addition, in densely occluded scenes, the complete frames of the occluding object and the occluded object overlap heavily, so the complete frames of different detection objects cannot be effectively distinguished by position and size information alone: the grouping effect is poor, and the complete frames cannot be effectively deduplicated. In the embodiment of the present application, the association modeling model is implemented by a neural network; after the at least one preselection box is input into the model, the feature information of the image inside each preselection box and the position information of the preselection box are effectively used to group the preselection boxes, so that preselection boxes of different detection objects can be effectively distinguished. In particular, in densely occluded scenes where the complete frames of the occluding and occluded objects overlap heavily, preselection boxes that are adjacent in position and similar in size but belong to different detection objects can still be grouped accurately.
The embodiments of the present application are described in detail below in conjunction with specific implementations.
As can be seen from the above description, in this embodiment, the image to be processed containing one or more detection objects is first acquired; object detection can then be performed on the image to be processed to obtain at least one preselection box.
In an optional implementation, step S204, performing object detection on the image to be processed to obtain at least one preselection box, includes the following steps:
Step S2041: input the image to be processed into a feature pyramid network for processing to obtain a feature pyramid;
Step S2042: process the feature pyramid by using a region proposal network (RPN, Region Proposal Networks) model to obtain the at least one preselection box, where each preselection box in the at least one preselection box carries an attribute label, the attribute label is configured to determine the type of each preselection box, and the type includes a complete frame and a visible frame.
As can be seen from the above description, in the embodiment of the present application, the feature pyramid network is configured to generate the feature pyramid. A basic network model such as the VGG (Visual Geometry Group) 16 model, ResNet, or FPN (Feature Pyramid Networks) can be selected as the feature pyramid network. In this embodiment, the image to be processed can be input into the feature pyramid network for processing to obtain the feature pyramid.
Before the feature pyramid is processed by the region proposal network RPN model, the RPN model needs to be trained on a preset training set; in this embodiment, the basic network model (for example, FPN) can be trained together with the RPN model. The preset training set includes multiple training samples, and each training sample includes a training image and its corresponding image label, where the image label is configured to mark the type of each preselection box in the training image, the type being a complete frame or a visible frame. In this application, multiple training samples can be used to train the RPN model so that it can recognize and mark the preselection box types in an image.
After the basic network model and the region proposal network RPN model are trained with the above preset training set, the trained RPN model can be used to process the feature pyramid to obtain the at least one preselection box and the attribute label of each preselection box, the attribute label being configured to indicate whether the preselection box is a visible frame or a complete frame.
Specifically, the attribute label may be expressed as "1" or "2"; for example, "1" indicates that the preselection box is a visible frame, and "2" indicates that it is a complete frame. Besides "1" and "2", other machine-recognizable data can also be selected as the attribute label, which is not specifically limited in this embodiment.
In this embodiment, processing the feature pyramid through the region proposal network RPN model yields more accurate preselection box detection results.
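For concreteness, the two ingredients of step S204 could be assembled as below; the channel sizes and the Proposal record are illustrative assumptions, and torchvision's FeaturePyramidNetwork is used here only as one readily available FPN implementation, not necessarily the one in the original filing:

```python
import torch
from collections import OrderedDict
from dataclasses import dataclass
from torchvision.ops import FeaturePyramidNetwork

# Build a feature pyramid from backbone feature maps (channel sizes assumed
# to follow a ResNet-style backbone).
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)
backbone_features = OrderedDict([
    ("c2", torch.randn(1, 256, 200, 304)),
    ("c3", torch.randn(1, 512, 100, 152)),
    ("c4", torch.randn(1, 1024, 50, 76)),
    ("c5", torch.randn(1, 2048, 25, 38)),
])
pyramid = fpn(backbone_features)  # OrderedDict: one feature map per pyramid level

VISIBLE, COMPLETE = 1, 2  # the "1"/"2" attribute labels from the example above

@dataclass
class Proposal:
    box: tuple    # (x1, y1, x2, y2) in image coordinates
    label: int    # VISIBLE or COMPLETE
    score: float  # RPN objectness score

# A trained RPN head would turn `pyramid` into a list of such labeled proposals.
```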
After more accurate preselection box detection results are obtained, the group to which each preselection box in the at least one preselection box belongs can be determined to obtain at least one preselection box group.
In an optional implementation, step S206, determining, through the association modeling model, the group to which each preselection box in the at least one preselection box belongs to obtain at least one preselection box group, includes the following steps:
Step S11: obtain the attribute feature vector of each preselection box in the at least one preselection box through the instance attribute feature projection network of the association modeling model;
Step S12: determine, through the clustering module of the association modeling model, the group to which each preselection box in the at least one preselection box belongs based on the attribute feature vector of each preselection box, to obtain the at least one preselection box group.
In the embodiment of the present application, the association modeling model may be an associative embedding (Associate embedding) model, and the instance attribute feature projection network in the association modeling model may be an embedding encoding network. The at least one preselection box is input into the embedding encoding network of the association modeling model, which regresses a corresponding attribute feature vector for each preselection box, one attribute feature vector per preselection box. The clustering module then assigns the preselection boxes of the same detection object to the same group according to the attribute feature vectors, with different groups corresponding to different detection objects.
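As an illustrative sketch of such an instance attribute feature projection network (the layer sizes, pooling resolution, and embedding dimension N are assumptions, not the exact architecture of this embodiment), the head can crop each preselection box from a feature map and project it to an N-dimensional embedding value:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class EmbeddingHead(nn.Module):
    """Crops each preselection box from a feature map with RoI Align and
    projects the crop to an N-dimensional attribute feature vector."""

    def __init__(self, in_channels: int = 256, embed_dim: int = 8, pool: int = 7):
        super().__init__()
        self.pool = pool
        self.project = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * pool * pool, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, embed_dim),  # one vector [a1, ..., aN] per box
        )

    def forward(self, feature_map, boxes_per_image, spatial_scale):
        # boxes_per_image: list with one (K, 4) tensor of (x1, y1, x2, y2)
        # boxes in image coordinates per image in the batch.
        crops = roi_align(feature_map, boxes_per_image, output_size=self.pool,
                          spatial_scale=spatial_scale, aligned=True)
        return self.project(crops)  # (total_boxes, embed_dim)
```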
Before the associative embedding model is used to determine the group to which each preselection box belongs, the embedding encoding network in the associative embedding model also needs to be trained, so as to determine what kind of attribute feature vector the network outputs. During training, the constraint on the attribute feature vectors is the distance between them, which may be a Euclidean distance, a cosine distance, or the like. A first constraint pulls together the attribute feature vectors of preselection boxes belonging to the same detection object, so that such boxes can be added to the same group via their attribute feature vectors; a second constraint pushes apart the attribute feature vectors of preselection boxes belonging to different detection objects, so that such boxes are added to different groups. Specifically, the first constraint may be the Lpull loss function and the second constraint may be the Lpush loss function. The embedding encoding network may first be trained with the Lpull loss function to shorten distances and then with the Lpush loss function to enlarge distances; alternatively, the Lpull and Lpush loss functions may be used simultaneously to train the embedding encoding network.
It should be noted that the above Lpull loss function takes a form such as the following (the formula images of the original filing are reconstructed here from the variable definitions given alongside them, following the usual associative-embedding formulation):

$$L_{pull}=\frac{1}{M}\sum_{m}\frac{1}{C_{m}}\sum_{e_{k},\,e_{j}\in m}\left\|e_{k}-e_{j}\right\|^{2}$$

where M is the number of attribute feature vectors, e_k and e_j both denote arbitrary attribute feature vectors, and C_m denotes the number of attribute feature vectors corresponding to the corresponding detection object. The above Lpush loss function takes a form such as:

$$L_{push}=\frac{1}{M}\sum_{m\neq m'}\ \sum_{e_{k}\in m,\,e_{j}\in m'}\max\left(0,\ \Delta-\left\|e_{k}-e_{j}\right\|\right)$$

where M is the number of attribute feature vectors, e_k and e_j both denote arbitrary attribute feature vectors, and Δ denotes a preset distance value.
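Under the same hedged reading of the two losses, a pairwise implementation could look like the following sketch (the margin value and grouping of embeddings per object are assumptions):

```python
import torch

def pull_push_losses(groups, delta: float = 1.0):
    """Pairwise sketch of the Lpull/Lpush objectives described above: Lpull
    shortens the distance between embedding vectors of the same detection
    object, and Lpush enlarges the distance between embedding vectors of
    different detection objects up to the margin `delta`.

    groups: list of (C_m, N) tensors, one tensor per detection object.
    """
    # Lpull: mean squared pairwise distance within each object's group.
    l_pull = sum(torch.cdist(e, e).pow(2).mean() for e in groups) / len(groups)

    # Lpush: hinge on pairwise distances across different objects' groups.
    l_push = groups[0].new_zeros(())
    pairs = 0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            d = torch.cdist(groups[i], groups[j])  # cross-object distances
            l_push = l_push + torch.relu(delta - d).mean()
            pairs += 1
    return l_pull, (l_push / pairs if pairs else l_push)
```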
embedding encoding网络训练完成,并在通过区域候选网络RPN模型得到预选框之后,使用embedding encoding网络获得各个预选框的属性特征向量,即得到embedding value(嵌入值)。embedding value可以为N维向量,对每个预选框得到一个N维向量,该N维向量可以表示为:[a 1,a 2,…,a N]。 The embedding encoding network training is completed, and after the preselection box is obtained through the regional candidate network RPN model, the embedding encoding network is used to obtain the attribute feature vector of each preselection box, that is, the embedding value (embedding value) is obtained. The embedding value can be an N-dimensional vector, and an N-dimensional vector is obtained for each preselection box. The N-dimensional vector can be expressed as: [a 1 , a 2 ,..., a N ].
In the embodiments of the present application, the purpose of obtaining the attribute feature vectors is to distinguish different object instances (that is, detection objects) within the preselection boxes. The feature vector therefore needs instance-level discrimination ability, the ability to distinguish each individual detection object, rather than merely category-level discrimination ability (distinguishing the types of detection objects). This places certain requirements on the choice of the feature extraction network, and the attribute feature vector (embedding value) obtained by the instance attribute feature projection network provides good instance-level discrimination.
In addition, the attribute feature vectors are generated by the Associate Embedding model, which models the grouping relationships, and are optimized directly according to the actual association relationships among the preselection boxes, that is, directly for the preselection box grouping task; a more direct and substantial performance improvement can therefore be obtained.
Further, the instance attribute feature projection network is implemented as a neural network and can be fused with the detection networks that produce the preselection boxes (for example, the feature pyramid network and the region proposal network RPN); the two share the basic features of the network, which reduces the amount of computation. Moreover, during training of the preselection box detection network, the instance attribute feature projection network can be combined with it directly, so that the two are trained jointly as one overall network without introducing additional external information, which keeps the training process relatively simple.
Further, after the above N-dimensional vectors are obtained, whether two different preselection boxes belong to the same group, that is, whether they belong to the same detection object, can be judged by comparing the Euclidean distance between their N-dimensional vectors.
The magnitude of the Euclidean distance between two N-dimensional vectors can be judged against a preset threshold. For example, for a preset threshold x, if the Euclidean distance between the N-dimensional vectors of two different preselection boxes is less than x, the distance between the two preselection boxes is considered small and they are considered to belong to the same group.
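For instance, a minimal sketch of this pairwise test is given below; the helper name and the default threshold value are assumptions for illustration only.

```python
import numpy as np

def same_group(embedding_a, embedding_b, x=0.5):
    """Return True if two preselection boxes are judged to belong
    to the same group, i.e. the same detection object."""
    # Euclidean distance between the two N-dimensional embedding values.
    distance = np.linalg.norm(np.asarray(embedding_a) - np.asarray(embedding_b))
    return distance < x
```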
The group to which each of the other preselection boxes belongs is determined in the same manner, which is not repeated here one by one.
Through the above processing, the detection object to which each preselection box belongs can be determined accurately, thereby further reducing the probability of missed detection of detection objects.
Optionally, determining, by the clustering module of the association modeling model and based on the attribute feature vector of each preselection box, the group to which each preselection box in the at least one preselection box belongs, so as to obtain the at least one preselection box group, may be implemented as follows:
Step S1: calculating a vector distance value between every two of the attribute feature vectors to obtain a plurality of vector distance values;
Step S2: adding, to the same group, the two preselection boxes corresponding to any vector distance value that is less than a preset threshold, and treating each remaining preselection box that has not been added to a group as a group of its own;
Step S3: clustering the obtained at least one group by a clustering algorithm to obtain the at least one preselection box group.
In the embodiments of the present application, the above embedding encoding network regresses an attribute feature vector for every preselection box, and the vector distance value between every two attribute feature vectors is calculated, for example by a distance measure such as the Euclidean distance.
Afterwards, each of the obtained vector distance values is compared with the preset threshold, where the preset threshold may be determined according to actual needs or according to experience, which is not specifically limited in this embodiment. If a vector distance value is less than the preset threshold, it is determined to be a target vector distance value, and the two preselection boxes corresponding to it are considered to correspond to the same detection object; the two preselection boxes corresponding to the target vector distance value are therefore added to the same group. A preselection box whose attribute feature vector is at a distance of not less than the preset threshold from every other attribute feature vector is treated as a group of its own. In this way, at least one group can be obtained.
It should be noted that if the two pairs of preselection boxes corresponding to two different target vector distance values share a common preselection box, that is, the two different target vector distance values correspond to three distinct preselection boxes, the three preselection boxes may all be added to the same group.
After the at least one group is obtained, the obtained at least one group is clustered by a clustering algorithm.
It should be noted that the clustering algorithm may be a commonly used algorithm, for example, a K-means clustering algorithm or a mean-shift clustering algorithm.
For example, suppose the image to be processed yields preselection boxes f1 to f8 and contains four detection objects A, B, C, and D. The embedding encoding network regresses an attribute feature vector, that is, an embedding value, for each of f1 to f8. The vector distance value between every two attribute feature vectors is calculated, and the target vector distance values smaller than the preset threshold are screened out as s1 to s4, where s1 is the vector distance value between preselection boxes f1 and f2, s2 is that between f2 and f3, s3 is that between f4 and f5, and s4 is that between f5 and f8. Accordingly, preselection boxes f1 and f2 (corresponding to s1) are added to the same group, and preselection boxes f2 and f3 (corresponding to s2) are added to the same group; since f1 and f2 are already in one group and f2 and f3 are in one group, f1, f2, and f3 end up in the same group, and likewise f4, f5, and f8 end up in the same group. Since the vector distance values between the attribute feature vectors of preselection boxes f6 and f7 and every other attribute feature vector are not less than the preset threshold, f6 and f7 are each treated as a group of their own. The grouping result thus comprises four groups: one containing f1, f2, and f3; one containing f4, f5, and f8; one containing f6; and one containing f7. Clustering the obtained four groups then yields four preselection box groups.
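The transitive merging in this example (f1-f2 plus f2-f3 collapsing into one group) can be realized with a small union-find structure; the sketch below is illustrative and assumes the embeddings and the threshold are already available.

```python
import numpy as np

def group_boxes(embeddings, threshold):
    """Group preselection boxes whose pairwise embedding distance is below threshold.

    embeddings: (M, N) array, one attribute feature vector per preselection box.
    Returns a list of groups, each a list of box indices.
    """
    m = len(embeddings)
    parent = list(range(m))

    def find(i):                      # find the representative of i's group
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair whose vector distance value is below the threshold;
    # merging is transitive, so f1-f2 plus f2-f3 yields one group {f1, f2, f3}.
    for i in range(m):
        for j in range(i + 1, m):
            if np.linalg.norm(embeddings[i] - embeddings[j]) < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(m):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```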
As another example, suppose the image to be processed yields preselection boxes f1 to f4 and contains three detection objects A, B, and C, and the embedding encoding network regresses attribute feature vectors (embedding values) a1, a2, a3, and a4 for f1 to f4 respectively. If the Euclidean distance between vector a1 and vector a4 is less than the preset threshold, a1 and a4 are considered to belong to the same detection object among A, B, and C. If the vector distance values between a1 and a2, between a1 and a3, and between a2 and a3 are all not less than the preset threshold, then a1, a2, and a3 are considered pairwise not to belong to the same detection object; if, in addition, the vector distance values between a2 and a4 and between a3 and a4 are both not less than the preset threshold, it can be determined that a2 belongs to one of the detection objects A, B, and C, and a3 belongs to a detection object different both from that of a2 and from that of a1 and a4. The resulting grouping may then be: a1 and a4 belong to A, a2 belongs to B, and a3 belongs to C.
After the group to which each preselection box in the at least one preselection box belongs has been determined and the at least one preselection box group has been obtained, deduplication processing can be performed on each preselection box group to obtain deduplicated preselection box groups, and the target detection frame of each detection object can then be determined based on the deduplicated preselection box groups.
As can be seen from the above description, each preselection box group may include a visible frame group and a complete frame group. On this basis, step S208 of performing deduplication processing on each preselection box group to obtain the deduplicated preselection boxes includes: performing deduplication processing on the visible frame group in the at least one preselection box group to obtain a deduplicated visible frame group, which may include a single visible frame or a set of visible frames.
Step S210 of determining the target detection frame of each detection object based on the deduplicated preselection box groups includes: determining the target detection frame of each detection object based on the deduplicated visible frame group and the complete frame group.
Specifically, in this embodiment, first, an image to be processed containing one or more detection objects is acquired; then, object detection is performed on the image to be processed to obtain at least one preselection box; next, the group to which each preselection box in the at least one preselection box belongs is determined to obtain at least one preselection box group; after that, deduplication processing is performed on the visible frame group in the at least one preselection box group to obtain a deduplicated visible frame group; finally, the target detection frame of each detection object is determined based on the deduplicated visible frame group and the complete frame group.
As can be seen from the above description, in the embodiments of the present application, the detection objects to be identified may be densely present in the image to be processed, so the complete frames of different detection objects tend to overlap heavily. To reduce the complexity of deduplication, deduplication may be performed only on the visible frame group within each preselection box group. The target detection frame of each detection object can then be determined from the deduplicated visible frame group and the non-deduplicated complete frame group.
Specifically, in this embodiment, the deduplicated visible frame group and the non-deduplicated complete frame group can be input into an R-CNN model for object detection, so as to obtain the target detection frame of each detection object.
It should be noted that, in the embodiments of the present application, when object detection is performed again with the deduplicated visible frame group and the non-deduplicated complete frame group as the input of the R-CNN model, for an occluded object, only the visible frame group or only the complete frame group may be used as the input of the R-CNN model to improve detection efficiency, or the visible frame group and the complete frame group may be used together as the input of the R-CNN model to improve detection accuracy, which is not specifically limited in this embodiment.
Optionally, in this embodiment, the step of performing deduplication processing on the visible frame group in the at least one preselection box group to obtain the deduplicated visible frame group includes: performing deduplication processing on the visible frame group in the at least one preselection box group by using a non-maximum suppression algorithm to obtain the deduplicated visible frame group.
In the embodiments of the present application, a non-maximum suppression (NMS) algorithm is used to remove redundant preselection boxes from the preselection box group, and the visible frame group in the preselection box group is deduplicated by setting the threshold in the NMS algorithm. After the preselection box group of each detection object is obtained, since the complete frames in the complete frame group overlap heavily, the complete frames may be left without deduplication. Therefore, the NMS algorithm is applied only to the visible frame group, yielding the deduplicated visible frame group. That is, in this embodiment, after the preselection box group of a detection object is obtained, if the group contains both a visible frame group and a complete frame group, deduplication may be performed on the visible frame group of the detection object.
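A standard greedy NMS pass applied only to the visible frames might look as follows; the boxes are assumed to be (x1, y1, x2, y2) with one score each, and the default IoU threshold is an illustrative assumption.

```python
import numpy as np

def nms_visible(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression applied only to visible frames.

    boxes:  (K, 4) array of (x1, y1, x2, y2) visible frames.
    scores: (K,) confidence scores.
    Returns the indices of the frames kept after deduplication.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Keep only boxes whose overlap with the kept box is below the threshold.
        order = order[1:][iou < iou_thresh]
    return keep
```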
It should be noted that reference is made to Fig. 3, a schematic diagram of the visible frames and complete frames of densely occluded objects of the same kind. In Fig. 3, frame 1 and frame 3 on the left are the complete frames of the occluding object P and the occluded object Q, respectively. In human body detection in densely occluded crowds, the NMS algorithm typically deduplicates across the preselection boxes of all detection objects of the same category and cannot distinguish instances (different detection objects) well. The intersection-over-union between frame 1 and frame 3 is generally greater than the threshold preset in NMS, which leads to two problems: if the threshold is set too high, redundant preselection boxes cannot be removed effectively; if it is set too low, frame 3 of the occluded object Q behind is easily deleted, resulting in missed detection of the occluded object Q.
The same problem exists between frame 5 and frame 6 on the right. The dashed frame 2 is the visible frame of the occluded object Q. It can be seen that the overlap between frame 2 (the visible part of the occluded object Q) and frame 1 (the complete frame of the occluding object P) is clearly smaller than the overlap between frame 3 and frame 1. The occluding object P and the occluded object Q can therefore be distinguished by frame 2: the visible frame 2 and the complete frame 3 are bound together into one preselection box group, which prevents frame 3 from being discarded during deduplication as a redundant box of the occluding object P.
With the deduplicated visible frame group and the complete frame group, the computation is simplified and the computation speed and accuracy of the R-CNN model are improved, so that a more accurate target detection frame is obtained.
Optionally, in this embodiment, the step of determining the target detection frame of each detection object based on the deduplicated visible frame group and the complete frame group includes:
Step S21: performing local feature alignment processing on each visible frame in the deduplicated visible frame group, and performing local feature alignment processing on each complete frame in the complete frame group;
Step S22: inputting the feature-aligned visible frames and the feature-aligned complete frames into a target object detection model for detection processing, so as to obtain the position coordinates and classification probability values of the feature-aligned visible frames as well as the position coordinates and classification probability values of the feature-aligned complete frames;
Step S23: determining the target detection frame of each detection object based on target position coordinates and target classification probability values, where the target position coordinates include the position coordinates of the feature-aligned visible frames and/or the position coordinates of the feature-aligned complete frames, and the target classification probability values include the classification probability values of the feature-aligned visible frames and/or the classification probability values of the feature-aligned complete frames.
In the embodiments of the present application, local feature alignment processing is first performed on each visible frame in the visible frame group and on each complete frame in the complete frame group. The purpose of the local feature alignment processing is to adjust all visible frames in the visible frame group and all complete frames in the complete frame group to the same size.
Optionally, an R-CNN model may be selected as the target object detection model. After local feature alignment has been performed on the deduplicated visible frame group and on the complete frames in the complete frame group, the aligned visible frames and aligned complete frames can be used to determine the target detection frame of the corresponding detection object.
Optionally, the aligned visible frames and/or the aligned complete frames may be used as the input of the target object detection model (for example, an R-CNN model); after the detection processing of the model, the coordinate position and classification probability value of each visible frame and the coordinate position and classification probability value of each complete frame are obtained.
Since the detection object to which each visible frame or complete frame belongs has already been determined, the visible frames and complete frames of each detection object can be fused separately according to their target position coordinates and target classification probability values, and the fused visible frame or fused complete frame is the target detection frame of the corresponding detection object. For a detection object that is not occluded, its target detection frame is its final complete frame, which is a detection frame obtained by fusing one or more complete frames; for an occluded detection object, its target detection frames are its final complete frame and final visible frame, where the final visible frame is a detection frame obtained by fusing one or more visible frames. For an occluded detection object, its complete frames and visible frames are fused separately to obtain the final complete frame and the final visible frame.
It should be noted that only the feature-aligned visible frames may be used as the input of the target object detection model, only the feature-aligned complete frames may be used, or the feature-aligned visible frames and the feature-aligned complete frames may be used together, which is not specifically limited in this embodiment.
Optionally, in this embodiment, step S23 of determining the target detection frame of each detection object based on the target position coordinates and the target classification probability values includes the following steps:
Step S231: using the target classification probability values as the weights of the corresponding target position coordinates;
Step S232: calculating a weighted average of the target position coordinates of each detection object according to the target classification probability values to obtain the target detection frame of the detection object, where the target detection frame includes a final visible frame and/or a final complete frame.
In the embodiments of the present application, the target position coordinates of a visible frame represent its position in the image to be processed, and its target classification probability value represents an evaluation of the detection result for that visible frame; the target position coordinates and target classification probability value of a complete frame are interpreted likewise. A higher target classification probability value indicates a better detection result for the visible or complete frame, so it is given a higher weight: the target classification probability value is used as the weight, and a weighted average of the target position coordinates is computed to obtain the target detection frame of the object. The target detection frame obtained by the weighted-average method fuses the detection evaluation results of all the visible frames or all the complete frames, so its position is also closer to the actual position of the detection object.
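As an illustration, a minimal sketch of this weighted-average fusion is given below, assuming (x1, y1, x2, y2) coordinates and one classification probability per frame; the helper name is hypothetical.

```python
import numpy as np

def fuse_frames(coords, probs):
    """Fuse the frames of one detection object into its target detection frame.

    coords: (K, 4) array of frame position coordinates (x1, y1, x2, y2).
    probs:  (K,) classification probability values, used as weights.
    Returns the weighted-average frame coordinates.
    """
    probs = np.asarray(probs, dtype=float)
    # Each coordinate is averaged with the classification probability as weight,
    # so frames with better detection results contribute more.
    return (np.asarray(coords) * probs[:, None]).sum(axis=0) / probs.sum()
```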
It should be noted that the target detection frame is the precise visible frame or the precise complete frame of the finally detected object, where the precise visible frame is the smallest bounding box that accurately describes the largest visible area of the occluded detection object.
Optionally, in this embodiment, if the feature pyramid includes a plurality of feature maps, performing local feature alignment processing on each visible frame in the deduplicated visible frame group includes the following steps:
Step S31: selecting a first target feature map from the feature pyramid;
Step S32: performing feature cropping on the first target feature map in the feature pyramid based on each visible frame in the deduplicated visible frame group to obtain a first cropping result, and performing local feature alignment processing on the first cropping result.
In the embodiments of the present application, the first target feature map is the feature map in the feature pyramid that corresponds to the visible frames in the visible frame group. The feature pyramid contains feature maps of different scales, which are obtained by the pyramid network scaling the image to be processed by different ratios.
After the first target feature map corresponding to a visible frame is determined, the visible frame can be scaled by the ratio of the first target feature map relative to the image to be processed, the position of the scaled visible frame is determined in the first target feature map, and the features of the first target feature map at that position, together with their position information, are taken as the first cropping result. Local feature alignment processing is performed on the first cropping result, and the aligned first cropping result is input into the target object detection model for object detection.
It should be noted that the ROI Align module in Mask R-CNN may be used to crop out the features corresponding to a visible frame, and the R-CNN model may then be used to perform further local feature alignment processing on the first cropping result.
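A simplified sketch of this crop-and-align step is given below. It assumes a single pyramid level stored as a (C, H, W) array and a known scale factor relative to the image to be processed, and it substitutes a crude nearest-neighbour resize for ROI Align purely for brevity; a real implementation would use bilinear ROI Align.

```python
import numpy as np

def crop_visible_feature(feature_map, box, scale, out_size=7):
    """Crop the first target feature map with a visible frame.

    feature_map: (C, H, W) array, one level of the feature pyramid.
    box:         (x1, y1, x2, y2) visible frame in image coordinates.
    scale:       ratio of the feature map relative to the image to be processed.
    Returns a (C, out_size, out_size) aligned feature patch.
    """
    # Scale the visible frame into feature-map coordinates.
    x1, y1, x2, y2 = [int(round(v * scale)) for v in box]
    x1, y1 = max(x1, 0), max(y1, 0)
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)
    patch = feature_map[:, y1:y2, x1:x2]

    # Crude nearest-neighbour resize to a fixed size (stand-in for ROI Align).
    c, h, w = patch.shape
    ys = np.linspace(0, h - 1, out_size).round().astype(int)
    xs = np.linspace(0, w - 1, out_size).round().astype(int)
    return patch[:, ys][:, :, xs]
```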
Optionally, in this embodiment, if the feature pyramid includes a plurality of feature maps, performing local feature alignment processing on each complete frame in the complete frame group includes the following steps:
Step S41: selecting a second target feature map from the feature pyramid;
Step S42: performing feature cropping on the second target feature map in the feature pyramid based on each complete frame in the complete frame group to obtain a second cropping result;
Step S43: performing local feature alignment processing on the second cropping result.
In the embodiments of the present application, the second target feature map is the feature map in the feature pyramid that corresponds to the complete frames in the complete frame group. Since the feature pyramid contains feature maps of different scales, obtained by scaling the image to be processed by different ratios, after the second target feature map corresponding to a complete frame is determined, the complete frame is scaled by the ratio of the second target feature map relative to the image to be processed, the position of the scaled complete frame is determined in the second target feature map, and the features of the second target feature map at that position, together with their position information, are taken as the second cropping result. Before the second cropping result is input into the target object detection model, local feature alignment processing is performed on it.
It should be noted that the ROI Align module in Mask R-CNN may be used to crop out the features corresponding to a complete frame, and the R-CNN model may then be used to perform further local feature alignment processing on the second cropping result.
In the embodiments of the present application, compared with existing object detection algorithms that consider object detection only at the category level, the method provided by the embodiments of the present application can distinguish and recognize detection objects well. When multiple objects, especially objects of the same kind, appear densely and occlude one another, visible frames and complete frames are used as regression targets in the RPN stage; meanwhile, the generated preselection boxes are distinguished by a latent variable (embedding value) according to the different detection objects they correspond to, so that preselection boxes are distinguished not only across object categories but also across detection objects. R-CNN is then used to regress the deduplicated results again, and the regression results of different detection objects are fused into frames to obtain the final detection results, thereby realizing the recognition of occluded objects under dense occlusion and avoiding missed detection of occluded objects.
Embodiment 3:
The embodiments of the present application further provide an object detection apparatus, which is mainly configured to execute the object detection method provided in the above content of the embodiments of the present application. The object detection apparatus provided by the embodiments of the present application is specifically introduced below.
Fig. 5 is a schematic diagram of an object detection apparatus according to an embodiment of the present application. As shown in Fig. 5, the object detection apparatus mainly includes an image acquisition unit 10, a preselection box acquisition unit 20, a grouping unit 30, a deduplication unit 40, and a determination unit 50, where:
the image acquisition unit 10 is configured to acquire an image to be processed containing one or more detection objects;
the preselection box acquisition unit 20 is configured to perform object detection on the image to be processed to obtain at least one preselection box, where the preselection box includes a visible frame and/or a complete frame, the complete frame is a bounding box enclosing a detection object as a whole, and the visible frame is a bounding box of the visible area of each detection object in the image to be processed;
the grouping unit 30 is configured to determine, through an association modeling model, the group to which each preselection box in the at least one preselection box belongs, so as to obtain at least one preselection box group, where the preselection boxes in the same preselection box group belong to the same detection object;
the deduplication unit 40 is configured to perform deduplication processing on each preselection box group to obtain deduplicated preselection box groups;
the determination unit 50 is configured to determine the target detection frame of each detection object based on the deduplicated preselection box groups.
In the embodiments of the present application, an image to be processed containing one or more detection objects is first acquired; object detection is then performed on the image to be processed to obtain at least one preselection box; next, the group to which each preselection box in the at least one preselection box belongs is determined to obtain at least one preselection box group; deduplication processing is performed on the preselection box groups to remove redundant preselection boxes and obtain deduplicated preselection box groups; and the target detection frame of each detection object is determined based on the deduplicated preselection box groups. Detection of one or more detection objects in the image to be processed is thus realized, effectively avoiding missed detection of detection objects.
Optionally, each preselection box group includes a visible frame group and a complete frame group, and the deduplication unit 40 is further configured to perform deduplication processing on the visible frame group in the at least one preselection box group to obtain a deduplicated visible frame group; determining the target detection frame of each detection object based on the deduplicated preselection box groups includes determining the target detection frame of each detection object based on the deduplicated visible frame group and the complete frame group.
Optionally, the preselection box acquisition unit 20 is further configured to: input the image to be processed into a feature pyramid network for processing to obtain a feature pyramid; and process the feature pyramid by using a region proposal network (RPN) model to obtain the at least one preselection box, where each preselection box in the at least one preselection box carries an attribute label configured to determine the type of the preselection box, the type including complete frame and visible frame.
Optionally, the grouping unit 30 determining, through the association modeling model, the group to which each preselection box in the at least one preselection box belongs to obtain the at least one preselection box group includes: obtaining the attribute feature vector of each preselection box in the at least one preselection box through the instance attribute feature projection network of the association modeling model; and determining, through the clustering module of the association modeling model and based on the attribute feature vector of each preselection box, the group to which each preselection box in the at least one preselection box belongs, so as to obtain the at least one preselection box group.
Optionally, the instance attribute feature projection network is obtained through training with an Lpull loss function and an Lpush loss function, where the Lpull loss function pulls the attribute feature vectors of preselection boxes belonging to the same detection object closer together, and the Lpush loss function pushes the attribute feature vectors of preselection boxes belonging to different detection objects farther apart.
Optionally, the grouping unit 30, through the clustering module of the association modeling model, calculates the vector distance value between every two of the attribute feature vectors to obtain a plurality of vector distance values; adds, to the same group, the two preselection boxes corresponding to any vector distance value that is less than a preset threshold, with each remaining preselection box that has not been added to a group treated as a group of its own; and clusters the obtained at least one group by a clustering algorithm to obtain the at least one preselection box group.
Optionally, the deduplication unit 40 is further configured to perform deduplication processing on the visible frame group in the at least one preselection box group by using a non-maximum suppression algorithm to obtain the deduplicated visible frame group.
Optionally, the determination unit 50 is further configured to: perform local feature alignment processing on each visible frame in the deduplicated visible frame group and on each complete frame in the complete frame group; input the feature-aligned visible frames and the feature-aligned complete frames into a target object detection model for detection processing, so as to obtain the position coordinates and classification probability values of the feature-aligned visible frames and the position coordinates and classification probability values of the feature-aligned complete frames; and determine the target detection frame of each detection object based on target position coordinates and target classification probability values, where the target position coordinates include the position coordinates of the feature-aligned visible frames and/or the position coordinates of the feature-aligned complete frames, and the target classification probability values include the classification probability values of the feature-aligned visible frames and/or the classification probability values of the feature-aligned complete frames.
Optionally, the determination unit 50 is further configured to: use the target classification probability values as the weights of the corresponding target position coordinates; and calculate a weighted average of the target position coordinates of each detection object according to the target classification probability values to obtain the target detection frame of the detection object, the target detection frame including a final visible frame and/or a final complete frame.
Optionally, the feature pyramid includes a plurality of feature maps, and the determination unit 50 is further configured to: select a first target feature map from the feature pyramid; perform feature cropping on the first target feature map in the feature pyramid based on each visible frame in the deduplicated visible frame group to obtain a first cropping result; and perform local feature alignment processing on the first cropping result.
Optionally, the feature pyramid includes a plurality of feature maps, and the determination unit 50 is further configured so that performing local feature alignment processing on each complete frame in the complete frame group includes: selecting a second target feature map from the feature pyramid; performing feature cropping on the second target feature map in the feature pyramid based on each complete frame in the complete frame group to obtain a second cropping result; and performing local feature alignment processing on the second cropping result.
The implementation principles and technical effects of the apparatus provided by the embodiments of the present application are the same as those of the foregoing method embodiments. For brevity, for anything not mentioned in the apparatus embodiments, reference may be made to the corresponding content in the foregoing method embodiments.
In addition, in the description of the embodiments of the present application, unless otherwise expressly specified and limited, the terms "mounted", "connected", and "coupled" should be understood broadly; for example, a connection may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection, an indirect connection through an intermediate medium, or internal communication between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in this application can be understood according to specific circumstances.
In the description of this application, it should be noted that the orientations or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", and the like are based on the orientations or positional relationships shown in the drawings, and are used only for convenience and simplicity of description rather than indicating or implying that the referred devices or elements must have a particular orientation or be constructed and operated in a particular orientation; they therefore cannot be understood as limiting this application. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance.
The computer program product of the object detection method provided by the embodiments of the present application includes a computer-readable storage medium storing non-volatile program code executable by a processor, and the instructions included in the program code can be configured to execute the methods described in the foregoing method embodiments; for specific implementations, reference may be made to the method embodiments, which are not repeated here.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and there may be other divisions in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, the part contributing to the prior art, or a part of the technical solution may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are merely specific implementations of this application, used to illustrate rather than limit the technical solutions of this application, and the protection scope of this application is not limited thereto. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that, within the technical scope disclosed in this application, any person familiar with the technical field can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all be covered within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (12)

  1. An object detection method, characterized by comprising:
    acquiring an image to be processed containing one or more detection objects;
    performing object detection on the image to be processed to obtain at least one preselection box, wherein the preselection box comprises a visible frame and/or a complete frame, the complete frame is a bounding box enclosing a detection object as a whole, and the visible frame is a bounding box of the visible area of each detection object in the image to be processed;
    determining, through an association modeling model, a group to which each preselection box in the at least one preselection box belongs, to obtain at least one preselection box group, wherein preselection boxes in the same preselection box group belong to the same detection object;
    performing deduplication processing on each preselection box group to obtain deduplicated preselection box groups;
    determining a target detection frame of each detection object based on the deduplicated preselection box groups.
  2. The method according to claim 1, wherein determining, through the association modeling model, the group to which each preselection box in the at least one preselection box belongs to obtain the at least one preselection box group comprises:
    obtaining an attribute feature vector of each preselection box in the at least one preselection box through an instance attribute feature projection network of the association modeling model;
    determining, through a clustering module of the association modeling model and based on the attribute feature vector of each preselection box, the group to which each preselection box in the at least one preselection box belongs, to obtain the at least one preselection box group.
  3. The method according to claim 2, wherein the instance attribute feature projection network is obtained through training with an Lpull loss function and an Lpush loss function;
    wherein the Lpull loss function pulls the attribute feature vectors of preselection boxes belonging to the same detection object closer together, and the Lpush loss function pushes the attribute feature vectors of preselection boxes belonging to different detection objects farther apart.
  4. The method according to claim 2, wherein determining, through the clustering module of the association modeling model and based on the attribute feature vector of each preselection box, the group to which each preselection box in the at least one preselection box belongs to obtain the at least one preselection box group comprises:
    calculating a vector distance value between every two of the attribute feature vectors to obtain a plurality of vector distance values;
    adding, to the same group, the two preselection boxes corresponding to any of the plurality of vector distance values that is less than a preset threshold, and treating each remaining preselection box that has not been added to a group as a separate group;
    clustering the obtained at least one group by a clustering algorithm to obtain the at least one preselection box group.
  5. The method according to claim 1, wherein each preselection box group comprises a visible frame group and a complete frame group; performing deduplication processing on each preselection box group to obtain the deduplicated preselection boxes comprises: performing deduplication processing on the visible frame group in the at least one preselection box group to obtain a deduplicated visible frame group;
    determining the target detection frame of each detection object based on the deduplicated preselection box groups comprises: determining the target detection frame of each detection object based on the deduplicated visible frame group and the complete frame group.
6. The method according to claim 5, wherein performing deduplication processing on the visible frame group in the at least one preselection box group to obtain the deduplicated visible frame group comprises:
    performing deduplication processing on the visible frame group in the at least one preselection box group by using a non-maximum suppression algorithm, to obtain the deduplicated visible frame group.
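For illustration only, a standard non-maximum suppression pass over one visible-frame group might look as follows; the 0.5 IoU threshold and the x1, y1, x2, y2 box convention are assumptions, since the patent only names the algorithm.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5):
    """Keep the highest-scoring frames in one group, dropping near-duplicates."""
    order = scores.argsort()[::-1]               # indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]       # suppress heavy overlaps
    return keep
```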
7. The method according to claim 6, wherein determining the target detection frame of each detection object based on the deduplicated visible frame group and the complete frame group comprises:
    performing local feature alignment processing on each visible frame in the deduplicated visible frame group, and performing local feature alignment processing on each complete frame in the complete frame group;
    inputting the feature-aligned visible frames and the feature-aligned complete frames into a target detection model for detection processing, to obtain position coordinates and classification probability values of the feature-aligned visible frames, and position coordinates and classification probability values of the feature-aligned complete frames; and
    determining the target detection frame of each detection object based on target position coordinates and target classification probability values, wherein the target position coordinates comprise the position coordinates of the feature-aligned visible frames and/or the position coordinates of the feature-aligned complete frames, and the target classification probability values comprise the classification probability values of the feature-aligned visible frames and/or the classification probability values of the feature-aligned complete frames.
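For illustration only: box-local feature alignment is commonly implemented with RoIAlign. The sketch below uses torchvision.ops.roi_align; the operator choice, the feature stride behind spatial_scale, and the 7x7 output size are assumptions, not details from the patent.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 50, 76)         # e.g. a stride-16 FPN level
frames = torch.tensor([[ 10.,  20.,  90., 120.],  # visible or complete frames,
                       [ 30.,  40., 150., 200.]]) # x1, y1, x2, y2 in image pixels

# One fixed-size 256x7x7 feature patch per frame; boxes are passed as a list
# with one tensor per image in the batch.
aligned = roi_align(feature_map, [frames], output_size=(7, 7),
                    spatial_scale=1.0 / 16.0,     # assumed feature stride of 16
                    sampling_ratio=2)
print(aligned.shape)  # torch.Size([2, 256, 7, 7])
```

The aligned patches would then feed the target detection model's classification and box-regression heads.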
8. The method according to claim 7, wherein determining the target detection frame of each detection object based on the target position coordinates and the target classification probability values comprises:
    using each target classification probability value as the weight of its corresponding target position coordinates;
    calculating a weighted average of the target position coordinates of each detection object according to the target classification probability values, to obtain the target detection frame of that detection object;
    wherein the target detection frame comprises a target visible frame and/or a target complete frame.
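For illustration only, the probability-weighted averaging of claim 8 reduces to a few lines of NumPy; normalizing the weights so they sum to one is an assumption about the intended arithmetic.

```python
import numpy as np

def fuse_frames(coords: np.ndarray, probs: np.ndarray) -> np.ndarray:
    """coords: (N, 4) frame coordinates for one detection object;
    probs: (N,) classification probability of each frame."""
    weights = probs / probs.sum()                 # probabilities act as weights
    return (coords * weights[:, None]).sum(axis=0)

# Example: the higher-confidence frame dominates the fused target frame.
coords = np.array([[10., 20., 90., 120.],
                   [12., 22., 94., 126.]])
probs = np.array([0.9, 0.3])
print(fuse_frames(coords, probs))  # [ 10.5  20.5  91.  121.5]
```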
9. The method according to claim 1, wherein performing object detection on the image to be processed to obtain the at least one preselection box comprises:
    inputting the image to be processed into a feature pyramid network for processing to obtain a feature pyramid; and
    processing the feature pyramid with a region proposal network (RPN) model to obtain the at least one preselection box, wherein each of the at least one preselection box carries an attribute label, and the attribute label is configured to determine the type of the preselection box, the type comprising a complete frame and a visible frame.
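For illustration only, the feature-pyramid step of claim 9 can be sketched with torchvision's FeaturePyramidNetwork; the backbone channel widths, level names, and input resolution below are illustrative assumptions, and the RPN head that would consume the pyramid is omitted.

```python
import torch
from collections import OrderedDict
from torchvision.ops import FeaturePyramidNetwork

# Lateral 1x1 convs map each backbone stage to a common 256-channel width.
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

# Assumed ResNet-style backbone outputs for an 800x1216 input (strides 4 to 32).
features = OrderedDict([
    ("c2", torch.randn(1, 256, 200, 304)),
    ("c3", torch.randn(1, 512, 100, 152)),
    ("c4", torch.randn(1, 1024, 50, 76)),
    ("c5", torch.randn(1, 2048, 25, 38)),
])
pyramid = fpn(features)  # OrderedDict of 256-channel maps, one per pyramid level
print({name: tuple(f.shape) for name, f in pyramid.items()})
```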
10. An object detection apparatus, comprising:
    an image acquisition unit configured to acquire an image to be processed containing one or more detection objects;
    a preselection box acquisition unit configured to perform object detection on the image to be processed to obtain at least one preselection box, wherein the preselection box comprises a visible frame and/or a complete frame, the complete frame is a bounding frame of a detection object as a whole, and the visible frame is a bounding frame of the visible area of each detection object in the image to be processed;
    a grouping unit configured to determine, through an association modeling model, the group to which each of the at least one preselection box belongs, to obtain at least one preselection box group, wherein preselection boxes in the same preselection box group belong to the same detection object;
    a deduplication unit configured to perform deduplication processing on each preselection box group to obtain deduplicated preselection box groups; and
    a determining unit configured to determine a target detection frame for each detection object based on the deduplicated preselection box groups.
11. An electronic device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 9.
12. A computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to execute the method according to any one of claims 1 to 9.
PCT/CN2019/126435 2019-03-12 2019-12-18 Object detection method and apparatus, and electronic device WO2020181872A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910186133.5 2019-03-12
CN201910186133.5A CN109948497B (en) 2019-03-12 2019-03-12 Object detection method and device and electronic equipment

Publications (1)

Publication Number Publication Date
WO2020181872A1 true WO2020181872A1 (en) 2020-09-17

Family

ID=67009787

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/126435 WO2020181872A1 (en) 2019-03-12 2019-12-18 Object detection method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN109948497B (en)
WO (1) WO2020181872A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948497B (en) * 2019-03-12 2022-01-28 北京旷视科技有限公司 Object detection method and device and electronic equipment
CN110532897B (en) * 2019-08-07 2022-01-04 北京科技大学 Method and device for recognizing image of part
CN110827261B (en) * 2019-11-05 2022-12-06 泰康保险集团股份有限公司 Image quality detection method and device, storage medium and electronic equipment
CN111178128B (en) * 2019-11-22 2024-03-19 北京迈格威科技有限公司 Image recognition method, device, computer equipment and storage medium
CN111582177A (en) * 2020-05-09 2020-08-25 北京爱笔科技有限公司 Image detection method and related device
CN112348077A (en) * 2020-11-04 2021-02-09 深圳Tcl新技术有限公司 Image recognition method, device, equipment and computer readable storage medium
CN113761245B (en) * 2021-05-11 2023-10-13 腾讯科技(深圳)有限公司 Image recognition method, device, electronic equipment and computer readable storage medium
CN117237697A (en) * 2023-08-01 2023-12-15 北京邮电大学 Small sample image detection method, system, medium and equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012169119A1 (en) * 2011-06-10 2012-12-13 パナソニック株式会社 Object detection frame display device and object detection frame display method
US9697599B2 (en) * 2015-06-17 2017-07-04 Xerox Corporation Determining a respiratory pattern from a video of a subject
CN106529527A (en) * 2016-09-23 2017-03-22 北京市商汤科技开发有限公司 Object detection method and device, data processing deice, and electronic equipment
US10657364B2 (en) * 2016-09-23 2020-05-19 Samsung Electronics Co., Ltd System and method for deep network fusion for fast and robust object detection
CN108399388A (en) * 2018-02-28 2018-08-14 福州大学 A kind of middle-high density crowd quantity statistics method
CN109190458B (en) * 2018-07-20 2022-03-25 华南理工大学 Method for detecting head of small person based on deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633655B1 (en) * 1998-09-05 2003-10-14 Sharp Kabushiki Kaisha Method of and apparatus for detecting a human face and observer tracking display
CN106557778A (en) * 2016-06-17 2017-04-05 北京市商汤科技开发有限公司 Generic object detection method and device, data processing equipment and terminal device
CN108960266A (en) * 2017-05-22 2018-12-07 阿里巴巴集团控股有限公司 Image object detection method and device
CN109948497A (en) * 2019-03-12 2019-06-28 北京旷视科技有限公司 A kind of object detecting method, device and electronic equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699881A (en) * 2020-12-31 2021-04-23 北京一起教育科技有限责任公司 Image identification method and device and electronic equipment
CN113743333A (en) * 2021-09-08 2021-12-03 苏州大学应用技术学院 Strawberry maturity identification method and device
CN113743333B (en) * 2021-09-08 2024-03-01 苏州大学应用技术学院 Strawberry maturity recognition method and device
CN113987667A (en) * 2021-12-29 2022-01-28 深圳小库科技有限公司 Building layout grade determining method and device, electronic equipment and storage medium
CN113987667B (en) * 2021-12-29 2022-05-03 深圳小库科技有限公司 Building layout grade determining method and device, electronic equipment and storage medium
CN115731517A (en) * 2022-11-22 2023-03-03 南京邮电大学 Crowd detection method based on Crowd-RetinaNet network
CN115731517B * 2022-11-22 2024-02-20 南京邮电大学 Crowd detection method based on Crowd-RetinaNet network

Also Published As

Publication number Publication date
CN109948497B (en) 2022-01-28
CN109948497A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
WO2020181872A1 (en) Object detection method and apparatus, and electronic device
JP5554984B2 (en) Pattern recognition method and pattern recognition apparatus
US10096122B1 (en) Segmentation of object image data from background image data
US10402627B2 (en) Method and apparatus for determining identity identifier of face in face image, and terminal
JP6458394B2 (en) Object tracking method and object tracking apparatus
US11113842B2 (en) Method and apparatus with gaze estimation
CN108140032B (en) Apparatus and method for automatic video summarization
WO2019218824A1 (en) Method for acquiring motion track and device thereof, storage medium, and terminal
CN110135246A (en) A kind of recognition methods and equipment of human action
WO2020238897A1 (en) Panoramic image and video splicing method, computer-readable storage medium, and panoramic camera
WO2023082882A1 (en) Pose estimation-based pedestrian fall action recognition method and device
WO2018099032A1 (en) Target tracking method and device
WO2018082308A1 (en) Image processing method and terminal
KR101912748B1 (en) Scalable Feature Descriptor Extraction and Matching method and system
CN111160291B (en) Human eye detection method based on depth information and CNN
CN111008935B (en) Face image enhancement method, device, system and storage medium
WO2016139964A1 (en) Region-of-interest extraction device and region-of-interest extraction method
CN108764100B (en) Target behavior detection method and server
CN111709296A (en) Scene identification method and device, electronic equipment and readable storage medium
CN109961103B (en) Training method of feature extraction model, and image feature extraction method and device
JP5648452B2 (en) Image processing program and image processing apparatus
WO2020001016A1 (en) Moving image generation method and apparatus, and electronic device and computer-readable storage medium
WO2021027329A1 (en) Image recognition-based information push method and apparatus, and computer device
US20210125013A1 (en) Image recognition system and updating method thereof
CN109858464B (en) Bottom database data processing method, face recognition device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 19919394; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 19919394; Country of ref document: EP; Kind code of ref document: A1