CN109948497B - Object detection method and device and electronic equipment - Google Patents

Object detection method and device and electronic equipment

Info

Publication number
CN109948497B
Authority
CN
China
Prior art keywords
frame
preselected
detection
visible
group
Prior art date
Legal status
Active
Application number
CN201910186133.5A
Other languages
Chinese (zh)
Other versions
CN109948497A (en)
Inventor
李作新
俞刚
袁野
Current Assignee
Beijing Kuangshi Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd
Priority to CN201910186133.5A
Publication of CN109948497A
Priority to PCT/CN2019/126435 (published as WO2020181872A1)
Application granted
Publication of CN109948497B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition


Abstract

The invention provides an object detection method, an object detection device and an electronic device, and relates to the technical field of image recognition. The method comprises the following steps: acquiring an image to be processed containing one or more detection objects; performing object detection on the image to be processed to obtain at least one preselected frame, wherein the preselected frame comprises a visible frame and/or a complete frame, the complete frame is an enclosing frame of the whole detection object, and the visible frame is an enclosing frame of the visible area of each detection object in the image to be processed; determining the group to which each preselected frame in the at least one preselected frame belongs through a relevance modeling model to obtain at least one preselected frame group, wherein preselected frames in the same preselected frame group belong to the same detection object; performing deduplication processing on each preselected frame group to obtain deduplicated preselected frame groups; and determining a target detection frame of each detection object based on the deduplicated preselected frame groups. The invention effectively avoids missed detection of detection objects.

Description

Object detection method and device and electronic equipment
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an object detection method and apparatus, and an electronic device.
Background
Object detection is one of the classical problems in computer vision. Its task is to mark the positions of objects in an image with bounding boxes and to give the class of each object. From the traditional framework of hand-crafted features and shallow classifiers to end-to-end detection frameworks based on deep learning, object detection has become increasingly mature. However, when many objects, particularly objects of the same class, are densely arranged and occlude one another, existing object detection algorithms only consider detection at the class level, so the prior art cannot accurately detect objects under occlusion. When objects occlude each other, prior-art methods often fail to effectively distinguish the occluded object from the occluding object, so the occluded object is missed.
Disclosure of Invention
In view of the above, the present invention provides an object detection method, an object detection device and an electronic device, so as to alleviate the technical problem in the prior art that objects of the same class are easily missed when object detection is performed under dense occlusion.
In a first aspect, an embodiment of the present invention provides an object detection method, including: acquiring an image to be processed containing one or more detection objects; performing object detection on the image to be processed to obtain at least one preselection frame, wherein the preselection frame comprises a visible frame and/or a complete frame, the complete frame is an enclosing frame for the whole detection object, and the visible frame is an enclosing frame of a visible area of each detection object in the image to be processed; determining the grouping of each preselected frame in the at least one preselected frame through a relevance modeling model to obtain at least one preselected frame group; preselection frames in the same preselection frame group belong to the same detection object; carrying out duplicate removal treatment on each preselected frame group to obtain a preselected frame group after the duplicate removal treatment; and determining a target detection frame of each detection object based on the preselected frame group after the deduplication processing.
Further, determining the grouping to which each of the at least one preselected frame belongs through a relevance modeling model, and obtaining at least one preselected frame group includes: obtaining an attribute feature vector of each preselected frame in the at least one preselected frame through an example attribute feature projection network of the relevance modeling model; and determining the grouping of each preselected frame in the at least one preselected frame based on the attribute feature vector of each preselected frame through a clustering module of the relevance modeling model to obtain the at least one preselected frame group.
Further, the example attribute feature projection network is obtained through training with an Lpull loss function and an Lpush loss function; the distance between the attribute feature vectors of preselected frames belonging to the same detection object is shortened through the Lpull loss function, and the distance between the attribute feature vectors of preselected frames belonging to different detection objects is lengthened through the Lpush loss function.
Further, determining, by the clustering module of the relevance modeling model, a grouping to which each preselected box of the at least one preselected box belongs based on the attribute feature vector of each preselected box, and obtaining the at least one preselected box group includes: calculating a vector distance value between any two attribute feature vectors to obtain a plurality of vector distance values; adding two preselected frames smaller than a preset threshold value in the vector distance values to the same group, wherein each other preselected frame which is not added to the group is independently used as a group; and clustering and grouping the obtained at least one group through a clustering algorithm to obtain the at least one preselected frame group.
Further, each of the pre-selected frame groups comprises a visible frame group and a complete frame group; carrying out deduplication processing on each preselected frame group, and obtaining preselected frames after the deduplication processing comprises the following steps: carrying out duplication removal processing on the visible frame group in the at least one preselected frame group to obtain a visible frame group after duplication removal processing; determining the target detection frame of each detection object based on the preselected frame group after the deduplication processing includes: and determining a target detection frame of each detection object based on the visible frame group and the complete frame group after the de-duplication processing.
Further, performing deduplication processing on the visible frame group in the at least one preselected frame group, and obtaining the visible frame group after the deduplication processing includes: and carrying out deduplication processing on the visible frame group in the at least one preselected frame group by using a non-maximum suppression algorithm to obtain the visible frame group after the deduplication processing.
Further, determining a target detection frame of each detection object based on the visible frame group and the complete frame group after the deduplication processing comprises: performing local feature alignment processing on each visible frame in the visible frame group after the deduplication processing; and each complete frame in the complete frame group is subjected to local feature alignment treatment; inputting the visible frame subjected to the feature alignment treatment and the complete frame subjected to the feature alignment treatment into a target object detection model for detection treatment, obtaining the position coordinate and the classification probability value of the visible frame subjected to the feature alignment treatment, and obtaining the position coordinate and the classification probability value of the complete frame subjected to the feature alignment treatment; determining a target detection frame of each detection object based on target position coordinates and a target classification probability value, wherein the target position coordinates comprise: the position coordinates of the visible frame after the feature alignment processing and/or the position coordinates of the complete frame after the feature alignment processing, and the target classification probability value include: a classification probability value of a visible box after the feature alignment process and/or a classification probability value of a full box after the feature alignment process.
Further, determining the target detection box of each detection object based on the target position coordinates and the target classification probability value includes: taking the target classification probability value as the weight of the corresponding target position coordinate; calculating a weighted average value according to the target classification probability value and the target position coordinate of each detection object to obtain a target detection frame of the detection object; the target detection frame comprises a target visible frame and/or a target complete frame.
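As an illustration only (not part of the claimed subject matter), the weighted-average fusion described above can be sketched as follows; the box layout [x1, y1, x2, y2] and the example values are assumptions introduced here for illustration.

```python
import numpy as np

def fuse_boxes(boxes, scores):
    """Fuse the boxes of one detection object into a single target detection box.

    boxes:  (K, 4) array, each row an assumed [x1, y1, x2, y2] box of the same object.
    scores: (K,) classification probability values, used as the weights of the coordinates.
    Returns the weighted average of the box coordinates.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return (boxes * scores[:, None]).sum(axis=0) / scores.sum()

# Example: three complete frames of one detection object and their classification probabilities.
complete_boxes = [[10, 20, 110, 220], [12, 18, 108, 224], [9, 22, 112, 218]]
probs = [0.9, 0.8, 0.6]
print(fuse_boxes(complete_boxes, probs))  # fused target complete frame of this object
```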
Further, the object detection of the image to be processed to obtain at least one pre-selection frame includes: inputting the image to be processed into a feature pyramid network for processing to obtain a feature pyramid; and processing the feature pyramid by using a regional candidate network (RPN) model to obtain at least one preselected frame, wherein each preselected frame in the at least one preselected frame carries an attribute tag, the attribute tag is used for determining the type of each preselected frame, and the type comprises a complete frame and a visible frame.
In a second aspect, an embodiment of the present invention further provides an object detection apparatus, including: the image acquisition unit is used for acquiring an image to be processed containing one or more detection objects; a preselected frame acquiring unit, configured to perform object detection on the image to be processed to obtain at least one preselected frame, where the preselected frame includes a visible frame and/or a complete frame, the complete frame is an enclosing frame for an entire detection object, and the visible frame is an enclosing frame of a visible region of each detection object in the image to be processed; the grouping unit is used for determining the grouping of each preselected frame in the at least one preselected frame through a relevance modeling model to obtain at least one preselected frame group; preselection frames in the same preselection frame group belong to the same detection object; the duplication removing unit is used for carrying out duplication removing processing on each preselected frame group to obtain the preselected frame group after the duplication removing processing; and the determining unit is used for determining the target detection frame of each detection object based on the preselected frame group after the deduplication processing.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory and a processor, where the memory stores a computer program operable on the processor, and the processor implements the steps of the method when executing the computer program.
In a fourth aspect, the present invention also provides a computer-readable medium having a non-volatile program code executable by a processor, where the program code causes the processor to execute the method described above.
In the embodiment of the invention, firstly, an image to be processed containing one or more detection objects is obtained; then, object detection is performed on the image to be processed to obtain at least one preselected frame; next, the group to which each preselected frame in the at least one preselected frame belongs is determined, and at least one preselected frame group is obtained. The embodiment of the invention obtains at least one preselected frame group by determining the group to which each preselected frame belongs. Because the preselected frames in the same preselected frame group belong to the same detection object, preselected frames belonging to different detection objects are distinguished by the preselected frame groups, and the preselected frames of an occluded object are prevented from being removed as redundant preselected frames of the occluding object during deduplication. This solves the technical problem in the prior art that objects of the same class are easily missed when object detection is performed under dense occlusion, realizes the detection of one or more detection objects in the image to be processed, and effectively avoids missed detection of the detection objects.
Meanwhile, at least one preselected frame group is determined through the relevance modeling model, the relevance modeling model is realized through a neural network, after at least one preselected frame is input into the relevance modeling model, the preselected frames are grouped by fully utilizing the characteristic information of images in the preselected frames and the position information of the preselected frames, the preselected frames of different detection objects can be effectively distinguished, and particularly, the preselected frames which are adjacent in position and similar in size but belong to different detection objects can be accurately grouped under the condition that the overlapping degree of complete frames of an occlusion object and an occluded object is high in a dense object occlusion scene.
Additional features and advantages of the disclosure will be set forth in the description which follows, or may in part be apparent from the description, or may be learned by practice of the disclosure.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic diagram of an electronic device according to an embodiment of the invention;
FIG. 2 is a flow chart of a method of object detection according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a visible frame and a complete frame of a densely covered homogeneous object according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a relationship between a pre-selection frame and a detection object according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an object detection apparatus according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one:
first, an example electronic device 100 for implementing an object detection method of an embodiment of the present invention is described with reference to fig. 1.
As shown in FIG. 1, electronic device 100 includes one or more processors 102, one or more memory devices 104, an input device 106, an output device 108, and a camera 110, which are interconnected via a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device may have other components and structures as desired.
The processor 102 may be implemented in at least one hardware form of a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), and an Application Specific Integrated Circuit (ASIC). The processor 102 may be a Central Processing Unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 100 to perform desired functions.
The storage 104 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. On which one or more computer program instructions may be stored that may be executed by processor 102 to implement client-side functionality (implemented by the processor) and/or other desired functionality in embodiments of the invention described below. Various applications and various data, such as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
The camera 110 is configured to acquire an image to be processed, where the image to be processed acquired by the camera is processed by the object detection method to obtain a target detection frame of a detection object, for example, the camera may capture an image (e.g., a photo, a video, etc.) desired by a user and then process the image by the object detection method to obtain the target detection frame of the detection object, and the camera may further store the captured image in the memory 104 for use by other components.
Exemplary electronic devices for implementing the object detection method according to embodiments of the present invention may be implemented on mobile terminals such as smart phones, tablet computers, and the like.
Example two:
in accordance with an embodiment of the present invention, there is provided an embodiment of an object detection method, it should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
Fig. 2 is a flow chart of an object detection method according to an embodiment of the present invention, as shown in fig. 2, the method including the steps of:
step S202, acquiring an image to be processed containing one or more detection objects.
In the embodiment of the present invention, the image to be processed may include various types of detection objects, for example, including human beings and non-human beings, where the non-human beings include dynamic objects and static objects, the dynamic objects may be objects of animal species, and the static objects may be objects in a static state other than human beings and animals.
In each image to be processed, a plurality of categories of objects may be included, and there may be one or more of each category of objects, for example, 2 persons and 3 dogs in the image. Various objects in the image to be processed can be displayed independently, and some objects may be blocked by other objects, so that the objects cannot be displayed completely.
It should be noted that the detection object may be one or more types of objects in the image to be processed, for which the object detection step is to be performed. The user may determine the type of the detection object according to actual needs, and this embodiment is not particularly limited.
It should be further noted that, in this embodiment, the image to be processed may be an image captured by a camera of the electronic device in the first embodiment, or may also be an image pre-stored in a memory of the electronic device, which is not specifically limited in this embodiment.
Step S204, performing object detection on the image to be processed to obtain at least one preselection frame, wherein the preselection frame comprises a visible frame and/or a complete frame, the complete frame is an enclosing frame for the whole detection object, and the visible frame is an enclosing frame of a visible area of each detection object in the image to be processed.
In the embodiment of the invention, after the image to be processed is acquired, object detection can be performed on the image to be processed through the pre-selection frame detection network. The process of performing object detection on the image to be processed is to perform object detection on a detection object which is not blocked in the image to be processed so as to output a complete frame, and the process can also be as follows: and carrying out object detection on the shielded object in the image to be processed, and simultaneously outputting the complete frame and the visible frame.
Multiple visible frames or multiple complete frames may be generated for the same detected object, and different visible frames or different complete frames may be scaled in different proportions with respect to the image to be processed.
Step S206, determining the grouping of each preselected frame in the at least one preselected frame through a relevance modeling model to obtain at least one preselected frame group; preselection frames in the same preselection frame group belong to the same detection object;
in the embodiment of the invention, after object detection, a plurality of preselected frames are respectively generated for different detection objects, wherein the preselected frames comprise visible frames and/or complete frames. Generally, the preselected frames contained in the detection result are redundant, and a deduplication process needs to be performed, in order to prevent the preselected frames of the occluded object from being removed as redundant preselected frames of the occluded object in the deduplication process, a group to which each preselected frame belongs needs to be determined, a preselected frame group is obtained according to correspondence of one group, and at least one preselected frame group can be obtained, so that the preselected frames belonging to different detection objects are distinguished through the preselected frame group. The relevance modeling model is a model capable of obtaining the relevance relation of input data and can be realized by a neural network, and after at least one pre-selection frame is input into the relevance modeling model, the relevance modeling model can effectively group the pre-selection frames according to the characteristic information of images in the pre-selection frames and by combining the position information of the pre-selection frames.
The preselection frames belonging to the same detection object can be grouped into a preselection frame group by grouping at least one preselection frame through the mode, and the preselection frame group of the same detection object can simultaneously comprise a visible frame and a complete frame, and the preselection frame group of the detection object can also simultaneously comprise a visible frame group and a complete frame group.
It should be noted that fig. 4 is a schematic diagram of the correspondence between preselected frames and detection objects. In the figure, the detection objects include an occluding object P and an occluded object Q occluded by the occluding object P, and the preselected frames include frames No. 7 to No. 12. Frames No. 7, No. 8 and No. 9 all belong to the occluding object P in the figure, and frames No. 10, No. 11 and No. 12 all belong to the occluded object Q in the figure. Frames No. 7, No. 8 and No. 9 constitute one preselected frame group, and frames No. 10, No. 11 and No. 12 constitute another preselected frame group.
After the group to which each of frames No. 7 to No. 12 belongs is determined and the preselected frame groups are obtained, the preselected frames in each preselected frame group can be deduplicated separately. This prevents confusion between frames of different objects when the preselected frames of different objects overlap heavily, prevents a preselected frame of the occluded object Q (for example, frame No. 10) from being removed as a redundant preselected frame of the occluding object P during deduplication, and greatly reduces the probability that the occluded object is missed.
Step S208, carrying out duplicate removal processing on each preselected frame group to obtain a preselected frame group after the duplicate removal processing;
in the embodiment of the invention, after the object to which each preselected frame belongs is determined, the preselected frame group of each detection object is deduplicated separately. Through grouped deduplication, preselected frames of different detection objects are prevented from being mixed up; specifically, a preselected frame of the occluded object is prevented from being removed as a redundant preselected frame of the occluding object during deduplication, which in turn avoids missed detection of the occluded object.
Step S210, determining a target detection frame of each detection object based on the preselected frame group after the deduplication processing.
In the embodiment of the present invention, after the deduplicated preselected frame groups are obtained, the target detection frame of each detection object may be determined based on the deduplicated preselected frame groups. If a detection object is not occluded in the image to be processed, its target detection frame comprises a target complete frame; if a detection object is occluded in the image to be processed, its target detection frame comprises a target complete frame and a target visible frame. The target complete frame can be used to obtain the position information of a detection object, and the image feature information of detection objects other than occluded objects; the target visible frame can be used to obtain the image feature information of an occluded object. Two types of target detection frames can thus be obtained according to the embodiment of the invention, so that more comprehensive and more accurate information of the detection object can be obtained for subsequent image processing such as recognition and verification.
In the embodiment of the present invention, the steps S202 to S210 may be executed by a processor in the electronic device in the first embodiment.
It should be noted that, the processor capable of executing the steps S202 to S210 may be applied to the embodiment of the present invention, and is not limited in particular.
In the embodiment of the invention, firstly, an image to be processed containing one or more detection objects is obtained; then, object detection is performed on the image to be processed to obtain at least one preselected frame; next, the group to which each preselected frame in the at least one preselected frame belongs is determined, and at least one preselected frame group is obtained. The embodiment of the invention obtains at least one preselected frame group by determining the group to which each preselected frame belongs. Because the preselected frames in the same preselected frame group belong to the same detection object, preselected frames belonging to different detection objects are distinguished by the preselected frame groups, and the preselected frames of an occluded object are prevented from being removed as redundant preselected frames of the occluding object during deduplication. This solves the technical problem in the prior art that objects of the same class are easily missed when object detection is performed under dense occlusion, realizes the detection of one or more detection objects in the image to be processed, and effectively avoids missed detection of the detection objects.
In addition, in a dense object occlusion scene, the overlap between the complete frames of the occluding object and the occluded object is high, the complete frames of different detection objects cannot be effectively distinguished only by information such as the position and size of the complete frames, the grouping effect is poor, and the complete frames therefore cannot be effectively deduplicated. In the embodiment of the invention, the relevance modeling model is realized by a neural network; at least one preselected frame is input into the relevance modeling model, and the preselected frames are effectively grouped by utilizing the feature information of the image inside each preselected frame and the position information of the preselected frame, so that preselected frames of different detection objects can be effectively distinguished. In particular, preselected frames which are adjacent in position and similar in size but belong to different detection objects can be accurately grouped in a dense object occlusion scene where the complete frames of the occluding object and the occluded object overlap heavily.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings.
As is apparent from the above description, in the present embodiment, an image to be processed including one or more detection objects is first acquired. Then, object detection can be performed on the image to be processed to obtain at least one pre-selection frame.
In an alternative embodiment, in step S204, the object detection on the image to be processed to obtain at least one pre-selection frame includes the following steps:
step S2041, inputting the image to be processed into a feature pyramid network for processing to obtain a feature pyramid;
step S2042, processing the feature pyramid by using a region candidate network RPN (Region Proposal Networks) model to obtain at least one preselected frame, where each preselected frame in the at least one preselected frame carries an attribute tag, and the attribute tag is used to determine the type to which each preselected frame belongs, where the type includes a complete frame and a visible frame.
As can be seen from the above description, in the embodiment of the present invention, the feature pyramid network is used to generate a feature pyramid. A basic network model such as the VGG (Visual Geometry Group) 16 model, ResNet or FPN (Feature Pyramid Networks) can be selected as the feature pyramid network. In this embodiment, the image to be processed may be input to the feature pyramid network for processing, so as to obtain the feature pyramid.
Before the feature pyramid is processed by using the region candidate network RPN (Region Proposal Networks) model, the region candidate network RPN model needs to be trained with a preset training set. The preset training set includes a plurality of training samples, and each training sample includes a training image and its corresponding image label. The image label is used for marking the type of the preselected frames in the training image, and the type includes a complete frame or a visible frame. The invention can train the RPN model by using the plurality of training samples, so that the RPN model can recognize the type of the preselected frames in an image.
After the basic network model and the regional candidate network RPN model are trained by using the preset training set, the feature pyramid can be processed by using the trained regional candidate network RPN model to obtain at least one preselected frame and an attribute label of each preselected frame, wherein the attribute label is used for representing whether the preselected frame is a visible frame or a complete frame.
Specifically, the attribute tag may be represented as "1" or "2", for example, "1" indicates that the preselected frame is a visible frame and "2" indicates that the preselected frame is a complete frame. Besides "1" and "2", other machine-recognizable data may be selected as the attribute tag, which is not specifically limited in this embodiment.
In this embodiment, a more accurate preselected frame detection result can be obtained by processing the feature pyramid through the RPN model of the regional candidate network.
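As a rough illustration of this stage only, a recent version of torchvision exposes an FPN-backed Faster R-CNN whose backbone produces a feature pyramid and whose RPN produces region proposals. The attribute tag (visible frame versus complete frame) described above is specific to the preselected-frame detection network of this disclosure and is not produced by the stock torchvision model, so the sketch below only shows the generic feature-pyramid/RPN part, under the stated assumptions.

```python
import torch
import torchvision

# Generic FPN + RPN proposal generation (not the patent's own network):
# the backbone yields a feature pyramid and the RPN turns it into candidate boxes;
# the visible/complete attribute tag would require an additional, custom head.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
model.eval()

image = torch.rand(3, 480, 640)                   # stand-in for the image to be processed
with torch.no_grad():
    images, _ = model.transform([image])          # resize/normalize into an ImageList
    features = model.backbone(images.tensors)     # dict of feature-pyramid levels
    proposals, _ = model.rpn(images, features)    # list with one (N, 4) proposal tensor
print(proposals[0].shape)
```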
After obtaining more accurate detection results of the preselected frames, the grouping to which each preselected frame in the at least one preselected frame belongs can be determined, so as to obtain at least one preselected frame group.
In an alternative embodiment, in step S206, determining the group to which each of the at least one preselected frame belongs through the relevance modeling model, and obtaining at least one preselected frame group includes the following steps:
step S11, obtaining the attribute feature vector of each preselected frame in the at least one preselected frame through the example attribute feature projection network of the relevance modeling model;
step S12, determining, by the clustering module of the relevance modeling model, a group to which each preselected frame in the at least one preselected frame belongs based on the attribute feature vector of each preselected frame, to obtain the at least one preselected frame group.
In the embodiment of the present invention, the relevance modeling model may be an Associative Embedding model. The example attribute feature projection network in the relevance modeling model may be an embedding encoding network. At least one preselected frame is input into the embedding encoding network of the relevance modeling model, and a corresponding attribute feature vector is regressed for each preselected frame, where each preselected frame corresponds to one attribute feature vector. Then, the preselected frames of the same detection object are divided into the same group by the clustering module according to the attribute feature vectors, and different groups correspond to different detection objects.
Before the Associative Embedding model is used to determine the group to which each preselected frame belongs, the embedding encoding network in the Associative Embedding model needs to be trained, in order to determine what kind of attribute feature vectors the embedding encoding network outputs. In the training process, the constraint condition on the attribute feature vectors is the distance between attribute feature vectors, which may be a Euclidean distance, a cosine distance, or the like. The distance between the attribute feature vectors of preselected frames belonging to the same detection object is shortened through a first constraint condition, so that preselected frames belonging to the same detection object are added to the same group according to their attribute feature vectors; the distance between the attribute feature vectors of preselected frames belonging to different detection objects is increased through a second constraint condition, so that preselected frames belonging to different detection objects are added to different groups according to their attribute feature vectors. Specifically, the first constraint may be an Lpull loss function, and the second constraint may be an Lpush loss function. The embedding encoding network may first be trained to pull distances in with the Lpull loss function and then trained to push distances apart with the Lpush loss function; the embedding encoding network may also be trained with the Lpull loss function and the Lpush loss function simultaneously.
The Lpull loss function and the Lpush loss function are given in the original publication as formula images, which are not reproduced here. In both formulas, M is the number of attribute feature vectors, ek and ej each represent an arbitrary attribute feature vector, Cm denotes the attribute feature vectors corresponding to one detection object (and their number), and Δ is a preset distance value used in the Lpush loss function.
After the embedding encoding network training is completed and the preselected frames are obtained through the regional candidate network RPN model, the attribute feature vector of each preselected frame, i.e. an embedding value, is obtained through the embedding encoding network. The embedding value obtained for each preselected frame may be an N-dimensional vector, which may be written as e = (e1, e2, ..., eN).
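Since the exact formula images are not reproduced, the following is a hedged LaTeX reconstruction that uses the standard associative-embedding form consistent with the variable definitions above (M vectors, per-object sets Cm, margin Δ); the precise form in the granted claims may differ.

```latex
% Standard associative-embedding pull/push losses, written as an assumption
% consistent with the definitions in the text (not copied from the formula images).
\begin{align}
L_{\mathrm{pull}} &= \frac{1}{M}\sum_{m}\frac{1}{|C_m|}\sum_{e_k \in C_m}\left\lVert e_k - \bar{e}_m\right\rVert^2,
  \qquad \bar{e}_m = \frac{1}{|C_m|}\sum_{e_j \in C_m} e_j \\
L_{\mathrm{push}} &= \frac{1}{M}\sum_{m}\sum_{m' \neq m}\max\!\left(0,\ \Delta - \left\lVert \bar{e}_m - \bar{e}_{m'}\right\rVert\right)
\end{align}
```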
in the embodiment of the present invention, the purpose of obtaining the attribute feature vector is to distinguish different object instances (instances, that is, detection objects) in the preselection frame, the feature vector needs to have a distinguishing capability at an instance level, and can distinguish each detection object, not only a distinguishing capability at a category level (distinguish the types of the detection objects), so that there is a certain requirement on the selection of the feature extraction network, and the attribute feature vector embedding value obtained by the instance attribute feature projection network has a good distinguishing capability at an instance level.
In addition, the attribute feature vector (embedding encoding) is generated by optimizing directly according to the actual association relations among the preselected frames, using the associative modeling model with grouping relations, and it is optimized directly for the preselected-frame grouping task, so a more direct and better performance improvement can be obtained.
Further, the example attribute feature projection network is realized by a neural network, and can be fused with a detection network (such as a feature pyramid network and a regional candidate network RPN) of a preselected frame, and the two networks share basic features of the network, so that the calculation amount is reduced. In addition, the method can be directly combined with the example attribute feature projection network in the process of training the detection network of the preselection frame, so that the joint training of the two integral networks is realized, other external information is not required to be added, and the training process is simpler.
Further, after the N-dimensional vectors are obtained, whether two different preselected frames belong to the same group or not can be judged by comparing the magnitudes of euclidean distances between the N-dimensional vectors of the two different preselected frames, that is, whether the two different preselected frames belong to the same detection object or not is determined.
The magnitude of the euclidean distance between two N-dimensional vectors can be determined by setting a preset threshold. For example, for a preset threshold value x, if the euclidean distance between the N-dimensional vectors of two different preselected boxes is less than x, the distance between the two preselected boxes is considered to be small, and they are considered to belong to the same group.
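A tiny numeric illustration of this same-group test follows; the threshold x and the example embedding vectors are assumed values for illustration only.

```python
import numpy as np

# Two preselected boxes are tentatively grouped when the Euclidean distance between
# their N-dimensional embedding vectors falls below the preset threshold x.
x = 0.5                                   # assumed preset threshold
e1 = np.array([0.12, 0.90, 0.33])         # assumed embedding of preselected box 1
e2 = np.array([0.10, 0.88, 0.35])         # assumed embedding of preselected box 2
same_group = np.linalg.norm(e1 - e2) < x
print(same_group)                         # True for these example vectors
```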
For other pre-selection frames, the groups to which the pre-selection frames belong are determined in the manner described above, and are not described one by one here.
Through the processing mode, the detection object to which each pre-selection frame belongs can be accurately determined, so that the probability of missed detection of the detection object is further reduced.
Optionally, the step in which the clustering module of the relevance modeling model determines the group to which each preselected frame in the at least one preselected frame belongs based on the attribute feature vector of each preselected frame to obtain the at least one preselected frame group may be implemented by the following steps:
step S1, calculating vector distance values between any two attribute feature vectors to obtain a plurality of vector distance values;
step S2, adding two pre-selected frames smaller than a preset threshold value in the vector distance values to the same group, and taking each other pre-selected frame not added to the group as a group;
and step S3, clustering and grouping the obtained at least one group through a clustering algorithm to obtain the at least one preselected frame group.
In the embodiment of the invention, the attribute feature vectors are regressed for all the preselected frames by using the embedding encoding network described above, and the vector distance value between any two attribute feature vectors is calculated; the vector distance values can be calculated with a distance metric such as the Euclidean distance.
Then, the obtained vector distance values are compared with a preset threshold value, where the size of the preset threshold value may be determined according to actual needs or experience, and is not specifically limited in this embodiment. If a vector distance value is smaller than the preset threshold value, it can be determined to be a target vector distance value, and the two preselected frames corresponding to the target vector distance value are considered to correspond to the same detection object, so the two preselected frames corresponding to the target vector distance value are added to the same group. Each preselected frame whose attribute feature vector is at a distance of not less than the preset threshold value from every other attribute feature vector is taken as a group on its own. In this way, at least one group can be obtained.
It should be noted that if two preselected-frame pairs corresponding to two different target vector distance values share the same preselected frame, i.e., the two different target vector distance values correspond to three different preselected frames in total, the three different preselected frames may be added to the same group.
After obtaining the at least one group, clustering the obtained at least one group by a clustering algorithm.
It should be noted that the clustering algorithm may be a commonly used algorithm, and for example, may be a K-means clustering algorithm (K-means) or a mean shift clustering algorithm, etc.
For example, suppose there are preselected frames f1 to f8 and four detection objects A, B, C and D in the image to be processed. The embedding encoding network is used to regress an attribute feature vector, i.e. an embedding value, for each of the preselected frames f1-f8. The vector distance values between any two attribute feature vectors are calculated, and the target vector distance values smaller than the preset threshold value screened out from the plurality of vector distance values are, in order, s1-s4, where s1 is the vector distance value between preselected frames f1 and f2, s2 is the vector distance value between preselected frames f2 and f3, s3 is the vector distance value between preselected frames f4 and f5, and s4 is the vector distance value between preselected frames f5 and f8. According to this information, preselected frames f1 and f2 corresponding to the target vector distance value s1 are added to the same group, and preselected frames f2 and f3 corresponding to the target vector distance value s2 are added to the same group; because f1 and f2 are already in the same group, and f2 and f3 are also in the same group, preselected frames f1, f2 and f3 are in the same group, and similarly preselected frames f4, f5 and f8 are in the same group. Since the vector distance value between the attribute feature vector of preselected frame f6 (or f7) and every other feature vector is not less than the preset threshold, preselected frames f6 and f7 are each taken as a group on their own. The grouping result thus comprises four groups in total: one group includes preselected frames f1, f2 and f3; one group includes preselected frames f4, f5 and f8; one group includes preselected frame f6; and one group includes preselected frame f7. Clustering and grouping are then carried out on the obtained 4 groups, so that 4 preselected frame groups can be obtained.
For another example, suppose an image to be processed includes four preselected frames f1-f4 and three detection objects A, B and C. The embedding encoding network is used to regress an attribute feature vector, i.e. an embedding value, for each of the preselected frames f1-f4, giving a1, a2, a3 and a4 respectively. If the Euclidean distance between the vector a1 and the vector a4 is smaller than the preset threshold, the vectors a1 and a4 are considered to belong to the same detection object among A, B and C. If the vector distance values between a1 and a2, between a1 and a3, and between a2 and a3 are all not less than the preset threshold, the vectors a1, a2 and a3 are considered not to belong to the same detection object; and if the vector distance values between a2 and a4 and between a3 and a4 are also not less than the preset threshold, it can be determined that the vector a2 belongs to one detection object among A, B and C, and the vector a3 belongs to a detection object different from the one to which a2 belongs and also different from the one corresponding to the vectors a1 and a4. That is, the resulting grouping result may be: a1 and a4 belong to A, a2 belongs to B, and a3 belongs to C.
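The grouping behavior walked through in the two examples above (boxes whose embedding distance falls below the threshold are merged into the same group, merging is applied transitively so that f1~f2 and f2~f3 put f1, f2 and f3 into one group, and every remaining box forms its own group) can be sketched as follows; the threshold value and the example embeddings are assumptions, and the subsequent clustering-algorithm step of step S3 (e.g. K-means or mean shift) is not shown.

```python
import numpy as np

def group_by_embedding(embeddings, threshold):
    """Group preselected boxes whose attribute feature vectors are close.

    embeddings: (M, N) array, one N-dimensional embedding value per preselected box.
    Returns a list of groups, each a list of box indices (merging is transitive).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    m = len(embeddings)
    parent = list(range(m))               # union-find forest over box indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(m):
        for j in range(i + 1, m):
            if np.linalg.norm(embeddings[i] - embeddings[j]) < threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(m):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Example with 8 boxes (indices 0..7 standing for f1..f8) and assumed 1-D embeddings:
e = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.0], [12.0], [5.2]])
print(group_by_embedding(e, threshold=0.3))
# -> [[0, 1, 2], [3, 4, 7], [5], [6]]  (f1-f3 together, f4/f5/f8 together, f6 and f7 alone)
```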
After determining the grouping of each preselected frame in the at least one preselected frame to obtain at least one preselected frame group, performing deduplication processing on each preselected frame group to obtain a preselected frame group after deduplication processing; and determining a target detection frame of each detection object based on the preselected frame group after the deduplication processing.
As can be seen from the above description, each preselected frame group may include a visible frame group and a complete frame group. Based on this, in step S208, performing deduplication processing on each preselected frame group to obtain the deduplicated preselected frames includes: performing deduplication processing on the visible frame group in the at least one preselected frame group to obtain a deduplicated visible frame group, where the deduplicated visible frame group may include one visible frame or a group of visible frames.
The step S210 of determining the target detection frame of each detection object based on the preselected frame group after the deduplication processing includes: and determining a target detection frame of each detection object based on the visible frame group and the complete frame group after the de-duplication processing.
Specifically, in the present embodiment, first, an image to be processed including one or more detection objects is acquired; then, carrying out object detection on the image to be processed to obtain at least one preselection frame; then, determining the grouping of each preselected frame in the at least one preselected frame to obtain at least one preselected frame group; next, performing duplicate removal processing on the visible frame group in at least one preselected frame group to obtain a visible frame group after the duplicate removal processing; finally, a target detection frame of each detection object is determined based on the visible frame group and the complete frame group after the deduplication processing.
As can be seen from the above description, in the embodiment of the present invention, since the detection objects identified by the embodiment of the present invention may densely exist in the image to be processed, and thus the coincidence degree of the complete frames of the detection objects is high, in order to reduce the complexity of the deduplication, only the visible frame group in the preselected frame group may be subjected to the deduplication processing. Then, the target detection frame of each detection object can be determined according to the visible frame group after the duplication removal and the complete frame group without the duplication removal.
Specifically, in this embodiment, the visible frame group after the deduplication and the complete frame group without the deduplication may be input into the R-CNN model for object detection, so as to obtain a target detection frame of each detection object.
It should be noted that, in the embodiment of the present invention, when object detection is performed again according to the input of the R-CNN model, which is the visible frame group after the duplication removal and the complete frame group without the duplication removal, for an occluded object, only the visible frame group or the complete frame group may be used as the input of the R-CNN model, so as to improve detection efficiency, and the visible frame group and the complete frame group may also be used as the input of the R-CNN model together, so as to improve detection accuracy, which is not specifically limited in this embodiment.
Optionally, in this embodiment, the step of performing deduplication processing on the visible frame group in the at least one preselected frame group to obtain a visible frame group after the deduplication processing includes: and carrying out deduplication processing on the visible frame group in the at least one preselected frame group by using a non-maximum suppression algorithm to obtain the visible frame group after the deduplication processing.
In the embodiment of the present invention, a non-maximum suppression algorithm (nms) is used to remove redundant preselected frames from the preselected frame group, and the visible frame group in the preselected frame group is subjected to deduplication processing by setting a threshold in the nms algorithm. After the preselected frame group of each detection object is obtained, the complete frames are not subjected to de-duplication treatment because the coincidence degree of each complete frame in the complete frame group is high. Therefore, only the visible frame group is subjected to the deduplication processing by using the nms algorithm, and the visible frame group after the deduplication processing is obtained. That is, in this embodiment, after obtaining the preselected frame group of the detection object, if the preselected frame group includes the visible frame group and the complete frame group, the visible frame group of the detection object may be subjected to the deduplication processing.
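A minimal, self-contained sketch of applying NMS to the visible frame group of one detection object is given below; the box layout [x1, y1, x2, y2], the scores and the IoU threshold are illustrative assumptions, and in practice a library routine such as torchvision.ops.nms could be used instead. Because NMS is run inside each visible frame group separately, visible frames of different detection objects cannot suppress one another.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Deduplicate one visible frame group with non-maximum suppression.

    boxes:  (K, 4) array of [x1, y1, x2, y2] visible frames of one detection object.
    scores: (K,) confidence scores of the frames.
    Returns the indices of the frames kept after deduplication.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection-over-union between the top box and the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]
    return keep

# One visible frame group: the first two boxes overlap heavily, the third does not.
visible_group = [[10, 10, 60, 120], [12, 11, 62, 118], [200, 40, 260, 160]]
print(nms(visible_group, [0.9, 0.8, 0.7]))  # -> [0, 2]
```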
It should be noted that fig. 3 is a schematic diagram of visible frames and complete frames of densely occluded objects of the same class. In fig. 3, frame No. 1 and frame No. 3 on the left side are the complete frames of an occluding object P and an occluded object Q, respectively. In human body detection for densely occluded people, using the NMS algorithm alone only deduplicates the preselected frames of all detection objects of the same class and cannot properly distinguish instances (different detection objects); the intersection-over-union between frame No. 1 and frame No. 3 is generally greater than the threshold preset in NMS, which leads to two problems: if the threshold value is too high, the redundant preselected frames cannot be effectively deduplicated; if the threshold is too low, frame No. 3 of the occluded object Q behind is easily deleted, which results in missed detection of the occluded object Q.
The same problem occurs between frame No. 5 and frame No. 6 on the right. The overlap between frame No. 2, which encloses the visible part of the occluded object Q, and frame No. 1 of the occluding object P is obviously smaller than the overlap between frame No. 3 and frame No. 1, so the occluding object P and the occluded object Q can be distinguished through frame No. 2; frame No. 2, serving as the visible frame, and frame No. 3, serving as the complete frame, are bound into one preselected frame group, which avoids frame No. 3 being removed as redundancy of the occluding object P during deduplication.
Through the deduplicated visible frame group and the complete frame group, the calculation process can be simplified, and the calculation speed and accuracy of the R-CNN model can be improved, so that a more accurate target detection frame can be obtained.
Optionally, in this embodiment, the step of determining the target detection frame of each detection object based on the visible frame group and the complete frame group after the deduplication processing includes:
step S21, performing local feature alignment processing on each visible frame in the visible frame group after the deduplication processing, and performing local feature alignment processing on each complete frame in the complete frame group;
step S22, inputting the visible frame after the feature alignment processing and the complete frame after the feature alignment processing into a target object detection model for detection processing to obtain the position coordinate and the classification probability value of the visible frame after the feature alignment processing and obtain the position coordinate and the classification probability value of the complete frame after the feature alignment processing;
step S23, determining a target detection box of each detection object based on the target position coordinates and the target classification probability value, wherein the target position coordinates include: the position coordinates of the visible frame after the feature alignment processing and/or the position coordinates of the complete frame after the feature alignment processing, and the target classification probability value include: a classification probability value of a visible box after the feature alignment process and/or a classification probability value of a full box after the feature alignment process.
In the embodiment of the invention, each visible frame in the visible frame group and each complete frame in the complete frame group are first subjected to local feature alignment processing. The purpose of the local feature alignment processing is to resize each visible frame in the visible frame group and each complete frame in the complete frame group to the same size.
Optionally, the target object detection model may be an R-CNN model. After local feature alignment processing is performed on the visible frame group after the deduplication processing and on each complete frame in the complete frame group, the aligned visible frames and the aligned complete frames can be used to determine the target detection frame of the detection object to which they correspond.
Optionally, the aligned visible frames and/or the aligned complete frames may be used as the input of a target object detection model (e.g., an R-CNN model); after the detection processing of the target object detection model, the position coordinates and the classification probability value of each visible frame and the position coordinates and the classification probability value of each complete frame are obtained respectively.
Because the detection object to which each visible frame or complete frame belongs is already determined, the visible frames or complete frames belonging to each detection object can be fused respectively according to the target position coordinates and the target classification probability values, and the fused visible frame or fused complete frame is the target detection frame of the corresponding detection object. For a detection object that is not occluded, the target detection frame is the final complete frame of the detection object, obtained by fusing one or more complete frames; for an occluded detection object, the target detection frames are the final complete frame and the final visible frame of the occluded detection object, where the final visible frame is obtained by fusing one or more visible frames. That is, the complete frames and the visible frames of the occluded detection object are fused respectively to obtain the final complete frame and the final visible frame.
It should be noted that only the visible frames after the feature alignment processing, only the complete frames after the feature alignment processing, or both the visible frames and the complete frames after the feature alignment processing may be used as the input of the target object detection model, which is not specifically limited in this embodiment.
Optionally, in this embodiment, in step S23, the determining the target detection box of each detection object based on the target position coordinates and the target classification probability value includes the following steps:
step S231, using the target classification probability value as the weight of the corresponding target position coordinate;
step S232, calculating a weighted average value of the target position coordinates of each detection object according to the target classification probability values to obtain a target detection frame of the detection object; the target detection frame comprises a final visible frame and/or a final complete frame.
In the embodiment of the invention, the target position coordinates of a visible frame represent the corresponding position information of the visible frame in the image to be processed, and the target classification probability value of the visible frame represents the evaluation of the detection processing result of the visible frame. Likewise, the target position coordinates of a complete frame represent the corresponding position information of the complete frame in the image to be processed, and the target classification probability value of the complete frame represents the evaluation of the detection processing result of the complete frame. The higher the target classification probability value, the better the detection processing result of the visible frame or complete frame, so a higher weight should be given to it. The target classification probability value can therefore be used as the weight, and a weighted average of the target position coordinates is calculated to obtain the target detection frame of the detection object. The target detection frame obtained by this weighted-average method fuses the detection processing evaluation results of all the visible frames or complete frames, so the position of the target detection frame is closer to the actual position of the detection object.
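As a concrete illustration of steps S231 and S232, the following sketch fuses the boxes of one detection object by a probability-weighted average of their coordinates; the function name, the (x1, y1, x2, y2) box layout and the example values are illustrative assumptions, not values taken from the embodiment.

```python
import numpy as np

def fuse_boxes(boxes, probs):
    """Weighted-average fusion of all boxes belonging to one detection object.

    boxes: array of shape (n, 4) holding (x1, y1, x2, y2) coordinates.
    probs: array of shape (n,) holding the classification probability of each box,
           used directly as the weight of its coordinates.
    Returns one fused box of shape (4,).
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    probs = np.asarray(probs, dtype=np.float64)
    weights = probs / (probs.sum() + 1e-9)       # normalize so the weights sum to 1
    return (weights[:, None] * boxes).sum(axis=0)

# Example: two visible boxes of the same occluded object are fused into its final visible frame.
final_visible = fuse_boxes([[10, 20, 50, 80], [12, 22, 52, 78]], [0.9, 0.6])
```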
It should be noted that the target detection frame is the precise visible frame or the precise complete frame finally obtained for the detection object, where the precise visible frame is the minimum enclosing frame that accurately describes the maximum visible area of the occluded detection object.
Optionally, in this embodiment, if the feature pyramid includes a plurality of feature maps, performing local feature alignment processing on each visible frame in the visible frame group after the deduplication processing includes the following steps:
step S31, selecting a first target feature map in the feature pyramid;
step S32, feature clipping is carried out on the first target feature map in the feature pyramid based on each visible frame in the visible frame group after the deduplication processing to obtain a first clipping result, and local feature alignment processing is carried out on the first clipping result.
In this embodiment of the present invention, the first target feature map refers to the feature map in the feature pyramid that corresponds to a visible frame in the visible frame group. The feature pyramid comprises feature maps of different scales, and these feature maps are obtained by scaling the image to be processed in different proportions through the feature pyramid network.
After the first target feature map corresponding to the visible frame is determined, the visible frame may be scaled according to the scaling ratio of the first target feature map relative to the image to be processed, and the position of the scaled visible frame is determined in the first target feature map. The features and their position information in the first target feature map corresponding to this position are then obtained as the first clipping result. Local feature alignment processing is performed on the first clipping result, and the aligned first clipping result is input into the target object detection model for object detection.
It should be noted that the features corresponding to the visible frame may be cropped out by using the ROI Align module in Mask R-CNN, and the first clipping result may then be subjected to further local feature alignment processing by the R-CNN model.
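As an illustrative sketch only (assuming PyTorch/torchvision, which the patent does not mandate), the cropping and local feature alignment of a visible frame on one pyramid level could look like the following; the tensor shapes, the stride of 8 and the 7x7 output size are assumptions chosen for the example.

```python
import torch
from torchvision.ops import roi_align

# feature_map: the first target feature map selected from the pyramid, shape (1, C, H, W).
# visible_boxes: (K, 4) boxes in image coordinates, format (x1, y1, x2, y2).
feature_map = torch.randn(1, 256, 100, 168)            # e.g. a stride-8 level of an 800x1344 image
visible_boxes = torch.tensor([[120.0, 80.0, 360.0, 640.0]])

# Prefix each box with its batch index, since roi_align expects rows of (batch_idx, x1, y1, x2, y2).
rois = torch.cat([torch.zeros(len(visible_boxes), 1), visible_boxes], dim=1)

# spatial_scale rescales the boxes from image coordinates to feature-map coordinates (1/8 here),
# and output_size fixes every cropped feature to the same 7x7 size, i.e. the local feature alignment.
aligned = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 8, aligned=True)
print(aligned.shape)  # torch.Size([1, 256, 7, 7])
```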
Optionally, in this embodiment, if the feature pyramid includes a plurality of feature maps, performing local feature alignment processing on each complete frame in the complete frame group includes the following steps:
step S41, selecting a second target feature map in the feature pyramid;
step S42, feature clipping is carried out on the second target feature map in the feature pyramid based on each complete frame in the complete frame group to obtain a second clipping result;
and step S43, performing local feature alignment processing on the second clipping result.
In this embodiment of the present invention, the second target feature map refers to the feature map in the feature pyramid that corresponds to a complete frame in the complete frame group. The feature pyramid comprises feature maps of different scales, which are obtained by scaling the image to be processed in different proportions. After the second target feature map corresponding to the complete frame is determined, the complete frame is scaled according to the scaling ratio of the second target feature map relative to the image to be processed, the position of the scaled complete frame is determined in the second target feature map, and the features and their position information in the second target feature map corresponding to this position are obtained as the second clipping result. Local feature alignment processing is performed on the second clipping result before it is input into the target object detection model.
It should be noted that the features corresponding to the complete frame may be cropped out by using the ROI Align module in Mask R-CNN, and the second clipping result may then be subjected to further local feature alignment processing by the R-CNN model.
Compared with existing object detection algorithms that only consider detection at the category level, the method provided by the embodiment of the invention can well distinguish and recognize individual detection objects. When multiple objects, in particular objects of the same kind, appear densely and occlude one another, the visible frame and the complete frame are both used as regression targets in the RPN stage, and the generated preselected frames are discriminated by a hidden variable (embedding value) according to the different detection objects they correspond to, so that not only the preselected frames of objects of different categories but also the preselected frames of different detection objects can be distinguished. The R-CNN then performs regression again on the deduplicated result, and the regression results of different detection objects are fused frame by frame to obtain the final detection result. In this way, occluded objects are recognized under dense occlusion, and missed detection of occluded objects is avoided.
Example three:
The embodiment of the present invention further provides an object detection apparatus, which is mainly configured to execute the object detection method provided by the foregoing content of the embodiment of the present invention. The object detection apparatus provided by the embodiment of the present invention is specifically described below.
Fig. 5 is a schematic diagram of an object detection apparatus according to an embodiment of the present invention. As shown in fig. 5, the object detection apparatus mainly includes an image acquisition unit 10, a preselected frame acquiring unit 20, a grouping unit 30, a deduplication unit 40, and a determining unit 50, wherein:
an image acquisition unit 10 for acquiring an image to be processed containing one or more detection objects;
a preselected frame acquiring unit 20, configured to perform object detection on the image to be processed to obtain at least one preselected frame, where the preselected frame includes a visible frame and/or a complete frame, the complete frame is an enclosing frame of an entire detection object, and the visible frame is an enclosing frame of a visible region of each detection object in the image to be processed;
the grouping unit 30 is used for determining the grouping of each preselected frame in the at least one preselected frame through a relevance modeling model to obtain at least one preselected frame group; preselection frames in the same preselection frame group belong to the same detection object;
the duplicate removal unit 40 is configured to perform duplicate removal processing on each preselected frame group to obtain a preselected frame group after the duplicate removal processing;
a determining unit 50, configured to determine a target detection frame of each detection object based on the preselected frame group after the deduplication processing.
In the embodiment of the invention, an image to be processed containing one or more detection objects is first acquired, and object detection is performed on the image to be processed to obtain at least one preselected frame. Next, the grouping to which each of the at least one preselected frame belongs is determined to obtain at least one preselected frame group, and deduplication processing is performed on each preselected frame group to remove redundant preselected frames and obtain the deduplicated preselected frame group. The target detection frame of each detection object is then determined based on the deduplicated preselected frame group, so that the detection of the one or more detection objects in the image to be processed is realized and missed detection of the detection objects is effectively avoided.
Optionally, each of the pre-selected frame groups comprises a visible frame group and a complete frame group; the deduplication unit 40 is further configured to: carrying out duplication removal processing on the visible frame group in the at least one preselected frame group to obtain a visible frame group after duplication removal processing; determining the target detection frame of each detection object based on the preselected frame group after the deduplication processing includes: and determining a target detection frame of each detection object based on the visible frame group and the complete frame group after the de-duplication processing.
Optionally, the pre-selection frame acquiring unit 20 is further configured to: inputting the image to be processed into a feature pyramid network for processing to obtain a feature pyramid; and processing the feature pyramid by using a regional candidate network (RPN) model to obtain at least one preselected frame, wherein each preselected frame in the at least one preselected frame carries an attribute tag, the attribute tag is used for determining the type of each preselected frame, and the type comprises a complete frame and a visible frame.
Optionally, the grouping unit 30 determines, through the relevance modeling model, a grouping to which each preselected frame of the at least one preselected frame belongs, and obtaining at least one preselected frame group includes: obtaining an attribute feature vector of each preselected frame in the at least one preselected frame through an example attribute feature projection network of the relevance modeling model; and determining the grouping of each preselected frame in the at least one preselected frame based on the attribute feature vector of each preselected frame through a clustering module of the relevance modeling model to obtain the at least one preselected frame group.
Optionally, the example attribute feature projection network is obtained through training with an Lpull loss function and an Lpush loss function; the distance between the attribute feature vectors of preselected frames belonging to the same detection object is shortened through the Lpull loss function, and the distance between the attribute feature vectors of preselected frames belonging to different detection objects is lengthened through the Lpush loss function.
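The patent does not give the exact form of these losses; one common formulation of such pull/push objectives (in the style of associative-embedding losses), written here only as an illustrative assumption, is:

```latex
L_{pull} = \frac{1}{N}\sum_{k=1}^{N}\frac{1}{|G_k|}\sum_{i \in G_k}\left\lVert e_i - \bar{e}_k \right\rVert^2,
\qquad
L_{push} = \frac{1}{N(N-1)}\sum_{k=1}^{N}\sum_{\substack{k'=1 \\ k' \neq k}}^{N}\max\left(0,\; \Delta - \left\lVert \bar{e}_k - \bar{e}_{k'} \right\rVert\right)
```

where e_i is the attribute feature vector (embedding) of preselected frame i, G_k is the set of preselected frames belonging to detection object k, ē_k is the mean embedding of that set, N is the number of detection objects, and Δ is a margin; the exact distance measure and margin used by the embodiment are not specified.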
Optionally, the grouping unit 30 calculates, through the clustering module of the relevance modeling model, the vector distance value between any two attribute feature vectors to obtain a plurality of vector distance values; two preselected frames whose vector distance value is smaller than a preset threshold are added to the same group, and each remaining preselected frame that is not added to any group is treated as a group on its own; the obtained at least one group is then clustered and grouped by a clustering algorithm to obtain the at least one preselected frame group. An illustrative sketch of this grouping step is given below.
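A minimal sketch of this threshold-based grouping, under the assumption that a simple transitive merge (union-find) stands in for the unspecified clustering algorithm; the function name and the 0.5 threshold are illustrative.

```python
import numpy as np

def group_preselected_boxes(embeddings, dist_threshold=0.5):
    """Group preselected frames whose attribute feature vectors are close to each other.

    embeddings: (n, d) array, one attribute feature vector per preselected frame.
    Returns a list of groups, each group being a list of preselected-frame indices.
    Frames whose pairwise distance is below dist_threshold end up in the same group
    (transitively); every frame not merged with any other forms its own group.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(embeddings[i] - embeddings[j]) < dist_threshold:
                parent[find(i)] = find(j)   # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```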
Optionally, the deduplication unit 40 is further configured to: and carrying out deduplication processing on the visible frame group in the at least one preselected frame group by using a non-maximum suppression algorithm to obtain the visible frame group after the deduplication processing.
Optionally, the determining unit 50 is further configured to: performing local feature alignment processing on each visible frame in the visible frame group after the deduplication processing; and each complete frame in the complete frame group is subjected to local feature alignment treatment; inputting the visible frame subjected to the feature alignment treatment and the complete frame subjected to the feature alignment treatment into a target object detection model for detection treatment, obtaining the position coordinate and the classification probability value of the visible frame subjected to the feature alignment treatment, and obtaining the position coordinate and the classification probability value of the complete frame subjected to the feature alignment treatment; determining a target detection frame of each detection object based on target position coordinates and a target classification probability value, wherein the target position coordinates comprise: the position coordinates of the visible frame after the feature alignment processing and/or the position coordinates of the complete frame after the feature alignment processing, and the target classification probability value include: a classification probability value of a visible box after the feature alignment process and/or a classification probability value of a full box after the feature alignment process.
Optionally, the determining unit 50 is further configured to: taking the target classification probability value as the weight of the corresponding target position coordinate; calculating a weighted average value according to the target classification probability value and the target position coordinate of each detection object to obtain a target detection frame of the detection object; the target detection frame comprises a final visible frame and/or a final complete frame.
Optionally, the feature pyramid includes a plurality of feature maps, and the determining unit 50 is further configured to: select a first target feature map in the feature pyramid; perform feature clipping on the first target feature map in the feature pyramid based on each visible frame in the visible frame group after the deduplication processing to obtain a first clipping result; and perform local feature alignment processing on the first clipping result.
Optionally, the feature pyramid includes a plurality of feature maps, and the determining unit 50 is further configured to: when performing local feature alignment processing on each complete frame in the complete frame group, select a second target feature map in the feature pyramid; perform feature clipping on the second target feature map in the feature pyramid based on each complete frame in the complete frame group to obtain a second clipping result; and perform local feature alignment processing on the second clipping result.
The device provided by the embodiment of the present invention has the same implementation principle and technical effect as the method embodiments, and for the sake of brief description, reference may be made to the corresponding contents in the method embodiments without reference to the device embodiments.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be construed broadly, e.g., as a fixed connection, a removable connection, or an integral connection; a mechanical connection or an electrical connection; a direct connection, an indirect connection through an intermediate medium, or internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The computer program product of the object detection method provided in the embodiments of the present invention includes a computer-readable storage medium storing a nonvolatile program code executable by a processor, where instructions included in the program code may be used to execute the method described in the foregoing method embodiments, and specific implementation may refer to the method embodiments, and will not be described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. An object detection method, comprising:
acquiring an image to be processed containing one or more detection objects;
performing object detection on the image to be processed to obtain at least one preselection frame, wherein the preselection frame comprises a visible frame and/or a complete frame, the complete frame is an enclosing frame for the whole detection object, and the visible frame is an enclosing frame of a visible area of the detection object in the image to be processed;
determining the grouping of each preselected frame in the at least one preselected frame through a relevance modeling model to obtain at least one preselected frame group; preselection frames in the same preselection frame group belong to the same detection object; each pre-selected frame group comprises a visible frame group and a complete frame group; the relevance modeling model is realized by a neural network, and the grouping is determined according to the characteristic information of the images in the at least one pre-selected frame and the position information of the pre-selected frame;
carrying out duplication removal processing on the visible frame group in the at least one preselected frame group to obtain a visible frame group after duplication removal processing;
and determining a target detection frame of the detected object based on the target visible frame group and the target complete frame group after the de-duplication processing.
2. The method of claim 1, wherein determining the grouping to which each of the at least one preselected boxes belongs through a relevance modeling model, resulting in at least one preselected box group comprises:
obtaining an attribute feature vector of each preselected frame in the at least one preselected frame through an example attribute feature projection network of the relevance modeling model;
and determining the grouping of each preselected frame in the at least one preselected frame based on the attribute feature vector of each preselected frame through a clustering module of the relevance modeling model to obtain the at least one preselected frame group.
3. The method as claimed in claim 2, wherein the example attribute feature projection network is obtained by training of an Lpull loss function and an Lpush loss function;
the distance of the attribute feature vectors of the preselected frames belonging to the same detection object is shortened through the Lpull loss function, and the distance of the attribute feature vectors of the preselected frames belonging to different detection objects is lengthened through the Lpush loss function.
4. The method of claim 2, wherein determining, by the clustering module of the relevance modeling model, the group to which each of the at least one preselected box belongs based on the attribute feature vector of each of the at least one preselected boxes, resulting in the at least one preselected box group comprises:
calculating a vector distance value between any two attribute feature vectors to obtain a plurality of vector distance values;
adding two preselected frames smaller than a preset threshold value in the vector distance values to the same group, wherein each other preselected frame which is not added to the group is independently used as a group;
and clustering and grouping the obtained at least one group through a clustering algorithm to obtain the at least one preselected frame group.
5. The method of claim 1, wherein the step of performing deduplication processing on the set of visible frames in the at least one preselected set of frames to obtain a set of visible frames after deduplication processing comprises:
and carrying out deduplication processing on the visible frame group in the at least one preselected frame group by using a non-maximum suppression algorithm to obtain the visible frame group after the deduplication processing.
6. The method of claim 5, wherein determining the target detection box for each detected object based on the set of visible boxes and the set of full boxes after the deduplication process comprises:
performing local feature alignment processing on each visible frame in the visible frame group after the deduplication processing; and each complete frame in the complete frame group is subjected to local feature alignment treatment;
inputting the visible frame subjected to the feature alignment treatment and the complete frame subjected to the feature alignment treatment into a target object detection model for detection treatment, obtaining the position coordinate and the classification probability value of the visible frame subjected to the feature alignment treatment, and obtaining the position coordinate and the classification probability value of the complete frame subjected to the feature alignment treatment;
determining a target detection frame of each detection object based on target position coordinates and a target classification probability value, wherein the target position coordinates comprise: the position coordinates of the visible frame after the feature alignment processing and/or the position coordinates of the complete frame after the feature alignment processing, and the target classification probability value include: a classification probability value of a visible box after the feature alignment process and/or a classification probability value of a full box after the feature alignment process.
7. The method of claim 6, wherein determining the target detection box for each detected object based on the target location coordinates and the target classification probability value comprises:
taking the target classification probability value as the weight of the corresponding target position coordinate;
calculating a weighted average value according to the target classification probability value and the target position coordinate of each detection object to obtain a target detection frame of the detection object; the target detection frame comprises a target visible frame and/or a target complete frame.
8. The method of claim 1, wherein performing object detection on the image to be processed to obtain at least one preselected frame comprises:
inputting the image to be processed into a feature pyramid network for processing to obtain a feature pyramid;
and processing the feature pyramid by using a regional candidate network (RPN) model to obtain at least one preselected frame, wherein each preselected frame in the at least one preselected frame carries an attribute tag, the attribute tag is used for determining the type of each preselected frame, and the type comprises a complete frame and a visible frame.
9. An object detecting device, comprising:
the image acquisition unit is used for acquiring an image to be processed containing one or more detection objects;
a preselected frame acquiring unit, configured to perform object detection on the image to be processed to obtain at least one preselected frame, where the preselected frame includes a visible frame and/or a complete frame, the complete frame is an enclosing frame for an entire detection object, and the visible frame is an enclosing frame of a visible region of each detection object in the image to be processed;
the grouping unit is used for determining the grouping of each preselected frame in the at least one preselected frame through a relevance modeling model to obtain at least one preselected frame group; preselection frames in the same preselection frame group belong to the same detection object; each pre-selected frame group comprises a visible frame group and a complete frame group; the relevance modeling model is realized by a neural network, and the grouping is determined according to the characteristic information of the images in the at least one pre-selected frame and the position information of the pre-selected frame;
the duplication removing unit is used for carrying out duplication removing processing on the visible frame group in the at least one preselected frame group to obtain a visible frame group after the duplication removing processing;
and the determining unit is used for determining a target detection frame of the detection object based on the target visible frame group and the target complete frame group after the deduplication processing.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any of the preceding claims 1 to 8 when executing the computer program.
11. A computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the method of any of claims 1-8.
CN201910186133.5A 2019-03-12 2019-03-12 Object detection method and device and electronic equipment Active CN109948497B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910186133.5A CN109948497B (en) 2019-03-12 2019-03-12 Object detection method and device and electronic equipment
PCT/CN2019/126435 WO2020181872A1 (en) 2019-03-12 2019-12-18 Object detection method and apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910186133.5A CN109948497B (en) 2019-03-12 2019-03-12 Object detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN109948497A CN109948497A (en) 2019-06-28
CN109948497B true CN109948497B (en) 2022-01-28

Family

ID=67009787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910186133.5A Active CN109948497B (en) 2019-03-12 2019-03-12 Object detection method and device and electronic equipment

Country Status (2)

Country Link
CN (1) CN109948497B (en)
WO (1) WO2020181872A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948497B (en) * 2019-03-12 2022-01-28 北京旷视科技有限公司 Object detection method and device and electronic equipment
CN110532897B (en) * 2019-08-07 2022-01-04 北京科技大学 Method and device for recognizing image of part
CN110827261B (en) * 2019-11-05 2022-12-06 泰康保险集团股份有限公司 Image quality detection method and device, storage medium and electronic equipment
CN111178128B (en) * 2019-11-22 2024-03-19 北京迈格威科技有限公司 Image recognition method, device, computer equipment and storage medium
CN111582177A (en) * 2020-05-09 2020-08-25 北京爱笔科技有限公司 Image detection method and related device
CN112348077A (en) * 2020-11-04 2021-02-09 深圳Tcl新技术有限公司 Image recognition method, device, equipment and computer readable storage medium
CN112699881A (en) * 2020-12-31 2021-04-23 北京一起教育科技有限责任公司 Image identification method and device and electronic equipment
CN113469174A (en) * 2021-04-12 2021-10-01 北京迈格威科技有限公司 Dense object detection method, apparatus, device and storage medium
CN113761245B (en) * 2021-05-11 2023-10-13 腾讯科技(深圳)有限公司 Image recognition method, device, electronic equipment and computer readable storage medium
CN113743333B (en) * 2021-09-08 2024-03-01 苏州大学应用技术学院 Strawberry maturity recognition method and device
CN113987667B (en) * 2021-12-29 2022-05-03 深圳小库科技有限公司 Building layout grade determining method and device, electronic equipment and storage medium
CN115731517B (en) * 2022-11-22 2024-02-20 南京邮电大学 Crowded Crowd detection method based on crown-RetinaNet network
CN117237697B (en) * 2023-08-01 2024-05-17 北京邮电大学 Small sample image detection method, system, medium and equipment
CN117372919A (en) * 2023-09-22 2024-01-09 北京市燃气集团有限责任公司 Third party construction threat detection method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103597514A (en) * 2011-06-10 2014-02-19 松下电器产业株式会社 Object detection frame display device and object detection frame display method
CN106529527A (en) * 2016-09-23 2017-03-22 北京市商汤科技开发有限公司 Object detection method and device, data processing deice, and electronic equipment
CN108399388A (en) * 2018-02-28 2018-08-14 福州大学 A kind of middle-high density crowd quantity statistics method
CN109190458A (en) * 2018-07-20 2019-01-11 华南理工大学 A kind of person of low position's head inspecting method based on deep learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2341231A (en) * 1998-09-05 2000-03-08 Sharp Kk Face detection in an image
US9697599B2 (en) * 2015-06-17 2017-07-04 Xerox Corporation Determining a respiratory pattern from a video of a subject
CN106557778B (en) * 2016-06-17 2020-02-07 北京市商汤科技开发有限公司 General object detection method and device, data processing device and terminal equipment
US10657364B2 (en) * 2016-09-23 2020-05-19 Samsung Electronics Co., Ltd System and method for deep network fusion for fast and robust object detection
CN108960266B (en) * 2017-05-22 2022-02-08 阿里巴巴集团控股有限公司 Image target detection method and device
CN109948497B (en) * 2019-03-12 2022-01-28 北京旷视科技有限公司 Object detection method and device and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103597514A (en) * 2011-06-10 2014-02-19 松下电器产业株式会社 Object detection frame display device and object detection frame display method
CN106529527A (en) * 2016-09-23 2017-03-22 北京市商汤科技开发有限公司 Object detection method and device, data processing deice, and electronic equipment
CN108399388A (en) * 2018-02-28 2018-08-14 福州大学 A kind of middle-high density crowd quantity statistics method
CN109190458A (en) * 2018-07-20 2019-01-11 华南理工大学 A kind of person of low position's head inspecting method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CornerNet: Detecting Objects as Paired Keypoints;Hei Law et al.;《ECCV 2018: Computer Vision》;20181009;765-781 *
Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model;George Papandreou et al.;《Proceedings of the European Conference on Computer Vision (ECCV)》;20180930;269-286 *
基于区域复合概率的行人候选框生成;覃剑 等;《电子学报》;20180731;第46卷(第7期);1719-1725 *

Also Published As

Publication number Publication date
WO2020181872A1 (en) 2020-09-17
CN109948497A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109948497B (en) Object detection method and device and electronic equipment
CN109255352B (en) Target detection method, device and system
CN109376667B (en) Target detection method and device and electronic equipment
CN108256404B (en) Pedestrian detection method and device
CN107358149B (en) Human body posture detection method and device
CN110348294B (en) Method and device for positioning chart in PDF document and computer equipment
CN109815845B (en) Face recognition method and device and storage medium
CN108009466B (en) Pedestrian detection method and device
CN109117773B (en) Image feature point detection method, terminal device and storage medium
CN106845352B (en) Pedestrian detection method and device
CN112200081A (en) Abnormal behavior identification method and device, electronic equipment and storage medium
CN107563299B (en) Pedestrian detection method using RecNN to fuse context information
CN110610202B (en) Image processing method and electronic equipment
CN110688524A (en) Video retrieval method and device, electronic equipment and storage medium
CN111291887A (en) Neural network training method, image recognition method, device and electronic equipment
CN112364846B (en) Face living body identification method and device, terminal equipment and storage medium
CN112396068B (en) Point cloud data processing method and device and electronic equipment
CN111310724A (en) In-vivo detection method and device based on deep learning, storage medium and equipment
CN111461070B (en) Text recognition method, device, electronic equipment and storage medium
CN111932545A (en) Image processing method, target counting method and related device thereof
CN113449690A (en) Method and system for detecting image scene change and electronic equipment
CN110490058B (en) Training method, device and system of pedestrian detection model and computer readable medium
CN112419342A (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN115545103A (en) Abnormal data identification method, label identification method and abnormal data identification device
CN109961103B (en) Training method of feature extraction model, and image feature extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant