CN114373071A - Target detection method and device and electronic equipment - Google Patents

Target detection method and device and electronic equipment

Info

Publication number
CN114373071A
CN114373071A
Authority
CN
China
Prior art keywords
decoupling
characteristic
feature
target
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111422593.7A
Other languages
Chinese (zh)
Inventor
乔李盟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN202111422593.7A priority Critical patent/CN114373071A/en
Publication of CN114373071A publication Critical patent/CN114373071A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks


Abstract

The invention provides a target detection method, a target detection device, and an electronic device. The method is applied to an electronic device on which a target detection model is prestored; the target detection model includes a first feature extractor, a gradient decoupling layer, a region proposal network layer, and a region convolutional neural network layer. The method includes: performing feature extraction on an image to be detected through the first feature extractor to obtain a base feature of the image; processing the base feature through the gradient decoupling layer using a first decoupling parameter and a second decoupling parameter to obtain a first decoupling feature corresponding to the first decoupling parameter and a second decoupling feature corresponding to the second decoupling parameter; processing the first decoupling feature through the region proposal network layer to obtain a region proposal feature; and processing the second decoupling feature and the region proposal feature through the region convolutional neural network layer to obtain a detection result. The method effectively alleviates the over-fitting problem of the model.

Description

Target detection method and device and electronic equipment
Technical Field
The present invention relates to the field of image detection technologies, and in particular, to a target detection method and apparatus, and an electronic device.
Background
Object detection is one of the important research topics in the field of computer vision and has in recent years played an important role in security monitoring, autonomous driving, and other applications. The core task of object detection is to find all objects of interest in an image and to determine the category (classification) and position (localization) of each object, as in pedestrian detection, vehicle detection, and so on. With the continuous development of deep learning and convolutional neural networks, object detection has made great progress. Existing object detection techniques fall mainly into two categories: two-stage detectors and one-stage detectors. A two-stage detector first extracts a large number of pre-selected boxes (region proposals) that may contain targets and then localizes and classifies the targets in those boxes through a convolutional neural network, whereas a one-stage detector predicts object categories and positions directly from the features extracted by the network. Both approaches require a large number of labeled samples to support the optimization of the model; when the number of available samples is small, the model runs a serious risk of over-fitting and cannot be optimized well. Meanwhile, as deep learning models grow larger, the burden of annotating large numbers of samples also grows, and few-shot target detection has therefore emerged as a solution to these problems.
Existing few-shot target detection techniques are mainly applied to two-stage detectors and fall into two classes of solutions: those based on meta-learning and those based on transfer learning. Meta-learning-based solutions generally organize the training data into a series of small-sample detection tasks and use the meta-learning paradigm to learn the ability to solve such tasks; however, they are usually accompanied by a complex training process and data organization, which reduces scene adaptability and increases the pressure of model deployment, and the dependence of meta-learning on the form of the meta-task limits the generalization of the model. Transfer-learning-based methods pre-train on a large dataset with consistent labels and then fine-tune directly on the small-sample dataset; however, to avoid the over-fitting caused by sample shortage, such methods update only part of the parameter space of the whole network in the fine-tuning stage, which reduces the learning capacity of the model, while an overly simple learning strategy prevents the model from fully exploiting the small amount of labeled data, so the performance of the model cannot be guaranteed.
Disclosure of Invention
In view of the above, the present invention provides a target detection method, a target detection apparatus, and an electronic device to alleviate the over-fitting problem of a model in a given field.
In a first aspect, an embodiment of the present invention provides a target detection method. The method is applied to an electronic device on which a target detection model is prestored. The target detection model includes a first feature extractor, a gradient decoupling layer, a region proposal network layer, and a region convolutional neural network layer, and the gradient decoupling layer is configured with a first decoupling parameter corresponding to the region proposal network layer and a second decoupling parameter corresponding to the region convolutional neural network layer. The method includes: performing feature extraction on an image to be detected through the first feature extractor to obtain a base feature of the image; processing the base feature through the gradient decoupling layer using the first decoupling parameter and the second decoupling parameter to obtain a first decoupling feature corresponding to the first decoupling parameter and a second decoupling feature corresponding to the second decoupling parameter, where the two decoupling parameters differ, so that the similarity between the first decoupling feature and the base feature differs from the similarity between the second decoupling feature and the base feature; processing the first decoupling feature through the region proposal network layer to obtain a region proposal feature; and processing the second decoupling feature and the region proposal feature through the region convolutional neural network layer to obtain a detection result for the target object, the detection result including a target box and a classification score corresponding to the target box.
Further, the step of processing the base feature through the gradient decoupling layer using the first decoupling parameter and the second decoupling parameter to obtain the first decoupling feature and the second decoupling feature includes: inputting the base feature into the gradient decoupling layer; multiplying the first decoupling parameter by the base feature through the gradient decoupling layer to obtain a first product, and determining the first product as the first decoupling feature; and multiplying the second decoupling parameter by the base feature through the gradient decoupling layer to obtain a second product, and determining the second product as the second decoupling feature.
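As a rough illustration, the multiplication step above can be sketched as a layer that scales the forward feature, and symmetrically scales the backward gradient, by its decoupling parameter. This is a common way of realizing gradient decoupling; the class name and the concrete parameter values below are illustrative, not taken from the patent.

```python
import numpy as np

class GradientDecouplingLayer:
    """Sketch of a gradient decoupling layer: the decoupling feature is
    the product of the decoupling parameter and the base feature, and the
    gradient flowing back to the feature extractor is scaled likewise."""

    def __init__(self, decouple_param: float):
        self.decouple_param = decouple_param

    def forward(self, base_feature: np.ndarray) -> np.ndarray:
        # Decoupling feature = decoupling parameter * base feature
        return self.decouple_param * base_feature

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        # A small parameter weakens the coupling between the backbone
        # and this branch during back-propagation
        return self.decouple_param * grad_output

base = np.ones((2, 4))                     # stand-in for the base feature map
rpn_gdl = GradientDecouplingLayer(0.75)    # first decoupling parameter (illustrative)
rcnn_gdl = GradientDecouplingLayer(0.1)    # second decoupling parameter (illustrative)
first_feat = rpn_gdl.forward(base)         # fed to the region proposal network
second_feat = rcnn_gdl.forward(base)       # fed to the region convolutional neural network
```

Because the two branches use different parameters, the two decoupling features differ in how closely they resemble the base feature, which is exactly the effect the claim describes.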
Further, the target detection model is trained as follows: acquiring a first initial decoupling parameter and a second initial decoupling parameter; training an initial target detection model on sample images of a first type, and obtaining a first target detection model when a first model convergence condition is met, the objects contained in the first-type sample images being of a different type from the target object; and training the first target detection model on sample images of a second type, and determining the first target detection model that meets a second model convergence condition as the target detection model, the objects contained in the second-type sample images being of the same type as the target object, and the number of first-type sample images being larger than the number of second-type sample images.
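The two-phase procedure above (pre-train on the abundant first-type samples until the first convergence condition holds, then fine-tune on the few second-type samples until the second condition holds) can be outlined as follows. All callables are placeholders for details the patent leaves unspecified.

```python
def train_two_phase(model, base_data, novel_data, fit_step, converged):
    """Sketch of the two-phase few-shot training described above.

    fit_step(model, data) performs one optimisation step and returns the
    updated model; converged(model, phase) checks the per-phase model
    convergence condition. Both are illustrative placeholders."""
    phase_log = []
    for phase, data in (("base", base_data), ("novel", novel_data)):
        # Phase 1 uses first-type (base-class) images, phase 2 the
        # scarce second-type (novel-class) images
        while not converged(model, phase):
            model = fit_step(model, data)
            phase_log.append(phase)
    return model, phase_log

# Toy run: the "model" is an integer step counter, convergence is a
# fixed step budget per phase (purely illustrative).
final_model, phase_log = train_two_phase(
    model=0,
    base_data=None,
    novel_data=None,
    fit_step=lambda m, d: m + 1,
    converged=lambda m, p: m >= (2 if p == "base" else 4),
)
```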
Further, the target detection model further includes a second feature extractor connected to the region convolutional neural network layer, and after the step of obtaining the detection result for the target object, the method further includes: performing feature extraction on the second-type sample images through the second feature extractor to obtain a second-type sample image feature set; performing feature extraction on the image to be detected through the second feature extractor to obtain a predicted feature of the image to be detected; determining a correction classification score from the second-type sample image feature set and the predicted feature; and correcting the classification score corresponding to the target box using the correction classification score to obtain a corrected classification score.
Further, the target detection model further includes another gradient decoupling layer located after the region convolutional neural network layer, and the step of processing the second decoupling feature and the region proposal feature through the region convolutional neural network layer to obtain the detection result for the target object includes: processing the second decoupling feature and the region proposal feature through the region convolutional neural network layer to obtain a regression feature and a classification feature; processing the regression feature through the other gradient decoupling layer using a third decoupling parameter to obtain a target box; processing the classification feature through the other gradient decoupling layer using a fourth decoupling parameter to obtain a classification score corresponding to the target box; and determining the target box and its classification score as the detection result for the target object.
Further, the step of determining the correction classification score from the second-type sample feature set and the predicted feature includes: dividing the second-type sample feature set into a plurality of second-type sample subsets according to the category of each feature, each subset corresponding to one category; determining a target category for the predicted feature; and determining the correction classification score from the predicted feature and the second-type sample subset corresponding to the target category.
Further, the step of determining the correction classification score from the predicted feature and the second-type sample subset corresponding to the target category includes: determining a subset category feature of that subset; and calculating the similarity between the subset category feature and the predicted feature and determining that similarity as the correction classification score.
Further, the similarity between the subset category feature and the predicted feature is the cosine similarity.
Further, the step of correcting the classification score corresponding to the target box using the correction classification score includes: computing a weighted sum of the correction classification score and the original classification score and determining the resulting value as the corrected classification score.
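A minimal sketch of these score-calibration steps follows, assuming the subset category feature is the mean of the class subset and the weighted sum uses equal weights; neither choice is fixed by the patent, so both are assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # The similarity used as the correction classification score
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def subset_category_feature(subset_features: np.ndarray) -> np.ndarray:
    # Assumption: the subset category feature is the mean feature of the
    # second-type sample subset for the target category
    return np.mean(subset_features, axis=0)

def corrected_score(classification_score: float,
                    correction_score: float,
                    weight: float = 0.5) -> float:
    # Assumption: equal-weight sum; the patent only specifies a
    # "weighted sum" of the two scores
    return weight * correction_score + (1.0 - weight) * classification_score
```

For example, a predicted feature pointing in the same direction as the class prototype yields a correction score of 1.0, which then pulls the final score toward full confidence.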
In a second aspect, an embodiment of the present invention further provides an electronic device, which includes a processor and a memory, where the memory stores computer-executable instructions that can be executed by the processor, and the processor executes the computer-executable instructions to implement the object detection method in the first aspect.
In a third aspect, embodiments of the present invention also provide a computer-readable storage medium, where computer-executable instructions are stored, and when being called and executed by a processor, the computer-executable instructions cause the processor to implement the object detection method of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program, and when the computer program is executed by a processor, the object detection method of the first aspect is implemented.
According to the target detection method, the target detection device, and the electronic device provided by the embodiments of the invention, the first feature extractor extracts features from the image to be detected to obtain the base feature of the image; the gradient decoupling layer processes the base feature using the first decoupling parameter and the second decoupling parameter to obtain the first decoupling feature and the second decoupling feature; the region proposal network layer processes the first decoupling feature to obtain a region proposal feature; and the region convolutional neural network layer processes the second decoupling feature and the region proposal feature to obtain the detection result for the target object. In the embodiments of the invention, the extracted base feature is thus decoupled before being input into the region proposal network layer and the region convolutional neural network layer, and different decoupling parameters control the degree of similarity between the base feature and the decoupled features seen by the region proposal network and by the region convolutional neural network.
Additional features and advantages of the disclosure will be set forth in the description which follows, or in part may be learned by the practice of the above-described techniques of the disclosure, or may be learned by practice of the disclosure.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic structural diagram of an electronic system according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a target detection model according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a target detection method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of another object detection model provided in the embodiment of the present invention;
FIG. 5 is a schematic flow chart illustrating another method for detecting a target according to an embodiment of the present invention;
fig. 6 is a schematic flowchart of a training method of a target detection model according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a gradient decoupling layer according to an embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating an operation principle of a target detection model according to an embodiment of the present invention;
fig. 9 is a schematic diagram illustrating an operation principle of a calibration module according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of an object detection apparatus according to an embodiment of the present invention;
FIG. 11 is a schematic view of another target detection apparatus according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In recent years, technical research based on artificial intelligence, such as computer vision, deep learning, machine learning, image processing, and image recognition, has developed rapidly. Artificial Intelligence (AI) is an emerging science and technology that studies and develops theories, methods, techniques, and application systems for simulating and extending human intelligence. Artificial intelligence is a comprehensive discipline involving many technical fields such as chips, big data, cloud computing, the Internet of Things, distributed storage, deep learning, machine learning, and neural networks. Computer vision, an important branch of artificial intelligence, studies how machines can perceive the world, and computer vision technologies generally include face recognition, liveness detection, fingerprint recognition and anti-counterfeiting verification, biometric recognition, face detection, pedestrian detection, target detection, pedestrian recognition, image processing, image recognition, image semantic understanding, image retrieval, character recognition, video processing, video content recognition, behavior recognition, three-dimensional reconstruction, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), computational photography, and robot navigation and positioning.
With the research and progress of artificial intelligence technology, the technology is applied to various fields, such as security, city management, traffic management, building management, park management, face passage, face attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile phone images, cloud services, smart homes, wearable equipment, unmanned driving, automatic driving, smart medical treatment, face payment, face unlocking, fingerprint unlocking, testimony verification, smart screens, smart televisions, cameras, mobile internet, live webcasts, beauty treatment, medical beauty treatment, intelligent temperature measurement and the like.
Current target detection methods struggle to detect targets effectively when only a small amount of sample data is available in an unknown domain and a model must migrate from a known domain to that unknown domain. In view of this, the embodiments of the present invention provide a target detection method, an apparatus, and an electronic device that can alleviate the over-fitting problem of target detection in a given field.
Referring to fig. 1, a schematic diagram of an electronic system 100 is shown. The electronic system can be used for realizing the target detection method and device of the embodiment of the invention.
As shown in fig. 1, an electronic system 100 includes one or more processing devices 102 and one or more memory devices 104. Optionally, electronic system 100 may also include input device 106, output device 108, and one or more image capture devices 110, which may be interconnected via a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic system 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic system may have some of the components in fig. 1, as well as other components and structures, as desired.
The processing device 102 may be a server, a smart terminal, or a device containing a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, may process data for other components in the electronic system 100, and may control other components in the electronic system 100 to perform target detection functions.
Storage 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM) and/or cache memory. Non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on a computer-readable storage medium and executed by processing device 102 to implement the client functionality (implemented by the processing device) of the embodiments of the invention described below and/or other desired functionality. Various applications and data, such as the data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
Image acquisition device 110 may acquire an image to be detected and store the image to be detected in storage 104 for use by other components.
For example, the devices for implementing the object detection method, apparatus and electronic device according to the embodiment of the present invention may be integrally disposed, or may be disposed in a distributed manner, such as integrally disposing the processing device 102, the storage device 104, the input device 106 and the output device 108, and disposing the image capturing device 110 at a designated position where an image can be captured. When the above-described devices in the electronic system are integrally provided, the electronic system may be implemented as an intelligent terminal such as a camera, a smart phone, a tablet computer, a vehicle-mounted terminal, and the like.
Fig. 2 shows a target detection model provided in an embodiment of the present application. The model includes a first feature extractor, a gradient decoupling layer, a region proposal network layer, and a region convolutional neural network layer. The gradient decoupling layer is connected to the region proposal network layer and to the region convolutional neural network layer, and is configured with a first decoupling parameter corresponding to the region proposal network layer and a second decoupling parameter corresponding to the region convolutional neural network layer. Based on the target detection model shown in fig. 2, an embodiment of the present application provides a target detection method, shown in fig. 3. The method is applied to an electronic device in which the target detection model shown in fig. 2 is prestored, and includes the following steps:
S302: performing feature extraction on an image to be detected through a first feature extractor to obtain basic features corresponding to the image to be detected;
the image to be detected is an image containing a target object, and it is understood that the image to be detected may be one image or a sequence of images containing a plurality of images.
The input of the feature extractor is the image to be detected, and its output is the basic features of the image. In some possible embodiments, the module may be a convolutional neural network through which high-dimensional depth features are extracted.
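For intuition only, the simplest possible stand-in for such a convolutional feature extraction is a single 'valid' 2-D convolution over the image; a real first feature extractor would of course be a deep network producing high-dimensional feature maps.

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # One 'valid' 2-D convolution: slides the kernel over the image and
    # sums the element-wise products at each position (no padding)
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```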
S304: processing the basic features by using a first decoupling parameter and a second decoupling parameter through a gradient decoupling layer to obtain a first decoupling feature corresponding to the first decoupling parameter and a second decoupling feature corresponding to the second decoupling parameter; wherein the first and second decoupling parameters are different such that a similarity between the first decoupling characteristic and the base characteristic and a similarity between the second decoupling characteristic and the base characteristic are different;
in the forward propagation process of the target detection model, the basic features pass through a gradient decoupling layer, and the gradient decoupling layer respectively processes the basic features by using a first decoupling parameter and a second decoupling parameter to obtain a first decoupling feature corresponding to the first decoupling parameter and also obtain a second decoupling feature corresponding to the second decoupling parameter. It should be noted that the first decoupling parameter may be one or a parameter set formed by a plurality of parameters, and similarly, the second decoupling parameter may be one or a parameter set formed by a plurality of parameters.
The first decoupling parameter and the second decoupling parameter are unequal and both greater than zero. They are determined by training the target detection model with first-type and second-type sample images, where the number of first-type sample images is larger than the number of second-type sample images and the second-type sample images contain objects of the same type as the target object. The method for determining the decoupling parameters is described in detail later and is not repeated here.
The decoupling transformation may in particular be a scaling transformation, for example a linear transformation y = Ax + b, where the parameters A and b constitute a decoupling parameter set; the (A, b) combination for the first decoupling parameter differs from the (A, b) combination for the second decoupling parameter. Through the different decoupling parameters, the basic features of the image are decoupled from the region proposal network layer and from the region convolutional neural network layer.
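The affine form y = Ax + b can be written directly; the numeric parameter values below are illustrative, not taken from the patent.

```python
import numpy as np

def decouple(base_feature: np.ndarray, A: float, b: float) -> np.ndarray:
    # Affine decoupling transform y = A * x + b; the pair (A, b) forms
    # one decoupling parameter set, and the RPN branch and the RCNN
    # branch each use a different pair
    return A * base_feature + b

first_decoupling_feature = decouple(np.ones(3), 0.75, 0.0)   # RPN branch (illustrative values)
second_decoupling_feature = decouple(np.ones(3), 0.1, 0.0)   # RCNN branch (illustrative values)
```

Because the two (A, b) pairs differ, the two outputs differ in how closely they resemble the base feature, realizing the branch-specific decoupling described above.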
It should be noted that the gradient decoupling layer may be a single layer that outputs the first decoupling feature and the second decoupling feature corresponding to the two decoupling parameters, or it may be two gradient decoupling layers connected to the region proposal network layer and the region convolutional neural network layer respectively: the first gradient decoupling layer is connected to the region proposal network layer and configured with the first decoupling parameter, and the second gradient decoupling layer is connected to the region convolutional neural network layer and configured with the second decoupling parameter.
S306: processing the first decoupling characteristic through a regional suggestion network layer to obtain a regional suggestion characteristic;
The first decoupling feature corresponding to the first decoupling parameter is input into the Region Proposal Network (RPN), which outputs region features of the image to be detected. A region feature can be characterized as a region candidate box, typically represented as a rectangle. The RPN works as follows: an initial anchor point is set on the feature map (for example, a pixel is chosen), a prior pre-selected box is then set based on the initial anchor point, and the pre-selected box is classified and regressed so that it is transformed into a region candidate box reflecting the actual size and position of the target.
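The final regression step, transforming a prior pre-selected box into a region candidate box, is commonly parameterised as below. The patent does not give the exact formula, so this standard anchor-regression transform is an assumption.

```python
import math

def apply_box_deltas(anchor, deltas):
    # anchor = (cx, cy, w, h): centre and size of the prior pre-selected box.
    # deltas = (dx, dy, dw, dh): regression outputs; the centre is shifted
    # proportionally to the box size and the size is scaled exponentially.
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))
```

For instance, a delta of dx = 0.5 on a 4-pixel-wide anchor shifts the candidate box centre 2 pixels to the right while leaving its size unchanged.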
It can be understood that there may be more than one target to be detected in the image; therefore, after the basic feature passes through this module, a plurality of region candidate frames are output, each containing classification information and regression information (i.e., position information). The classification information here only distinguishes foreground from background, with foreground preset to 1 and background to 0; if a certain region candidate frame is foreground, its classification information is 1.
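The regression transform that turns a prior preselected frame into a region candidate frame can be sketched as follows. This is a minimal illustration assuming the common (dx, dy, dw, dh) delta parameterization; the patent does not fix a particular encoding, so the function name and convention here are illustrative.

```python
import math

def decode_box(anchor, deltas):
    """Transform a prior preselected frame (anchor, given as x1, y1, x2, y2)
    into a region candidate frame using predicted regression deltas
    (dx, dy, dw, dh): shift the center, then rescale width and height."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1                      # anchor size
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h          # anchor center
    pcx, pcy = cx + dx * w, cy + dy * h          # shifted center
    pw, ph = w * math.exp(dw), h * math.exp(dh)  # rescaled size
    return (pcx - 0.5 * pw, pcy - 0.5 * ph,
            pcx + 0.5 * pw, pcy + 0.5 * ph)
```

With all-zero deltas the anchor is returned unchanged; non-zero deltas move and resize it toward the actual target size and position.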
S308: processing the second decoupling characteristic and the regional suggestion characteristic through a regional convolutional neural network layer to obtain a detection result corresponding to the target object; the detection result comprises a target frame and a classification score corresponding to the target frame.
The inputs of the Region-based Convolutional Neural Network (RCNN) layer are the depth feature and the region candidate frames; the RCNN adjusts them according to the target task, so that the output detection result is adapted to the actual task. The specific structure and working principle of the RCNN layer may follow a general RCNN network, which is not limited in the embodiment of the present invention.
The output after the RCNN adjustment also includes two branches: a multi-class classification branch (i.e., the classification score) and a target position regression branch (i.e., the target frame); the target classification score and the target frame together form the detection result.
According to the target detection method provided by the embodiment of the invention, feature extraction is performed on the image to be detected by the first feature extractor to obtain the basic features corresponding to the image to be detected; the basic features are processed by the gradient decoupling layer using the first decoupling parameter and the second decoupling parameter, respectively, to obtain a first decoupling feature corresponding to the first decoupling parameter and a second decoupling feature corresponding to the second decoupling parameter; the first decoupling feature is processed by the regional suggestion network layer to obtain a region suggestion feature; and the second decoupling feature and the region suggestion feature are processed by the regional convolutional neural network layer to obtain a detection result. In the embodiment of the invention, the extracted image features are decoupled before being input into the regional suggestion network layer and the regional convolutional neural network layer, and the coupling degree between the image features and the regional suggestion network, and between the image features and the regional convolutional neural network, is controlled through different decoupling parameters. Compared with a traditional target detection method without decoupling processing, this can effectively alleviate the over-fitting problem of a target detection model in a given domain, and further improve the target detection precision.
In some possible embodiments, the first and second decoupling characteristics described above may be determined by:
(1) inputting basic features into a gradient decoupling layer;
(2) multiplying the first decoupling parameter by the basic feature through a gradient decoupling layer to obtain a first product, and determining the first product as a first decoupling feature;
(3) and multiplying the second decoupling parameter by the basic characteristic through the gradient decoupling layer to obtain a second product, and determining the second product as a second decoupling characteristic.
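The three steps above can be sketched in a few lines. This is a minimal sketch assuming scalar decoupling parameters and a basic feature represented as a plain list; the parameter values used are illustrative, not from the patent:

```python
def gradient_decoupling_forward(base_feature, first_param, second_param):
    """Steps (1)-(3): each decoupling feature is the element-wise product of
    a decoupling parameter with the basic feature."""
    first_decoupling = [first_param * v for v in base_feature]    # -> RPN layer
    second_decoupling = [second_param * v for v in base_feature]  # -> RCNN layer
    return first_decoupling, second_decoupling

base = [1.0, 2.0, 4.0]  # toy basic feature
f1, f2 = gradient_decoupling_forward(base, 0.5, 2.0)
```

Because the two parameters differ, the two decoupling features differ in their similarity to the basic feature, which is exactly what controls the coupling degree to the two downstream layers.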
The target detection method provided by the embodiment of the invention can perform target detection on the image to be detected to obtain the detection result containing the target object. In small-sample scenarios, where training is limited by the number of available samples and the model cannot be trained thoroughly, the method further improves the accuracy of the final target detection.
In some possible embodiments, the first decoupling parameter and the second decoupling parameter are determined by training. Specifically, the target detection model may be trained using two different types of sample images, referred to here as first type sample images and second type sample images, where the objects contained in the first type sample images are of a different type from the target object. For example, if the target object is a tiger, the first type sample images may contain cats, mice, and the like, while the second type sample images contain objects of the same type as the target object, that is, each second type sample image contains a tiger. In addition, in an actual application scenario it may be difficult to obtain images of a specific type, so the first type sample images are often relatively easier to obtain than the second type sample images; based on this, the number of first type sample images may be set to be greater than the number of second type sample images. By training the target detection model with these two types of sample images, fully trained first and second decoupling parameters matched with the target object can be obtained. The specific model training process will be described below and is not detailed here.
On the basis of the above object detection model, another object detection model is provided in the embodiments of the present invention, as shown in fig. 4, the model further includes a second feature extractor connected to the above regional convolutional neural network layer. Based on the target detection model shown in fig. 4, another target detection method is provided in the embodiments of the present invention, as shown in fig. 5, the method focuses on further correcting the target classification score obtained by the target detection model, and the method includes the following steps:
s502: performing feature extraction on an image to be detected through a first feature extractor to obtain basic features corresponding to the image to be detected;
s504: processing the basic features by using a first decoupling parameter and a second decoupling parameter through a gradient decoupling layer to obtain a first decoupling feature corresponding to the first decoupling parameter and a second decoupling feature corresponding to the second decoupling parameter;
s506: processing the first decoupling characteristic through a regional suggestion network layer to obtain a regional suggestion characteristic;
s508: processing the second decoupling characteristic and the regional suggestion characteristic through a regional convolutional neural network layer to obtain a detection result corresponding to the target object; the detection result comprises a target frame and a classification score corresponding to the target frame.
Steps S502 to S508 are the same as steps S202 to S208 in the above embodiment, and are not described herein again.
S510: performing feature extraction on the second type sample image through a second feature extractor to obtain a second type sample image feature set;
the second feature extractor may be a feature extractor obtained by pre-training, and specifically may be a general neural network model, such as an ImageNet pre-training model, or may be another neural network model.
S512: performing feature extraction on the image to be detected through a second feature extractor to obtain a prediction feature corresponding to the image to be detected;
s514: determining a correction classification score according to the second type sample image feature set and the prediction features;
s516: and correcting the classification score corresponding to the target frame according to the corrected classification score to obtain the corrected classification score.
Specifically, the correction classification score and the classification score may be weighted and summed, and the resulting sum determined as the corrected classification score. For example, the classification score may be fused with the correction classification score in the following weighted form:

S'_i = α · S_i + (1 − α) · S̃_i

wherein S_i represents the classification score, S̃_i represents the correction classification score, α represents the weight of the classification score, and the result S'_i is the corrected classification score.
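The weighted fusion above can be written directly. In this small sketch, the value of alpha and the (1 − alpha) split for the correction score are illustrative assumptions consistent with the weighted-sum description:

```python
def fuse_scores(classification_score, correction_score, alpha):
    """Fuse the detector's classification score S_i with the metric-learning
    correction score, weighting the classification score by alpha."""
    return alpha * classification_score + (1.0 - alpha) * correction_score
```

With alpha = 1.0 the detector's score is kept unchanged; smaller alpha gives the correction score more influence.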
The embodiment of the invention introduces a correction classification score of relatively high quality, obtained through metric learning, to correct and adjust the relatively low-quality target classification score. Since the correction classification score is produced by a more general model, it generalizes well; this alleviates the poor generalization of detection results extracted by the target detection model when samples are rare in small-sample detection, and further improves the performance of the target detection model and the precision of its output.
In some possible embodiments, the object detection model may further include another gradient decoupling layer; another gradient decoupling layer is positioned behind the regional convolutional neural network layer; the process of determining the detection result corresponding to the target object may specifically be:
processing the second decoupling characteristic and the regional suggestion characteristic through a regional convolutional neural network layer to obtain a regression characteristic and a classification characteristic; processing the regression features by using a third decoupling parameter through another gradient decoupling layer to obtain a target frame; and processing the classification characteristics by using a fourth decoupling parameter through another gradient decoupling layer to obtain a classification score corresponding to the target frame.
In some possible embodiments, the correction classification score in step S514 may be determined as follows:
(1) dividing the second type sample characteristic set into a plurality of second type sample subsets according to the categories corresponding to the characteristics; wherein each second type sample subset corresponds to a type;
It is understood that the number of second type sample images is smaller than the number of first type sample images; here we refer to the second type sample images as small sample images. The small sample images are a set comprising a plurality of sample images, and feature extraction based on them can yield features of a plurality of different types. For the target detection task, only the features corresponding to sample images of the same type as the target object to be detected are useful. Based on this, the embodiment of the present invention first divides the small sample feature set into a plurality of small sample feature subsets according to the category information, where each subset corresponds to a single category. For example, if the small sample images include 10 sample images and the extracted features are found to correspond to two types, namely northeast tiger and panda, the features of the northeast tiger are divided into one feature subset and the features corresponding to the panda into another.
(2) Determining a target category corresponding to the prediction feature;
continuing with the above example, if the present detection is performed to detect the position of the northeast tiger in the image, the following operation may be performed using only the feature subset of the northeast tiger.
(3) And determining a corrected classification score according to the second type sample subset corresponding to the target class and the prediction characteristics.
Specifically, the subset category features of the second type sample subset corresponding to the target category may be determined first, and the similarity between the subset category features and the predicted features may be calculated, and the similarity may be determined as the corrected classification score.
The feature extractor extracts small-sample data features from the input small sample images. It should be noted that the small sample images here are small sample images used for training, that is, small sample training images, and the subset category features are calculated according to the category labels corresponding to the features. For example, if the detection target is the northeast tiger in the image to be detected, and the categories contained in the small sample training images include the northeast tiger and the panda, then in the small sample training images each sample image is labeled with either the category label of the northeast tiger (for example, denoted by 1) or that of the panda (for example, denoted by 2). For images of the same category, the corresponding features form a feature subset of the small sample images; the mean value of the features in the subset is calculated and may be referred to as the subset category feature of that feature subset, which represents what is common to the whole category. The specific formula for calculating the subset category feature of a feature subset is as follows:
p_c = (1 / |S_c|) · Σ_{(x_i, y_i) ∈ S_c} x_i

The above formula sums all sample features x_i whose labels y_i indicate the given category c, and then takes the average. Wherein S_c is the set of all labeled samples (x_i, y_i) contained in category c, and |S_c| is the number of samples in it; x_i and y_i represent a feature vector in the sample set and its corresponding label, respectively.
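The class-wise averaging can be sketched as follows, assuming each feature is a plain list of floats; the feature values and labels below are illustrative:

```python
def subset_category_feature(features, labels, c):
    """Subset category feature (class prototype) for category c: the
    element-wise mean of all sample features whose label equals c."""
    members = [x for x, y in zip(features, labels) if y == c]
    dim = len(members[0])
    return [sum(vec[d] for vec in members) / len(members) for d in range(dim)]

feats = [[1.0, 0.0], [3.0, 2.0], [0.0, 5.0]]
labels = [1, 1, 2]  # 1 = northeast tiger, 2 = panda, as in the example above
proto = subset_category_feature(feats, labels, 1)  # mean of the two tiger features
```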
Further, after the subset category feature of the second type sample subset is calculated, the similarity between the subset category feature and the prediction feature may be calculated, for example, the cosine similarity between the subset category feature and the prediction feature may be calculated, and the similarity may be determined as the corrected classification score. Specifically, the cosine similarity may be calculated using the following formula:
sim(x_i, p_c) = (x_i · p_c) / (‖x_i‖ · ‖p_c‖)

wherein x_i represents a feature vector (here, the prediction feature); p_c represents the prototype vector corresponding to category c, namely the subset category feature calculated above; the double vertical lines denote taking the modular length (L2 norm) of a feature.
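The cosine similarity used as the correction classification score can be computed as below; a minimal sketch with plain-list vectors:

```python
import math

def cosine_similarity(pred_feature, prototype):
    """Cosine similarity between the prediction feature and the subset
    category feature (prototype): dot product over the product of L2 norms."""
    dot = sum(a * b for a, b in zip(pred_feature, prototype))
    norm = (math.sqrt(sum(a * a for a in pred_feature))
            * math.sqrt(sum(b * b for b in prototype)))
    return dot / norm
```

Parallel vectors score 1, orthogonal vectors 0, so the result can be used directly as a score in [−1, 1].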
Fig. 6 is a method for training a target detection model according to an embodiment of the present invention, where the method includes the following steps:
s602: acquiring a first initial decoupling parameter and a second initial decoupling parameter;
s604: training an initial target detection model through a first type of sample image, and obtaining a first target detection model when a first model convergence condition is met; the type of the object contained in the first type sample image is different from that of the target object;
the initial target detection model is a neural network model in which initial parameters are set before training is started, and the initial parameters may be set empirically.
S606: training the first target detection model through the second type of sample image, and determining the first target detection model meeting the second model convergence condition as a target detection model; the type of the object contained in the second type sample image is the same as that of the target object, and the number of the first type sample images is larger than that of the second type sample images.
The training process of the whole model is divided into two stages. The first stage is training performed on a large amount of labeled data from a known domain, namely the first type samples; through this first-stage training, a basic model with target detection capability can be obtained, and this model can complete a basic target detection task.
It should be noted that the two training phases described above use the same loss function, which can be described as:

L = L_rpn(f_rpn(G(x); θ_rpn), y) + L_rcnn(f_rcnn(G(x); θ_rcnn), y)

wherein L_rpn is the loss function of the regional suggestion network layer, L_rcnn is the loss function of the regional convolutional neural network layer, θ denotes the parameters of the corresponding module, G is the gradient decoupling layer, and y is the corresponding label used to supervise network learning. Specifically, y includes a category label and a position label: the category label indicates which kind of object (such as cat, dog, etc.) the current target belongs to, and the position label indicates where the current object is, i.e., (x, y, w, h). The model is supervised and trained through these two kinds of information.
Fig. 7 is a schematic structural diagram of a gradient decoupling layer according to an embodiment of the present invention. As shown in the figure, the gradient decoupling layer is a bidirectional propagation and adjustment module: during forward propagation, the module performs a learnable affine transformation (solid line in fig. 7) on the depth feature x of a sample, for example a linear transformation y = Ax + b, where A and b are learnable parameters. Forward decoupling between the basic features and the downstream network layer is realized through this affine transformation.
In the back propagation process, the input is the back-propagated gradient (dashed line in fig. 7) coming from the regional suggestion network layer and the regional convolutional neural network layer, and the gradient decoupling layer linearly transforms this gradient (for example, multiplies it by a constant λ). Through this linear transformation, the reverse decoupling of the regional suggestion network layer and regional convolutional neural network layer features from the upstream task is completed. Its mathematical expression can be described as:
G(x) = Ax + b (forward),  ∂L/∂x = λ · Aᵀ · ∂L/∂G(x) (backward)

wherein x is the feature input into the module, A and b are the parameters of the affine transformation, λ is the decoupling constant applied to the gradient propagated back from the downstream layer (i.e., the regional suggestion network layer or the regional convolutional neural network layer), and G is the gradient decoupling layer.
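The forward affine transform and the backward gradient scaling can be sketched together. This is a simplified scalar version (the A in the patent may be a matrix), and the parameter values below are illustrative:

```python
class GradientDecouplingLayer:
    """Bidirectional behaviour of the gradient decoupling layer: the forward
    pass applies the learnable affine transform G(x) = A*x + b element-wise,
    and the backward pass scales the gradient from the downstream layer by
    the decoupling constant lambda before chaining it back through A."""

    def __init__(self, A, b, lam):
        self.A, self.b, self.lam = A, b, lam

    def forward(self, x):
        # learnable affine transformation (solid line in fig. 7)
        return [self.A * xi + self.b for xi in x]

    def backward(self, grad_from_downstream):
        # reverse decoupling of the back-propagated gradient (dashed line)
        return [self.lam * self.A * g for g in grad_from_downstream]
```

Setting lambda below 1 weakens the influence of the downstream task's gradient on the shared features, which is the decoupling effect described above.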
The two stages are both provided with gradient decoupling layers for optimization, and the target detection model obtained through the training process can effectively improve the domain transfer capability of the model and reduce the overfitting risk of the model.
Fig. 8 is a schematic diagram of a working principle of a target detection model in an actual application scenario, as shown in fig. 8, the target detection model includes a first feature extractor, two gradient decoupling layers, a regional suggestion network layer, a regional convolution neural network layer, and a correction module.
During training, the sample image is input into the first feature extractor to obtain its basic features, which are then input into the two gradient decoupling layers respectively. In the first gradient decoupling layer, the basic features are scaled with the first decoupling parameter to obtain the first decoupling features; in the second gradient decoupling layer, the basic features are scaled with the second decoupling parameter to obtain the second decoupling features. The first decoupling features are input into the regional suggestion network layer to obtain the region suggestion features; the region suggestion features and the second decoupling features are input into the regional convolutional neural network layer to obtain a predicted detection result. Loss-function-based training is then carried out using the predicted detection result and the label information in the sample image, and the gradient is propagated back to the regional suggestion network layer and the regional convolutional neural network layer; the first and second gradient decoupling layers respectively scale the returned gradient. This process repeats until the training convergence condition is met, at which point training ends and the detection result corresponding to the target object is obtained. The trained target detection model contains the first decoupling parameter and the second decoupling parameter. Further, the target classification score in the detection result is input into the correction module to obtain a corrected target classification score.
For the above-mentioned correction module, an embodiment of the present invention further provides a schematic diagram of an operating principle of the correction module, as shown in fig. 9, where the module includes a second feature extractor.
When applied, the second type sample images are input into the second feature extractor to obtain the second type sample image feature set, from which the subset category feature corresponding to each category is obtained. The image to be detected is input into the second feature extractor to obtain the prediction feature. Cosine similarity is then calculated between the prediction feature and the subset category feature of the category corresponding to the prediction feature, and this cosine similarity is determined as the correction classification score. Finally, the correction classification score and the target classification score are weighted and summed to obtain the corrected target classification score.
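The correction module can be put together end to end as below. This is a hedged sketch in which the feature values, labels, and alpha are illustrative, and feature extraction is assumed to have already produced the vectors:

```python
import math

def correction_module(pred_feature, support_features, support_labels,
                      target_class, target_score, alpha=0.5):
    """Build the subset category feature for the target class from the
    second-type (small-sample) features, compute cosine similarity with the
    prediction feature, then fuse it with the target classification score."""
    members = [x for x, y in zip(support_features, support_labels)
               if y == target_class]
    dim = len(members[0])
    prototype = [sum(v[d] for v in members) / len(members)
                 for d in range(dim)]                       # subset category feature
    dot = sum(a * b for a, b in zip(pred_feature, prototype))
    norm = (math.sqrt(sum(a * a for a in pred_feature))
            * math.sqrt(sum(b * b for b in prototype)))
    cos = dot / norm                                        # correction score
    return alpha * target_score + (1.0 - alpha) * cos       # corrected score
```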
Based on the foregoing method embodiment, an embodiment of the present invention further provides a target detection apparatus, as shown in fig. 10, where a target detection model is prestored in the apparatus, the target detection model includes a first feature extractor, at least one gradient decoupling layer, a regional suggestion network layer, and a regional convolutional neural network layer, and the gradient decoupling layer is configured with a first decoupling parameter corresponding to the regional suggestion network layer and a second decoupling parameter corresponding to the regional convolutional neural network, and the apparatus includes:
the feature extraction module 1002 is configured to perform feature extraction on an image to be detected through a first feature extractor to obtain a basic feature corresponding to the image to be detected;
a decoupling module 1004, configured to process the basic features by using the first decoupling parameter and the second decoupling parameter through the gradient decoupling layer, respectively, to obtain a first decoupling feature corresponding to the first decoupling parameter and a second decoupling feature corresponding to the second decoupling parameter; wherein the first and second decoupling parameters are different such that a similarity between the first decoupling characteristic and the base characteristic and a similarity between the second decoupling characteristic and the base characteristic are different;
a region suggestion module 1006, configured to process the first decoupling feature through a region suggestion network layer to obtain a region suggestion feature;
the detection result determining module 1008 is configured to process the second decoupling characteristic and the area recommendation characteristic through the area convolutional neural network layer to obtain a detection result corresponding to the target object; the detection result comprises a target frame and a classification score corresponding to the target frame.
In the target detection device provided by the embodiment of the invention, feature extraction is performed on the image to be detected by the first feature extractor to obtain the basic features corresponding to the image to be detected; the basic features are processed by the gradient decoupling layer using the first decoupling parameter and the second decoupling parameter, respectively, to obtain a first decoupling feature corresponding to the first decoupling parameter and a second decoupling feature corresponding to the second decoupling parameter; the first decoupling feature is processed by the regional suggestion network layer to obtain a region suggestion feature; and the second decoupling feature and the region suggestion feature are processed by the regional convolutional neural network layer to obtain a detection result. In the embodiment of the invention, the extracted image features are decoupled before being input into the regional suggestion network layer and the regional convolutional neural network layer, and the coupling degree between the image features and the regional suggestion network, and between the image features and the regional convolutional neural network, is controlled through different decoupling parameters.
The decoupling module 1004 is further configured to input the basic features into the gradient decoupling layer; multiplying the first decoupling parameter by the basic feature through a gradient decoupling layer to obtain a first product, and determining the first product as a first decoupling feature; and multiplying the second decoupling parameter by the basic characteristic through the gradient decoupling layer to obtain a second product, and determining the second product as a second decoupling characteristic.
The target detection model is obtained by training through the following method: acquiring a first initial decoupling parameter and a second initial decoupling parameter; training an initial target detection model through a first type of sample image, and obtaining a first target detection model when a first model convergence condition is met; the type of the object contained in the first type sample image is different from that of the target object; training the first target detection model through the second type of sample image, and determining the first target detection model meeting the second model convergence condition as a target detection model; the type of the object contained in the second type sample image is the same as that of the target object, and the number of the first type sample images is larger than that of the second type sample images.
Referring to fig. 11, a schematic structural diagram of another object detection apparatus is shown. The object detection model prestored on the electronic device to which the apparatus is applied further includes, on the basis of the above object detection model, a second feature extractor connected to the regional convolutional neural network layer. On this basis, the apparatus further includes: the second type sample image feature extraction module 1102, configured to perform feature extraction on a second type sample image through the second feature extractor to obtain a second type sample image feature set; the predicted feature extraction module 1104, configured to perform feature extraction on the image to be detected through the second feature extractor to obtain a prediction feature corresponding to the image to be detected; the corrected classification score determining module 1106, configured to determine a correction classification score according to the second type sample image feature set and the prediction feature; and the correcting module 1108, configured to correct the classification score corresponding to the target frame according to the correction classification score to obtain a corrected classification score.
The target detection model further comprises another gradient decoupling layer, and the other gradient decoupling layer is positioned behind the regional convolutional neural network layer and the regional suggestion network layer; the process of processing the second decoupling characteristic and the regional proposal characteristic through the regional convolutional neural network layer to obtain the detection result corresponding to the target object includes: processing the second decoupling characteristic and the regional suggestion characteristic through a regional convolutional neural network layer to obtain a regression characteristic and a classification characteristic; processing the regression features by using a third decoupling parameter through another gradient decoupling layer to obtain a target frame; processing the classification characteristics by using a fourth decoupling parameter through another gradient decoupling layer to obtain a classification score corresponding to the target frame; and determining the target frame and the classification score corresponding to the target frame as a detection result corresponding to the target object.
The corrected classification score determining module 1106 is further configured to divide the second type sample feature set into a plurality of second type sample subsets according to the categories corresponding to the features; wherein each second type sample subset corresponds to a type; determining a target category corresponding to the prediction feature; and determining a corrected classification score according to the second type sample subset corresponding to the target class and the prediction characteristics.
The process of determining a corrected classification score according to the second type sample subset corresponding to the target class and the prediction features includes: determining subset category characteristics of a second type sample subset corresponding to the target category; and calculating the similarity of the subset category features and the predicted features, and determining the similarity as a corrected classification score.
The similarity between the subset classification feature and the prediction feature is cosine similarity.
The correcting module 1108 is further configured to perform weighted summation on the corrected classification score and the target classification score to obtain a sum, and determine the sum as the corrected classification score.
The object detection apparatus provided by the embodiment of the present invention has the same implementation principle and technical effects as the foregoing method embodiments. For the sake of brevity, for the parts of the apparatus embodiment not mentioned here, reference may be made to the corresponding content in the foregoing method embodiments.
An embodiment of the present invention further provides an electronic device, as shown in fig. 12, which is a schematic structural diagram of the electronic device, where the electronic device includes a processor 1201 and a memory 1202, the memory 1202 stores computer-executable instructions that can be executed by the processor 1201, and the processor 1201 executes the computer-executable instructions to implement the object detection method.
In the embodiment shown in fig. 12, the electronic device further comprises a bus 1203 and a communication interface 1204, wherein the processor 1201, the communication interface 1204 and the memory 1202 are connected by the bus 1203.
The Memory 1202 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 1204 (which may be wired or wireless), using the internet, a wide area network, a local area network, a metropolitan area network, and the like. The bus 1203 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 1203 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 12, but that does not indicate only one bus or one type of bus.
The processor 1201 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be carried out by integrated logic circuits in hardware or by instructions in the form of software within the processor 1201. The processor 1201 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules within the decoding processor. The software module may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor 1201 reads the information in the memory and, in combination with its hardware, completes the steps of the object detection method of the foregoing embodiments.
Embodiments of the present invention further provide a computer-readable storage medium storing computer-executable instructions which, when called and executed by a processor, cause the processor to implement the target detection method; for the specific implementation, reference may be made to the foregoing method embodiments, which are not repeated here.
An embodiment of the present invention further provides a computer program product comprising computer-executable instructions which, when called and executed by a processor, implement the target detection method; for the specific implementation, reference may be made to the foregoing method embodiments, which are not repeated here.
The computer program product of the object detection method, apparatus, and electronic device provided in the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the methods described in the foregoing method embodiments, to which reference may be made for the specific implementation, and which are not repeated here.
Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present invention, used to illustrate the technical solutions of the present invention rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and should all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A target detection method, applied to an electronic device on which a target detection model is prestored, wherein the target detection model comprises a first feature extractor, a gradient decoupling layer, a regional suggestion network layer and a regional convolutional neural network layer, and the gradient decoupling layer is configured with a first decoupling parameter corresponding to the regional suggestion network layer and a second decoupling parameter corresponding to the regional convolutional neural network layer, the method comprising the following steps:
extracting the features of the image to be detected through the first feature extractor to obtain the basic features corresponding to the image to be detected;
processing the basic features by the gradient decoupling layer by respectively utilizing the first decoupling parameter and the second decoupling parameter to obtain a first decoupling feature corresponding to the first decoupling parameter and a second decoupling feature corresponding to the second decoupling parameter; wherein the first and second decoupling parameters are different such that a similarity between the first decoupling characteristic and the base characteristic and a similarity between the second decoupling characteristic and the base characteristic are different;
processing the first decoupling characteristic through the regional suggestion network layer to obtain a regional suggestion characteristic;
processing the second decoupling characteristic and the region suggestion characteristic through the region convolutional neural network layer to obtain a detection result corresponding to the target object; and the detection result comprises a target frame and a classification score corresponding to the target frame.
2. The method of claim 1, wherein the step of obtaining a first decoupling characteristic corresponding to the first decoupling parameter and a second decoupling characteristic corresponding to the second decoupling parameter by processing the basic characteristics with the first decoupling parameter and the second decoupling parameter, respectively, through the gradient decoupling layer comprises:
inputting the base features into the gradient decoupling layer;
multiplying the first decoupling parameter and the basic feature through the gradient decoupling layer to obtain a first product, and determining the first product as a first decoupling feature;
and multiplying the second decoupling parameter and the basic characteristic through the gradient decoupling layer to obtain a second product, and determining the second product as a second decoupling characteristic.
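Purely as an illustrative sketch (not the patented implementation; the class and variable names here are assumptions), the forward behavior described in claim 2 amounts to multiplying the base feature by each branch's decoupling parameter, and scaling the backward gradient by the same parameter is the usual motivation for such a layer:

```python
class GradientDecouplingLayer:
    """Hypothetical sketch of a gradient decoupling layer.

    Forward: multiplies the base feature by the decoupling parameter, as in
    claim 2. Backward: scales the upstream gradient by the same parameter,
    which is what decouples the gradient flowing back to the shared
    feature extractor (an assumption; the claim only states the forward pass).
    """

    def __init__(self, decoupling_param):
        self.decoupling_param = decoupling_param

    def forward(self, base_feature):
        # decoupling feature = decoupling parameter * base feature (element-wise)
        return [self.decoupling_param * x for x in base_feature]

    def backward(self, upstream_grad):
        # gradient reaching the backbone is attenuated/amplified by the parameter
        return [self.decoupling_param * g for g in upstream_grad]


base_feature = [0.2, -1.0, 3.0]
rpn_branch = GradientDecouplingLayer(decoupling_param=1.0)   # first decoupling parameter
rcnn_branch = GradientDecouplingLayer(decoupling_param=0.1)  # second decoupling parameter

first_decoupling_feature = rpn_branch.forward(base_feature)
second_decoupling_feature = rcnn_branch.forward(base_feature)
```

Because the two parameters differ, the two decoupled features stand at different distances from the base feature, matching the "different similarity" condition of claim 1.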
3. The method of claim 1 or 2, wherein the object detection model is trained by:
acquiring a first initial decoupling parameter and a second initial decoupling parameter;
training an initial detection model through first type sample images, and obtaining a first detection model when a first model convergence condition is met; wherein the type of the object contained in the first type sample images is different from the type of the target object;
training the first detection model through second type sample images, and determining the first detection model meeting a second model convergence condition as the target detection model; wherein the type of the object contained in the second type sample images is the same as the type of the target object, and the number of the first type sample images is larger than the number of the second type sample images.
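The two-stage schedule of claim 3 can be sketched as follows; this is only an illustration under assumed helper names (`fit_step`, `converged`), not the patented training procedure:

```python
def train_two_stage(model, first_type_batches, second_type_batches, converged):
    """Hypothetical sketch of claim 3's two-stage training.

    Stage 1: train an initial detection model on first type sample images
    (objects of other types than the target) until the first convergence
    condition is met, yielding the first detection model.
    Stage 2: continue training on the (fewer) second type sample images,
    whose object type matches the target, until the second convergence
    condition is met, yielding the target detection model.
    """
    for stage, batches in ((1, first_type_batches), (2, second_type_batches)):
        while not converged(model, stage):
            for images, labels in batches:
                model.fit_step(images, labels)
    return model


class StubModel:
    """Stand-in model that just counts training steps."""
    def __init__(self):
        self.steps = 0

    def fit_step(self, images, labels):
        self.steps += 1


def make_converged(epochs_per_stage):
    # toy convergence condition: stop each stage after a fixed epoch count
    seen = {1: 0, 2: 0}

    def converged(model, stage):
        seen[stage] += 1
        return seen[stage] > epochs_per_stage

    return converged


model = train_two_stage(
    StubModel(),
    first_type_batches=[("img", "lbl")] * 5,   # abundant non-target-class batches
    second_type_batches=[("img", "lbl")] * 2,  # scarce target-class batches
    converged=make_converged(epochs_per_stage=1),
)
```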
4. The method of any one of claims 1-3, wherein the object detection model further comprises a second feature extractor coupled to the regional convolutional neural network layer;
after the step of obtaining the detection result corresponding to the target object, the method further includes:
performing feature extraction on the second type sample image through the second feature extractor to obtain a second type sample image feature set;
extracting the features of the image to be detected through the second feature extractor to obtain the corresponding prediction features of the image to be detected;
determining a corrected classification score according to the second type sample image feature set and the predicted features;
and correcting the classification score corresponding to the target frame according to the corrected classification score to obtain the corrected classification score.
5. The method of any of claims 1-4, wherein the object detection model further comprises another gradient decoupling layer; the other gradient decoupling layer is positioned after the regional convolutional neural network layer;
wherein the step of processing the second decoupling characteristic and the region suggestion characteristic through the region convolutional neural network layer to obtain the detection result corresponding to the target object comprises the following steps:
processing the second decoupling characteristic and the region suggestion characteristic through the region convolutional neural network layer to obtain a regression characteristic and a classification characteristic;
processing the regression feature by using a third decoupling parameter through the other gradient decoupling layer to obtain the target frame;
and processing the classification features by using a fourth decoupling parameter through the other gradient decoupling layer to obtain a classification score corresponding to the target frame.
6. The method of claim 4, wherein the step of determining a corrected classification score based on the second type of sample feature set and the predicted features comprises:
dividing the second type sample characteristic set into a plurality of second type sample subsets according to the categories corresponding to the characteristics; wherein each of the second type sample subsets corresponds to a category;
determining a target class corresponding to the prediction feature;
and determining a corrected classification score according to the second type sample subset corresponding to the target class and the prediction characteristics.
7. The method of claim 6, wherein the step of determining a corrected classification score based on the second class sample subset corresponding to the target class and the predicted features comprises:
determining a subset category characteristic of a second type sample subset corresponding to the target category; wherein, the subset category characteristics are used for characterizing common characteristics of all samples in the second type sample subset;
and calculating the similarity of the subset category features and the prediction features, and determining the similarity as a corrected classification score.
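As an illustration of claims 7 and 8 (the element-wise mean is one plausible reading of the "common feature" of a subset; the claim does not fix how it is computed, and all names here are assumptions):

```python
import math


def subset_category_feature(subset_features):
    """Hypothetical subset category feature: the element-wise mean
    (prototype) of all sample features in one second type sample subset."""
    n = len(subset_features)
    return [sum(col) / n for col in zip(*subset_features)]


def cosine_similarity(a, b):
    # similarity between the subset category feature and the prediction feature
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


subset = [[1.0, 0.0], [1.0, 2.0]]            # made-up features of one category
prototype = subset_category_feature(subset)  # common feature of the subset
prediction_feature = [2.0, 2.0]
correction_score = cosine_similarity(prototype, prediction_feature)
```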
8. The method of claim 7, wherein the similarity between the subset category feature and the prediction feature is a cosine similarity.
9. The method according to claim 4, wherein the step of correcting the classification score corresponding to the target frame according to the corrected classification score to obtain a corrected classification score includes:
and weighting and summing the corrected classification score and the classification score to obtain a sum value, and determining the sum value as the corrected classification score.
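The weighted summation of claim 9 can be sketched as below; the claim does not specify the weights, so the single weight `alpha` is an assumption:

```python
def corrected_classification_score(correction_score, classification_score, alpha=0.5):
    """Hypothetical weighted summation of the correction score and the
    detector's classification score; alpha is an assumed weight."""
    return alpha * correction_score + (1.0 - alpha) * classification_score


# e.g. cosine-based correction score 0.9, raw classification score 0.5
score = corrected_classification_score(0.9, 0.5, alpha=0.5)
```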
10. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the method of any of claims 1 to 9.
11. A computer-readable storage medium having computer-executable instructions stored thereon which, when invoked and executed by a processor, cause the processor to implement the method of any of claims 1 to 9.
12. A computer program product, characterized in that the computer program product comprises a computer program which, when being executed by a processor, carries out the method of any one of claims 1 to 9.
CN202111422593.7A 2021-11-26 2021-11-26 Target detection method and device and electronic equipment Pending CN114373071A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111422593.7A CN114373071A (en) 2021-11-26 2021-11-26 Target detection method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114373071A true CN114373071A (en) 2022-04-19

Family

ID=81138494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111422593.7A Pending CN114373071A (en) 2021-11-26 2021-11-26 Target detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114373071A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117315724A (en) * 2023-11-29 2023-12-29 烟台大学 Open scene-oriented three-dimensional pedestrian detection method, system, equipment and medium
CN117315724B (en) * 2023-11-29 2024-03-08 烟台大学 Open scene-oriented three-dimensional pedestrian detection method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN111797893A (en) Neural network training method, image classification system and related equipment
CN111860588A (en) Training method for graph neural network and related equipment
CN110222718B (en) Image processing method and device
CN111783749A (en) Face detection method and device, electronic equipment and storage medium
CN114049512A (en) Model distillation method, target detection method and device and electronic equipment
CN113792871A (en) Neural network training method, target identification method, device and electronic equipment
CN113592060A (en) Neural network optimization method and device
CN114037838A (en) Neural network training method, electronic device and computer program product
CN111950702A (en) Neural network structure determining method and device
CN115375781A (en) Data processing method and device
CN112529149A (en) Data processing method and related device
CN113706481A (en) Sperm quality detection method, sperm quality detection device, computer equipment and storage medium
CN114359618A (en) Training method of neural network model, electronic equipment and computer program product
CN114373071A (en) Target detection method and device and electronic equipment
CN113673505A (en) Example segmentation model training method, device and system and storage medium
CN114038067B (en) Coal mine personnel behavior detection method, equipment and storage medium
CN115620054A (en) Defect classification method and device, electronic equipment and storage medium
CN114387496A (en) Target detection method and electronic equipment
CN114387465A (en) Image recognition method and device, electronic equipment and computer readable medium
CN114005017A (en) Target detection method and device, electronic equipment and storage medium
Channayanamath et al. Dynamic hand gesture recognition using 3d-convolutional neural network
Jokela Person counter using real-time object detection and a small neural network
CN111444803B (en) Image processing method, device, electronic equipment and storage medium
CN114299310A (en) Method for matching straight line segments in image, image positioning method and electronic equipment
CN115841605A (en) Target detection network training and target detection method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination