CN110781980B - Training method of target detection model, target detection method and device

Training method of target detection model, target detection method and device

Info

Publication number
CN110781980B
CN110781980B
Authority
CN
China
Prior art keywords
target detection
training
detection model
image
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911093997.9A
Other languages
Chinese (zh)
Other versions
CN110781980A (en)
Inventor
鲁方波
汪贤
樊鸿飞
蔡媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd, Beijing Kingsoft Cloud Technology Co Ltd filed Critical Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN201911093997.9A priority Critical patent/CN110781980B/en
Publication of CN110781980A publication Critical patent/CN110781980A/en
Application granted granted Critical
Publication of CN110781980B publication Critical patent/CN110781980B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

The invention provides a training method for a target detection model, a target detection method, and a device, relating to the technical field of image processing. The method comprises the following steps: acquiring a training image, where the training image carries label information of a target object; inputting the training image into a candidate region detection network to obtain a candidate region of the target object; inputting the training image and the candidate region into a key part analysis network to obtain a first feature output by the key part analysis network; and training a target detection model to be trained through the training image, the candidate region, and the first feature output by the key part analysis network to obtain a trained target detection model. The method can effectively improve the accuracy of the target detection model.

Description

Training method of target detection model, target detection method and device
Technical Field
The invention relates to the technical field of image processing, in particular to a training method of a target detection model, a target detection method and a target detection device.
Background
Target detection is a foundation of computer vision: it detects the various target objects contained in an image, such as human figures, animals, or articles, and it can be applied to many practical scenarios, for example crowd counting or image restoration. When image restoration is realized on the basis of target detection, the target objects contained in the image are first detected, the regions the target objects occupy in the image are determined, and those regions are then restored in a targeted manner. At present, target detection techniques mainly comprise traditional target detection methods and deep-learning-based target detection methods. Traditional methods suffer from low detection accuracy. Deep-learning-based methods generally train a neural network and then detect the target objects contained in an image through the trained network; however, for a low-resolution image, or when the region where a target object is located occupies only a small part of the image, the trained network still cannot accurately detect the target object. That is, existing neural networks for detecting target objects have low accuracy in such cases.
Disclosure of Invention
In view of this, the present invention provides a training method for a target detection model, a target detection method and a target detection device, which can effectively improve the accuracy of a neural network for detecting a target object.
In a first aspect, an embodiment of the present invention provides a method for training a target detection model, including: acquiring a training image; the training image carries label information of a target object; inputting the training image into a candidate region detection network to obtain a candidate region of the target object; inputting the training image and the candidate area into a key part analysis network to obtain a first feature output by the key part analysis network; and training a target detection model to be trained through the training image, the candidate region and the first characteristic output by the key part analysis network to obtain the trained target detection model.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the step of inputting the training image and the candidate region into a key part analysis network to obtain a first feature output by the key part analysis network includes: inputting the training image and the candidate region into the key part analysis network, and acquiring a first feature output by a first designated convolutional layer of the key part analysis network for the training image and the candidate region; and the step of training the target detection model to be trained through the training image, the candidate region, and the first feature output by the key part analysis network includes: inputting the training image and the candidate region into a target detection model to be trained, and acquiring a second feature output by a second specified convolutional layer of the target detection model to be trained for the training image and the candidate region; performing feature fusion on the first feature and the second feature to obtain a feature fusion result of the second specified convolutional layer; and training the target detection model to be trained based on the feature fusion result.
With reference to the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where the number of the first specified convolutional layers and the second specified convolutional layers is multiple, and the second specified convolutional layers correspond to the first specified convolutional layers one to one; the step of performing feature fusion on the first feature and the second feature to obtain a feature fusion result of the second specified convolutional layer includes: and performing feature fusion on the second feature output by the second specified convolutional layer and the first feature output by the first specified convolutional layer corresponding to the second specified convolutional layer to obtain a feature fusion result of the second specified convolutional layer.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the step of training the target detection model to be trained based on the feature fusion result includes: respectively performing upsampling processing of different proportions on the feature fusion result of each second specified convolutional layer to obtain an upsampling result corresponding to each second specified convolutional layer, and performing feature fusion on each upsampling result to obtain an upsampling fusion feature; respectively performing downsampling processing of different proportions on the feature fusion result of each second designated convolutional layer to obtain a downsampling result corresponding to each second designated convolutional layer, and performing feature fusion on each downsampling result to obtain downsampling fusion features; performing regression processing on the up-sampling fusion characteristics to obtain a regression processing result; and carrying out classification processing on the downsampling fusion characteristics to obtain a classification processing result; and training the target detection model to be trained based on the regression processing result and the classification processing result.
With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where the candidate region detection network and the key part analysis network are both obtained by training based on training images of the target detection model; the training image also carries label information of a foreground region when the candidate region detection network is trained; and the training image also carries label information of the key parts when the key part analysis network is trained.
In a second aspect, an embodiment of the present invention further provides a target detection method, including: acquiring an image to be detected; inputting the image to be detected into a target detection model obtained by pre-training; wherein the target detection model is obtained by training with any one of the methods provided by the first aspect; detecting the image to be detected through the target detection model to obtain a target detection result; the target detection result comprises position information of a target object in the image to be detected.
In a third aspect, an embodiment of the present invention further provides a training apparatus for a target detection model, including: the training image acquisition module is used for acquiring a training image; the training image carries label information of a target object; a candidate region detection module, configured to input the training image to a candidate region detection network to obtain a candidate region of the target object; the first feature acquisition module is used for inputting the training image and the candidate region into a key part analysis network to obtain a first feature output by the key part analysis network; and the training module is used for training a target detection model to be trained through the training image, the candidate region and the first characteristic output by the key part analysis network to obtain the trained target detection model.
In a fourth aspect, an embodiment of the present invention further provides an object detection apparatus, including: the image acquisition module to be detected is used for acquiring an image to be detected; the image to be detected input module is used for inputting the image to be detected into a target detection model obtained by pre-training; wherein the target detection model is obtained by training with any one of the methods provided by the first aspect; the detection module is used for detecting the image to be detected through the target detection model to obtain a target detection result; the target detection result comprises position information of a target object in the image to be detected.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including a processor and a memory; the memory has stored thereon a computer program which, when executed by the processor, performs the method of any one of the aspects as provided in the first aspect, or performs the method as provided in the second aspect.
In a sixth aspect, the present invention further provides a computer storage medium for storing computer software instructions for performing any one of the methods provided in the first aspect, or performing the method provided in the second aspect.
The embodiment of the invention provides a training method and a training apparatus for a target detection model: a training image carrying label information of a target object is acquired, the training image is input into a candidate region detection network to obtain a candidate region of the target object, and the training image and the candidate region are input into a key part analysis network to obtain a first feature output by the key part analysis network, so that the target detection model to be trained is trained through the training image, the candidate region, and the first feature to obtain the trained target detection model. The embodiment of the invention trains the target detection model based on the training image, the candidate region output by the candidate region detection network, and the first feature output by the key part analysis network, which comprehensively improves the trained model's ability to detect the target object and thereby effectively improves its detection accuracy when detecting the target object.
The target detection method and device provided by the embodiment of the invention are used for acquiring the image to be detected, inputting the image to be detected into the target detection model obtained by training the training method of the target detection model, and detecting the image to be detected through the target detection model to obtain the target detection result comprising the position information of the target object in the image to be detected. According to the embodiment of the invention, the target detection model with higher detection accuracy rate is used for detecting the image to be detected, so that the accuracy of the detected target object can be effectively improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flowchart of a training method of a target detection model according to an embodiment of the present invention;
fig. 2 is a schematic connection diagram of a candidate region detection network, a key part analysis network, and a target detection model according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a target detection method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a training apparatus for a target detection model according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the embodiments, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, target detection techniques mainly comprise traditional target detection methods and deep-learning-based target detection methods. Taking a human head detection method as an example of a traditional method: foreground extraction is performed on an input video image while each frame is scaled to different sizes, pixel-difference features are extracted for each fixed-size image window, and the extracted pixel-difference features are input into an Adaboost multi-view classifier for head detection. Because this method extracts pixel-difference features with a traditional algorithm, and such features suffer from low accuracy, the accuracy of the heads it detects is also low. Taking a deep-learning-based human head detection method and device as an example of the second category: a neural network is trained with a large amount of labeled data so that it learns human figure features. The detection accuracy of this approach is higher than that of traditional target detection methods, but it uses the same network to process human figures of different sizes in an image; for a low-resolution image, or when a human figure occupies only a small proportion of the image, it cannot extract the figure's features and therefore cannot effectively capture the figure, so its detection accuracy is low. In view of this, the training method for a target detection model, the target detection method, and the devices provided by the embodiments of the present invention can effectively improve the accuracy of a neural network used to detect target objects.
To facilitate understanding of the present embodiment, first, a detailed description is given to a training method of a target detection model disclosed in the present embodiment, referring to a flowchart of the training method of a target detection model shown in fig. 1, where the method mainly includes the following steps S102 to S108:
step S102, training images are obtained.
The training image carries label information of the target object, which can be obtained through manual labeling. In practical applications, in order to improve the detection accuracy of the target detection model, a deep convolutional neural network such as VGG (Visual Geometry Group network) may be used as the target detection model. In one embodiment, the training image carrying the label information of the target object may be obtained from a database storing a training set.
Step S104, the training image is input into a candidate region detection network to obtain a candidate region of the target object.
The candidate region detection network is used to detect regions of the training image where the target object may be located; a candidate region is such a region. In a specific implementation, the training image may be input into a candidate region detection network trained in advance, so that the network outputs a candidate region of the target object for the training image, where the candidate region detection network may be obtained by training on the training images of the target detection model.
Step S106, the training image and the candidate region are input into a key part analysis network to obtain a first feature output by the key part analysis network.
The key part analysis network is used for analyzing key part information of the target object and can be obtained based on training images of the target detection model. In order to facilitate understanding of the key part information, the embodiment of the present invention describes the key part information by taking an example in which the target object is a portrait, and when the target object is a portrait, the key part information may be understood as body information such as hand information, head information, face information, and limb information. In one embodiment, the key portion analysis network includes a plurality of convolutional layers, and the first feature is a convolutional feature output by a designated convolutional layer of the plurality of convolutional layers with respect to the training image and the candidate region.
Step S108, the target detection model to be trained is trained through the training image, the candidate region, and the first feature output by the key part analysis network, to obtain the trained target detection model.
In one embodiment, the training image and the candidate region may be input into the key part analysis network and the target detection model respectively; the first feature output by the key part analysis network for the training image and the candidate region is feature-fused with the feature output by the target detection model for the training image and the candidate region to obtain a feature fusion result, and the target detection model is trained based on that result. When the loss function of the target detection model converges, the training of the target detection model may be determined to be complete.
In the training method for the target detection model provided by the embodiment of the invention, a training image carrying label information of the target object is acquired, the training image is input into the candidate region detection network to obtain the candidate region of the target object, and the training image and the candidate region are input into the key part analysis network to obtain the first feature output by the key part analysis network, so that the target detection model to be trained is trained through the training image, the candidate region, and the first feature to obtain the trained target detection model. The embodiment of the invention trains the target detection model based on the training image, the candidate region output by the candidate region detection network, and the first feature output by the key part analysis network, which comprehensively improves the trained model's ability to detect the target object and thereby effectively improves its detection accuracy when detecting the target object.
In practical application, both the candidate region detection network and the key part analysis network are obtained by training on the training images of the target detection model. When the candidate region detection network is trained, the training image also carries label information of the foreground region: the training image is used as the input of the candidate region detection network, and the label information Label1 of the foreground region carried by the training image is used as the ground truth (GT) of the candidate region detection network, where the foreground region embodies the environmental information in the training image. An embodiment of the present invention provides such a candidate region detection network. Specifically, a training image is input into a candidate region generation network (that is, the candidate region detection network) NET1 to obtain m candidate regions; the set of m candidate regions is denoted REGION_SET, and the n-th candidate region is denoted region_n. Specifically, region_n can be expressed as region_n = (x_n, y_n, w_n, h_n), and REGION_SET can be expressed as REGION_SET = {region_1, region_2, ..., region_n, ..., region_m}, where m denotes the number of candidate regions, x_n denotes the abscissa of the n-th candidate region, y_n denotes its ordinate, w_n denotes its width, and h_n denotes its height. In practical applications, the number of candidate regions m may be set based on the complexity of the problem; for example, m may be set to 500 for conventional portrait detection, and may be increased appropriately for scenes containing more target objects so as to obtain more candidate regions. In a specific implementation, the candidate region detection network may adopt an RPN (Region Proposal Network), which is trained and learned based on the training images and the label information Label1 of the foreground regions they carry; the final trained candidate region detection network is obtained through repeated iteration.
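To make the above representation concrete, the following is a minimal sketch in a PyTorch-style setting; the function name, the assumption that NET1 returns an (m, 4) tensor, and the default m = 500 are illustrative choices, not details fixed by the patent:

```python
import torch

def generate_candidate_regions(net1, image, m=500):
    """Run the trained candidate region detection network NET1 on one image.

    Returns REGION_SET as an (m, 4) tensor whose n-th row is
    region_n = (x_n, y_n, w_n, h_n): the top-left coordinates of the
    n-th candidate region plus its width and height.
    """
    with torch.no_grad():          # NET1 is already trained; no gradients needed
        region_set = net1(image)   # assumed to return an (m, 4) tensor of boxes
    return region_set

# Usage: region_set[n] holds region_n for the n-th candidate region.
```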
When the key part analysis network is trained, the training image also carries label information of the key parts. Here, the training image and the candidate region set REGION_SET are used as the input of the key part analysis network, and the label information Label2 of the key parts carried by the training image is used as the GT of the key part analysis network; both are jointly fed into the key part analysis network NET2 for learning, and the trained key part analysis network is finally obtained through repeated iteration. If the target object is a portrait, the key part analysis network may also be referred to as a human body parsing network.
An embodiment of the present invention provides a specific implementation manner of step S106: the training image and the candidate region are input into the key part analysis network, and the first feature output by a first designated convolutional layer of the key part analysis network for the training image and the candidate region is acquired. The key part analysis network comprises a plurality of convolutional layers; in a specific implementation, one or more of them may be selected as first designated convolutional layers, and the convolutional feature output by each first designated convolutional layer based on the training image and the candidate region is acquired, that convolutional feature being a first feature.
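One common way to read out such intermediate convolutional features is a forward hook; the sketch below assumes PyTorch, and the layer names conv4 to conv6 are assumptions about NET2's internal naming rather than details given by the patent:

```python
import torch.nn as nn

def capture_layer_outputs(model: nn.Module, layer_names):
    """Record the output of each named sub-module during the next forward pass."""
    captured = {}

    def make_hook(name):
        def hook(module, inputs, output):
            captured[name] = output   # the first feature of this designated layer
        return hook

    for name, module in model.named_modules():
        if name in layer_names:
            module.register_forward_hook(make_hook(name))
    return captured

# Usage sketch: run NET2 on the training image and candidate regions, then
# read the first features of the designated convolutional layers.
# first_features = capture_layer_outputs(net2, {"conv4", "conv5", "conv6"})
# net2(training_image, region_set)
```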
In an implementation manner, an embodiment of the present invention provides a specific implementation manner of step S108, see steps 1 to 3 as follows:
step 1, inputting the training image and the candidate region into a target detection model to be trained, and acquiring a second characteristic output by a second specified convolutional layer of the target detection model to be trained aiming at the training image and the candidate region. The target detection model may also include a plurality of convolutional layers, and one or more convolutional layers are selected from the target detection model as a second designated convolutional layer, and the second designated convolutional layer is in one-to-one correspondence with the first designated convolutional layer, for example, if the first designated convolutional layer is a fourth convolutional layer in the key location analysis network, the second designated convolutional layer will be the fourth convolutional layer in the target detection model. In a specific implementation, the convolution feature output by each second convolution layer based on the training image and the candidate region may be used as the second feature.
Step 2, feature fusion is performed on the first feature and the second feature to obtain a feature fusion result of the second specified convolutional layer.
To facilitate understanding of step 2, an embodiment of the present invention provides a schematic connection diagram of the candidate region detection network, the key part analysis network, and the target detection model, as shown in fig. 2. The training image is input into the candidate region detection network NET1, the key part analysis network NET2, and the target detection model NET3 respectively, and the candidate region set REGION_SET output by NET1 is input into both NET2 and NET3. The key part analysis network NET2 and the target detection model NET3 shown in fig. 2 both comprise six convolutional layers, and the convolutional features output by the sixth convolutional layer of NET2 pass through fully connected layers and output a parsing result containing the key part information.
Based on fig. 2, in the embodiment of the present invention the first designated convolutional layers and the second specified convolutional layers are both plural in number, and the embodiment provides a specific implementation of performing feature fusion on the first feature and the second feature to obtain the feature fusion result of the second specified convolutional layer: for each second specified convolutional layer, the second feature output by that layer is feature-fused with the first feature output by its corresponding first designated convolutional layer to obtain the feature fusion result of that second specified convolutional layer. Taking fig. 2 as an example, the first designated convolutional layers may comprise the fourth, fifth, and sixth convolutional layers of the key part analysis network NET2, and the second specified convolutional layers may comprise the fourth, fifth, and sixth convolutional layers of the target detection model NET3. When the feature fusion step is executed, the second feature output by each second specified convolutional layer of NET3 is fused with the first feature output by the corresponding first designated convolutional layer of NET2, and the feature fusion result of a lower convolutional layer is used as the input of a higher convolutional layer. Here, lower and higher convolutional layers are relative concepts: for the fifth and sixth convolutional layers, the fifth is the lower layer and the sixth the higher; for the fourth and fifth convolutional layers, the fourth is the lower layer and the fifth the higher.
In a specific implementation, the feature NET2-CONV4 output by the fourth convolutional layer of the key part analysis network NET2 and the feature NET3-CONV4 output by the fourth convolutional layer of the target detection model NET3 are feature-fused to obtain a feature fusion result M4. M4 is input into the fifth convolutional layer of NET2 and the fifth convolutional layer of NET3 respectively, and NET2-CONV5 output by the fifth convolutional layer of NET2 is feature-fused with NET3-CONV5 output by the fifth convolutional layer of NET3 to obtain a feature fusion result M5. M5 is then input into the sixth convolutional layer of NET2 and the sixth convolutional layer of NET3, and NET2-CONV6 output by the sixth convolutional layer of NET2 is feature-fused with NET3-CONV6 output by the sixth convolutional layer of NET3 to obtain a feature fusion result M6. In practical application, only the weights of the target detection model NET3 are updated; the weights of the key part analysis network NET2 are not updated.
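The flow just described can be sketched as follows. This is an assumed implementation in PyTorch: the patent does not fix the fusion operator, so channel-wise concatenation followed by a 1x1 convolution is used here as one plausible choice, and NET2's layers are frozen to mirror the statement that only NET3's weights are updated:

```python
import torch
import torch.nn as nn

class LayerwiseFusion(nn.Module):
    """Fuse corresponding conv layers of NET2 and NET3 (layers 4 to 6 here)."""

    def __init__(self, net2_layers, net3_layers, channels):
        super().__init__()
        self.net2 = nn.ModuleList(net2_layers)   # designated layers of NET2
        self.net3 = nn.ModuleList(net3_layers)   # specified layers of NET3
        # One 1x1 fusion convolution per stage, mapping 2*C -> C channels.
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * c, c, kernel_size=1) for c in channels])
        for p in self.net2.parameters():
            p.requires_grad = False              # NET2 is not updated

    def forward(self, f2, f3):
        fusion_results = []                       # will collect M4, M5, M6
        for conv2, conv3, fuse in zip(self.net2, self.net3, self.fuse):
            f2, f3 = conv2(f2), conv3(f3)         # NET2-CONVk and NET3-CONVk
            m = fuse(torch.cat([f2, f3], dim=1))  # feature fusion result Mk
            f2 = f3 = m                           # Mk feeds both next layers
            fusion_results.append(m)
        return fusion_results
```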
Step 3, the target detection model to be trained is trained based on the feature fusion results.
When the step of training the target detection model to be trained based on the feature fusion result is executed, the target detection model to be trained may be trained according to the following steps 3.1 to 3.4:
and 3.1, respectively carrying out upsampling processing of different proportions on the feature fusion result of each second specified convolutional layer to obtain an upsampling result corresponding to each second specified convolutional layer, and carrying out feature fusion on each upsampling result to obtain an upsampling fusion feature. Considering that the aspect ratio values of the feature maps output between adjacent convolutional layers are all 2, performing UP-sampling on the feature fusion result M5 by 2 times obtains an UP-sampling result M5UP2, and performing UP-sampling on the feature fusion result M6 by 4 times obtains an UP-sampling result M6UP4, thereby obtaining a feature map having the same size as the feature fusion result M4, and the UP-sampling process performed on the feature fusion result M4 can be understood as 0-time UP-sampling, so that the UP-sampling result M4UP0 is substantially the feature fusion result M4. On the basis, feature fusion processing is carried out on the feature fusion result M4, the UP-sampling result M5UP2 and the UP-sampling result M6UP4, and the UP-sampling fusion feature MR is obtained.
Step 3.2, down-sampling processing of different proportions is performed on the feature fusion result of each second specified convolutional layer to obtain the corresponding down-sampling result, and the down-sampling results are feature-fused to obtain the down-sampling fusion feature. For example, the feature fusion result M4 is down-sampled by a factor of 4 to obtain the down-sampling result M4DOWN4, M5 is down-sampled by a factor of 2 to obtain M5DOWN2, and the down-sampling applied to M6 is zero down-sampling, so M6DOWN0 is essentially the feature fusion result M6; each down-sampling result then has the same size as M6. On this basis, feature fusion processing is performed on M4DOWN4, M5DOWN2, and M6 to obtain the down-sampling fusion feature MC.
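Steps 3.1 and 3.2 can be sketched as below, assuming PyTorch, that adjacent stages differ by a factor of 2 in spatial size as stated above, and that the fusion of the rescaled maps is channel-wise concatenation (the patent leaves the operator open):

```python
import torch
import torch.nn.functional as F

def multiscale_fusion(m4, m5, m6):
    """Build the up-sampling fusion feature MR and down-sampling fusion feature MC."""
    # Step 3.1: rescale everything to M4's resolution and fuse into MR.
    m5_up2 = F.interpolate(m5, scale_factor=2, mode="bilinear", align_corners=False)
    m6_up4 = F.interpolate(m6, scale_factor=4, mode="bilinear", align_corners=False)
    mr = torch.cat([m4, m5_up2, m6_up4], dim=1)      # MR

    # Step 3.2: rescale everything to M6's resolution and fuse into MC.
    m4_down4 = F.interpolate(m4, scale_factor=0.25, mode="bilinear", align_corners=False)
    m5_down2 = F.interpolate(m5, scale_factor=0.5, mode="bilinear", align_corners=False)
    mc = torch.cat([m4_down4, m5_down2, m6], dim=1)  # MC
    return mr, mc
```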
Step 3.3, regression processing is performed on the up-sampling fusion feature to obtain a regression processing result, and classification processing is performed on the down-sampling fusion feature to obtain a classification processing result. Considering that the classification operation can be completed from high-level semantic features alone, while the bounding-box regression operation requires more low-level semantic information to obtain an accurate target boundary, on the basis of fig. 2 the up-sampling fusion feature MR is input into the fully connected layer FC8 for the bounding-box regression operation to obtain the regression processing result, and the down-sampling fusion feature MC is input into the fully connected layer FC7 for the classification operation to obtain the classification processing result.
Step 3.4, the target detection model to be trained is trained based on the regression processing result and the classification processing result: the model is iterated based on these two results until the trained target detection model is obtained.
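Steps 3.3 and 3.4 can be sketched as follows; the head dimensions and the loss choices (smooth L1 for bounding-box regression, cross-entropy for classification, summed into one objective) are common conventions assumed here, not prescribed verbatim by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectionHeads(nn.Module):
    """FC8 regresses boxes from MR; FC7 classifies from MC."""

    def __init__(self, mr_dim, mc_dim, num_classes):
        super().__init__()
        self.fc8 = nn.Linear(mr_dim, 4)             # bounding-box regression head
        self.fc7 = nn.Linear(mc_dim, num_classes)   # classification head

    def forward(self, mr, mc):
        boxes = self.fc8(torch.flatten(mr, start_dim=1))    # regression result
        logits = self.fc7(torch.flatten(mc, start_dim=1))   # classification result
        return boxes, logits

def detection_loss(boxes, logits, gt_boxes, gt_labels):
    """Combined objective; NET3 is iterated until this loss converges."""
    reg_loss = F.smooth_l1_loss(boxes, gt_boxes)
    cls_loss = F.cross_entropy(logits, gt_labels)
    return reg_loss + cls_loss
```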
The embodiment of the invention is particularly well suited to portrait detection; when the target detection model is applied to portrait detection, it may be referred to as a portrait detection network.
In summary, human eyes do not observe a target object as an isolated object but consider it together with its surrounding environment. The embodiment of the present invention therefore fuses environmental information into the target detection model through the candidate region detection network, and performs feature fusion between the convolutional features output by the key part analysis network and those of the target detection model. This effectively fuses different semantic features of different networks and features of different layers of the same network, makes effective use of high- and low-level semantic information as well as contextual semantic information, and thereby significantly improves the detection accuracy of the target detection model.
In view of the low accuracy with which conventional target detection methods detect a target object, an embodiment of the present invention provides a target detection method based on the above training method for a target detection model. Referring to the schematic flow diagram of the target detection method shown in fig. 3, the method mainly includes the following steps S302 to S306:
step S302, an image to be detected is obtained. The image to be detected may include a high resolution image, a low resolution image, or an image with a smaller area occupied by the target object, for example, the image to be detected includes an image to be detected when the image resolution is higher than a first resolution threshold, the image resolution is lower than a low resolution image of a second resolution threshold, and the area occupied by the target object is smaller than the area occupied by the target object.
Step S304, the image to be detected is input into a target detection model obtained by pre-training. The target detection model is trained with any of the training methods for a target detection model provided by the foregoing embodiments.
Step S306, the image to be detected is detected by the target detection model to obtain a target detection result, which includes the position information of the target object in the image to be detected. Because environmental information in the image is incorporated during training, the target detection model provided by this embodiment better matches the way human vision perceives a target object, so the detection result has higher accuracy.
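A minimal inference sketch for steps S302 to S306 follows; the checkpoint path and the (boxes, logits) output form are assumptions for illustration, since the patent only requires that the result contain the target object's position information:

```python
import torch

def detect(model_path, image):
    """Run the pre-trained target detection model on one image to be detected."""
    model = torch.load(model_path)     # target detection model trained as above
    model.eval()
    with torch.no_grad():
        boxes, logits = model(image.unsqueeze(0))   # add a batch dimension
    labels = logits.argmax(dim=1)
    return boxes, labels               # position information and predicted classes
```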
According to the target detection method provided by the embodiment of the invention, an image to be detected is acquired and input into the target detection model obtained through the above training method, and the image is detected by the target detection model to obtain a target detection result that includes the position information of the target object in the image to be detected. Because a target detection model with higher detection accuracy is used to detect the image, the accuracy of the detected target object can be effectively improved.
As to the training method of the target detection model provided in the foregoing embodiment, an embodiment of the present invention provides a training apparatus of a target detection model, and referring to a schematic structural diagram of the training apparatus of a target detection model shown in fig. 4, the training apparatus may include the following components:
a training image acquisition module 402 for acquiring a training image; the training image carries label information of the target object.
A candidate region detection module 404, configured to input the training image into a candidate region detection network to obtain a candidate region of the target object.
A first feature acquisition module 406, configured to input the training image and the candidate region into a key part analysis network to obtain a first feature output by the key part analysis network.
A training module 408, configured to train the target detection model to be trained through the training image, the candidate region, and the first feature output by the key part analysis network, to obtain the trained target detection model.
The embodiment of the invention trains the target detection model based on the training image, the candidate area output by the candidate area detection network and the first characteristic output by the key part analysis network, and comprehensively improves the capability of the trained target detection model for detecting the target object, thereby effectively improving the detection accuracy of the trained target detection model when detecting the target object.
In an embodiment, the first feature obtaining module 406 is further configured to: inputting the training image and the candidate region into the key part analysis network, and acquiring a first feature output by a first designated convolutional layer of the key part analysis network aiming at the training image and the candidate region; the training module 408 further comprises: the second characteristic acquisition unit is used for inputting the training image and the candidate area into the target detection model to be trained and acquiring a second characteristic output by a second specified convolutional layer of the target detection model to be trained aiming at the training image and the candidate area; the characteristic fusion unit is used for carrying out characteristic fusion on the first characteristic and the second characteristic to obtain a characteristic fusion result of the second specified convolutional layer; and the model training unit is used for training the target detection model to be trained based on the feature fusion result.
In one embodiment, the number of the first designated convolutional layer and the second designated convolutional layer is plural, and the second designated convolutional layer corresponds to the first designated convolutional layer one by one; the feature fusion unit is further configured to perform feature fusion on, for each second specified convolutional layer, a second feature output by the second specified convolutional layer and a first feature output by the first specified convolutional layer corresponding to the second specified convolutional layer to obtain a feature fusion result of the second specified convolutional layer.
In an embodiment, the model training unit is further configured to perform upsampling processing with different proportions on the feature fusion result of each second specified convolutional layer, to obtain an upsampling result corresponding to each second specified convolutional layer, and perform feature fusion on each upsampling result, to obtain an upsampled fusion feature; respectively performing downsampling processing of different proportions on the feature fusion result of each second designated convolutional layer to obtain a downsampling result corresponding to each second designated convolutional layer, and performing feature fusion on each downsampling result to obtain a downsampling fusion feature; performing regression processing on the up-sampling fusion characteristics to obtain a regression processing result; and classifying the downsampling fusion characteristics to obtain a classification processing result; and training the target detection model to be trained based on the regression processing result and the classification processing result.
In one embodiment, the candidate area detection network and the key part analysis network are obtained by training based on a training image of a target detection model; the training image also carries label information of a foreground region when the candidate region detection network is trained; the training image also carries the label information of the key part when training the key part analysis network.
With respect to the target detection method provided in the foregoing embodiment, an embodiment of the present invention provides a target detection apparatus, referring to a schematic structural diagram of a target detection apparatus shown in fig. 5, where the apparatus may include the following components:
an image to be detected acquisition module 502 is configured to acquire an image to be detected.
An image to be detected input module 504, configured to input an image to be detected to a pre-trained target detection model; the target detection model is obtained by training by adopting any one of the training methods of the target detection model provided by the foregoing embodiments.
The detection module 506 is configured to detect the image to be detected through the target detection model to obtain a target detection result; the target detection result comprises position information of the target object in the image to be detected.
According to the embodiment of the invention, the target detection model with higher detection accuracy rate is used for detecting the image to be detected, so that the accuracy of the detected target object can be effectively improved.
The device provided by the embodiment of the present invention has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, where the device embodiments do not mention something, reference may be made to the corresponding content in the method embodiments.
An embodiment of the present invention further provides an electronic device, which comprises a processor and a storage device; the storage device stores a computer program that, when executed by the processor, performs the method of any of the above embodiments.
Fig. 6 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present invention, where the electronic device 100 includes: a processor 60, a memory 61, a bus 62 and a communication interface 63, wherein the processor 60, the communication interface 63 and the memory 61 are connected through the bus 62; the processor 60 is arranged to execute executable modules, such as computer programs, stored in the memory 61.
The memory 61 may include a high-speed random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between this system's network element and at least one other network element is realized through at least one communication interface 63 (wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, and the like.
The bus 62 may be an ISA bus, a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one double-headed arrow is shown in fig. 6, but this does not mean there is only one bus or one type of bus.
The memory 61 is used to store a program; the processor 60 executes the program after receiving an execution instruction, and the method executed by the flow-defined apparatus disclosed in any of the foregoing embodiments of the present invention may be applied to, or implemented by, the processor 60.
The processor 60 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 60 or by instructions in the form of software. The processor 60 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or executed by it. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present invention may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or registers. The storage medium is located in the memory 61, and the processor 60 reads the information in the memory 61 and completes the steps of the above method in combination with its hardware.
The computer program product provided by the embodiment of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the methods described in the foregoing method embodiments. For specific implementation, reference may be made to the method embodiments, which are not repeated here.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate rather than limit its technical solutions, and the protection scope of the present invention is not limited to them. Although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field can still, within the technical scope of the present disclosure, modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions of some of their technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for training a target detection model, comprising:
acquiring a training image; the training image carries label information of a target object;
inputting the training image into a candidate region detection network to obtain a candidate region of the target object;
inputting the training image and the candidate area into a key part analysis network to obtain a first feature output by the key part analysis network;
training a target detection model to be trained through the training image, the candidate region and the first feature output by the key part analysis network to obtain a trained target detection model;
the step of training the target detection model to be trained through the training image, the candidate region and the first feature output by the key part analysis network includes:
inputting the training image and the candidate region into a target detection model to be trained, and acquiring a second characteristic output by a second specified convolutional layer of the target detection model to be trained aiming at the training image and the candidate region;
performing feature fusion on the first feature and the second feature to obtain a feature fusion result of the second specified convolutional layer;
and training the target detection model to be trained based on the feature fusion result.
2. The method of claim 1, wherein the step of inputting the training image and the candidate region into a key part analysis network to obtain a first feature output by the key part analysis network comprises:
inputting the training image and the candidate region into the key part analysis network, and acquiring a first feature output by a first designated convolutional layer of the key part analysis network aiming at the training image and the candidate region.
3. The method of claim 2, wherein the first designated convolutional layer and the second designated convolutional layer are plural in number, and the second designated convolutional layer corresponds to the first designated convolutional layer one to one;
the step of performing feature fusion on the first feature and the second feature to obtain a feature fusion result of the second specified convolutional layer includes:
and performing feature fusion on the second feature output by the second specified convolutional layer and the first feature output by the first specified convolutional layer corresponding to the second specified convolutional layer to obtain a feature fusion result of the second specified convolutional layer.
4. The method according to claim 3, wherein the step of training the target detection model to be trained based on the feature fusion result comprises:
respectively performing upsampling processing of different proportions on the feature fusion result of each second specified convolutional layer to obtain an upsampling result corresponding to each second specified convolutional layer, and performing feature fusion on each upsampling result to obtain an upsampling fusion feature;
respectively performing downsampling processing of different proportions on the feature fusion result of each second designated convolutional layer to obtain a downsampling result corresponding to each second designated convolutional layer, and performing feature fusion on each downsampling result to obtain downsampling fusion features;
performing regression processing on the up-sampling fusion characteristics to obtain a regression processing result; and carrying out classification processing on the downsampling fusion characteristics to obtain a classification processing result;
and training the target detection model to be trained based on the regression processing result and the classification processing result.
5. The method according to claim 1, wherein the candidate region detection network and the key part analysis network are both trained based on training images of the target detection model; the training image also carries label information of a foreground region when the candidate region detection network is trained; and the training image also carries label information of the key parts when the key part analysis network is trained.
6. A target detection method, comprising:
acquiring an image to be detected;
inputting the image to be detected into a target detection model obtained through pre-training, wherein the target detection model is trained by the method of any one of claims 1 to 5; and
detecting the image to be detected through the target detection model to obtain a target detection result, wherein the target detection result comprises position information of a target object in the image to be detected.
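The detection flow of claim 6 reduces to a standard inference loop. In the sketch below, `TrainedDetector` is a placeholder for the model produced by the method of claims 1 to 5, and the box-tensor output format is an assumption.

```python
import torch
import torch.nn as nn

# Placeholder for the pre-trained target detection model; a real model would
# be produced by the training method above.
class TrainedDetector(nn.Module):
    def forward(self, x):
        # Illustrative output: one detected box as (x1, y1, x2, y2).
        return torch.tensor([[34.0, 50.0, 120.0, 200.0]])

model = TrainedDetector().eval()
image = torch.randn(1, 3, 256, 256)  # image to be detected
with torch.no_grad():
    boxes = model(image)             # position information of the target object
```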
7. An apparatus for training a target detection model, comprising:
a training image acquisition module configured to acquire a training image, wherein the training image carries label information of a target object;
a candidate region detection module configured to input the training image into a candidate region detection network to obtain a candidate region of the target object;
a first feature acquisition module configured to input the training image and the candidate region into a key part analysis network to obtain a first feature output by the key part analysis network; and
a training module configured to train a target detection model to be trained through the training image, the candidate region and the first feature output by the key part analysis network to obtain a trained target detection model;
wherein the training module is further configured to:
input the training image and the candidate region into the target detection model to be trained, and acquire a second feature output by a second specified convolutional layer of the target detection model to be trained for the training image and the candidate region;
perform feature fusion on the first feature and the second feature to obtain a feature fusion result of the second specified convolutional layer; and
train the target detection model to be trained based on the feature fusion result.
8. A target detection device, comprising:
an image acquisition module configured to acquire an image to be detected;
an image input module configured to input the image to be detected into a target detection model obtained through pre-training, wherein the target detection model is trained by the method of any one of claims 1 to 5; and
a detection module configured to detect the image to be detected through the target detection model to obtain a target detection result, wherein the target detection result comprises position information of a target object in the image to be detected.
9. An electronic device, comprising a processor and a memory, wherein the memory stores a computer program that, when executed by the processor, performs the method of any one of claims 1 to 5 or the method of claim 6.
10. A computer storage medium storing computer software instructions for use in the method of any one of claims 1 to 5 or for performing the method of claim 6.
CN201911093997.9A 2019-11-08 2019-11-08 Training method of target detection model, target detection method and device Active CN110781980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911093997.9A CN110781980B (en) 2019-11-08 2019-11-08 Training method of target detection model, target detection method and device


Publications (2)

Publication Number Publication Date
CN110781980A (en) 2020-02-11
CN110781980B (en) 2022-04-12

Family

ID=69391025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911093997.9A Active CN110781980B (en) 2019-11-08 2019-11-08 Training method of target detection model, target detection method and device

Country Status (1)

Country Link
CN (1) CN110781980B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767919B (en) * 2020-04-10 2024-02-06 福建电子口岸股份有限公司 Multilayer bidirectional feature extraction and fusion target detection method
CN111597945B (en) * 2020-05-11 2023-08-18 济南博观智能科技有限公司 Target detection method, device, equipment and medium
CN113822302A (en) * 2020-06-18 2021-12-21 北京金山数字娱乐科技有限公司 Training method and device for target detection model
CN113139543B (en) * 2021-04-28 2023-09-01 北京百度网讯科技有限公司 Training method of target object detection model, target object detection method and equipment
CN113344791B (en) * 2021-07-05 2022-06-10 中山大学 Binocular super-resolution image detection method, system and medium based on cavity convolution and feature fusion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529565A (en) * 2016-09-23 2017-03-22 北京市商汤科技开发有限公司 Target identification model training and target identification method and device, and computing equipment
CN109461172A (en) * 2018-10-25 2019-03-12 南京理工大学 Manually with the united correlation filtering video adaptive tracking method of depth characteristic
CN109495784A (en) * 2018-11-29 2019-03-19 北京微播视界科技有限公司 Information-pushing method, device, electronic equipment and computer readable storage medium
WO2019083738A1 (en) * 2017-10-26 2019-05-02 Qualcomm Incorporated Methods and systems for applying complex object detection in a video analytics system
CN109753946A (en) * 2019-01-23 2019-05-14 哈尔滨工业大学 A kind of real scene pedestrian's small target deteection network and detection method based on the supervision of body key point



Similar Documents

Publication Publication Date Title
CN110781980B (en) Training method of target detection model, target detection method and device
CN110176027B (en) Video target tracking method, device, equipment and storage medium
CN111160335B (en) Image watermark processing method and device based on artificial intelligence and electronic equipment
CN107358242B (en) Target area color identification method and device and monitoring terminal
CN107944450B (en) License plate recognition method and device
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN110765833A (en) Crowd density estimation method based on deep learning
CN110826596A (en) Semantic segmentation method based on multi-scale deformable convolution
CN111062964B (en) Image segmentation method and related device
CN109798888B (en) Posture determination device and method for mobile equipment and visual odometer
CN111353544B (en) Improved Mixed Pooling-YOLOV 3-based target detection method
US11475681B2 (en) Image processing method, apparatus, electronic device and computer readable storage medium
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN112906794A (en) Target detection method, device, storage medium and terminal
CN111626134A (en) Dense crowd counting method, system and terminal based on hidden density distribution
CN114519877A (en) Face recognition method, face recognition device, computer equipment and storage medium
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN116152226A (en) Method for detecting defects of image on inner side of commutator based on fusible feature pyramid
CN111062347A (en) Traffic element segmentation method in automatic driving, electronic device and storage medium
CN112597996B (en) Method for detecting traffic sign significance in natural scene based on task driving
CN114120284A (en) Deep learning highway lane structuring method, storage medium and device
CN111582057B (en) Face verification method based on local receptive field
CN113487610A (en) Herpes image recognition method and device, computer equipment and storage medium
CN116580232A (en) Automatic image labeling method and system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant