CN113269267A - Training method of target detection model, target detection method and device
- Publication number
- CN113269267A (application CN202110663377.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- target object
- target
- sub
- prediction result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting (Pattern recognition)
- G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (Pattern recognition)
- G06N3/045: Combinations of networks (Neural networks)
- G06N3/08: Learning methods (Neural networks)
- G06V2201/07: Target detection (Indexing scheme relating to image or video recognition or understanding)
Abstract
The invention provides a training method for a target detection model, a target detection method, and a target detection device. A first image, a second image, and an intermediate model are acquired; the second image is input into the intermediate model, which outputs a first prediction result for a specified target object; the second sub-image region corresponding to the specified target object is merged into the first sub-image region of the first image, i.e., the region excluding the first target object, and the intermediate model is trained based on the resulting synthetic image and the first image to obtain the target detection model. Because the synthetic image is obtained by merging the second sub-image region corresponding to the specified target object in the second image into the first sub-image region of the first image, every specified target object in the synthetic image is a target that has already been detected in the second image. The synthetic image therefore contains no missed-detection targets, which improves the quality of the labels it carries and, in turn, the performance of the finally trained target detection model.
Description
Technical Field
The invention relates to the technical field of neural networks, in particular to a training method of a target detection model, a target detection method and a target detection device.
Background
In image processing, targets of interest in an image can be detected through target detection, which determines the positions and categories of the targets in the image. In the related art, target detection can be performed by semi-supervised learning, which usually adopts a pseudo-label technique: a semi-supervised model is first preliminarily trained on a small number of labeled images; pseudo labels for the targets of interest in unlabeled images are then generated with the preliminarily trained semi-supervised model; finally, the preliminarily trained semi-supervised model continues to be trained on the unlabeled images carrying the obtained pseudo labels, yielding the finally trained semi-supervised model. However, because the preliminarily trained model may miss some targets of interest, pseudo labels cannot be obtained for all targets of interest in an unlabeled image, which degrades the quality of the pseudo labels; when the semi-supervised model is trained on pseudo labels of such poor quality, the performance of the trained semi-supervised model is difficult to guarantee.
Disclosure of Invention
The invention aims to provide a training method for a target detection model, a target detection method, and a target detection device, so as to improve the performance of the finally trained semi-supervised model.
The invention provides a training method for a target detection model, which comprises the following steps: acquiring a first image containing a first target object, a second image containing a second target object, and an intermediate model, wherein the first image carries a position label and a category label of the first target object, the second image does not carry a position label or a category label of the second target object, and the intermediate model is obtained by pre-training based on the first image; inputting the second image into the intermediate model, and outputting a first prediction result of a specified target object in the second image, wherein the first prediction result comprises a category prediction result and a position prediction result of the specified target object; merging, based on the position label and the category label of the first target object and the first prediction result, a second sub-image region corresponding to the specified target object into the first sub-image region of the first image excluding the first target object, to obtain a synthetic image, wherein the synthetic image carries the category prediction result of the specified target object and the corresponding position label of the specified target object in the synthetic image; and training the intermediate model based on the first image and the synthetic image to obtain the target detection model.
Further, the intermediate model is obtained by training in the following way: performing data enhancement processing on the first image, inputting the enhanced first image into the initial model, and outputting a second prediction result of the first target object in the enhanced first image through the initial model, wherein the second prediction result comprises: position information and category information of the first target object; calculating a first loss value of a second prediction result of the first target object based on the second prediction result and a preset first loss function; updating the weight parameters of the initial model based on the first loss values; and continuing to perform the step of performing data enhancement processing on the first image and inputting the enhanced first image into the initial model until the initial model converges to obtain an intermediate model.
Further, the second image comprises a plurality of second target objects, and the step of inputting the second image into the intermediate model and outputting a first prediction result of the specified target object in the second image comprises: inputting the second image into the intermediate model to output a third prediction result of each second target object in the second image through the intermediate model, wherein the third prediction result comprises a position prediction result, a category prediction result, and a confidence of each second target object; and deleting the prediction results whose confidence is smaller than a preset confidence threshold from the third prediction results to obtain the first prediction result of the specified target object in the second image.
Further, the step of merging, based on the position label and the category label of the first target object and the first prediction result, the second sub-image region corresponding to the specified target object into the first sub-image region of the first image excluding the first target object to obtain the synthetic image includes: for each first target object in the first image, acquiring the pixel-value mean of the pixel region corresponding to the first target object; replacing the pixel region of the first target object based on the pixel-value mean to obtain a replacement image, wherein the replacement image includes the first sub-image region of the first image excluding the first target objects, and the position label and category label of each first target object; acquiring the second sub-image region corresponding to the specified target object based on the first prediction result of the specified target object in the second image; and merging the second sub-image region into the replacement image based on the position label and the category label of the first target object and the first prediction result, to obtain the synthetic image.
Further, the second image comprises a plurality of specified target objects, each specified target object having a corresponding second sub-image region, and the step of merging the second sub-image region into the replacement image based on the position label and the category label of the first target object and the first prediction result to obtain the synthetic image includes: for the current second sub-image region, judging, based on the category label of the first target object and the category prediction result corresponding to the current second sub-image region, whether the replacement image contains a target position of the same category as the specified target object corresponding to the current second sub-image region; if such a target position exists, placing the current second sub-image region at the target position, and saving the position label of the target position and the category prediction result of the current second sub-image region; and if no such target position exists, taking the next second sub-image region as the new current second sub-image region and continuing the judging step, until the plurality of second sub-image regions of the second image have been traversed, obtaining the synthetic image.
Further, the step of placing the current second sub-image region at the target position when the replacement image contains a target position of the same category as the specified target object corresponding to the current second sub-image region includes: judging whether the current second sub-image region, placed there, would exceed the boundary region of the replacement image; if it would not, placing the current second sub-image region at the target position; and if it would, continuing the step of judging whether the replacement image contains a same-category target position, until the current second sub-image region is placed at a target position.
Further, training the intermediate model based on the first image and the synthetic image to obtain the target detection model includes: respectively performing data enhancement processing on the first image and the synthetic image, inputting the enhanced first image and the enhanced synthetic image into the intermediate model, outputting a fourth prediction result of the first target object in the enhanced first image through the intermediate model, and outputting a fifth prediction result of the specified target object in the enhanced synthetic image; the fourth prediction result comprises position prediction information and category prediction information of the first target object; the fifth prediction result includes position prediction information and category prediction information of the specified target object; calculating a second loss value based on the fourth prediction result, the fifth prediction result and a preset second loss function; updating the weight parameters of the intermediate model based on the second loss values; and continuing to perform the step of respectively performing data enhancement processing on the first image and the synthetic image until the intermediate model converges to obtain the target detection model.
The invention provides a target detection method, which comprises the following steps: acquiring an image containing a target to be detected; inputting the image into a pre-trained target detection model, and outputting a detection result of a target to be detected; the detection result comprises the category and the position coordinate of the target to be detected; the pre-trained target detection model is obtained by training through the training method of the target detection model.
Further, the target detection model includes a feature extraction module, a region generation module, and a localization-classification module, and the step of inputting the image into the pre-trained target detection model and outputting the detection result of the target to be detected includes: inputting the image into the feature extraction module, so as to output the target features of the target to be detected through the feature extraction module; inputting the target features into the region generation module, so as to output candidate boxes containing the target to be detected through the region generation module; and inputting the target features and the candidate boxes into the localization-classification module, so as to output the category and position coordinates of the target to be detected through the localization-classification module.
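As an illustration of this three-module flow, the sketch below runs torchvision's Faster R-CNN, whose backbone, RPN, and ROI head play the roles of the feature extraction, region generation, and localization-classification modules; the COCO-pretrained weights, input file name, and 0.7 score cutoff are illustrative assumptions, not the patent's trained model.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-pretrained weights stand in for a trained target detection model.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("street_scene.jpg").convert("RGB"))  # hypothetical input
with torch.no_grad():
    pred = model([image])[0]  # dict with "boxes", "labels", and "scores"

for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score >= 0.7:  # illustrative confidence cutoff
        print(f"category {label.item()} at {[round(v, 1) for v in box.tolist()]} "
              f"(score {score:.2f})")
```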
The invention provides a training apparatus for a target detection model, which comprises: a first acquisition module for acquiring a first image containing a first target object, a second image containing a second target object, and an intermediate model, wherein the first image carries a position label and a category label of the first target object, the second image does not carry a position label or a category label of the second target object, and the intermediate model is obtained by pre-training based on the first image; a first output module for inputting the second image into the intermediate model and outputting a first prediction result of a specified target object in the second image, wherein the first prediction result comprises a category prediction result and a position prediction result of the specified target object; a merging module for merging, based on the position label and the category label of the first target object and the first prediction result, a second sub-image region corresponding to the specified target object into the first sub-image region of the first image excluding the first target object, to obtain a synthetic image, wherein the synthetic image carries the category prediction result of the specified target object and the corresponding position label of the specified target object in the synthetic image; and a training module for training the intermediate model based on the first image and the synthetic image to obtain a target detection model.
The invention provides a target detection device, comprising: the second acquisition module is used for acquiring an image containing a target to be detected; the second output module is used for inputting the image into a pre-trained target detection model and outputting a detection result of the target to be detected; the detection result comprises the category and the position coordinate of the target to be detected; the pre-trained target detection model is obtained through training of a training device of the target detection model.
The invention provides an electronic device, which comprises a processor and a memory, wherein the memory stores machine executable instructions capable of being executed by the processor, and the processor executes the machine executable instructions to realize the training method of the target detection model or the target detection method.
The present invention provides a machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the above-described method of training an object detection model, or the above-described method of object detection.
The invention provides a training method for a target detection model, a target detection method, and a device. A first image containing a first target object, a second image containing a second target object, and an intermediate model are acquired; the second image is input into the intermediate model, which outputs a first prediction result of a specified target object in the second image; based on the position label and the category label of the first target object and the first prediction result, the second sub-image region corresponding to the specified target object is merged into the first sub-image region of the first image excluding the first target object, to obtain a synthetic image; and the intermediate model is trained based on the first image and the synthetic image to obtain the target detection model. Because the synthetic image is obtained by merging the second sub-image region corresponding to the specified target object in the second image into the first sub-image region of the first image excluding the first target object, every specified target object in the synthetic image is a target already detected in the second image. The synthetic image therefore contains no missed-detection targets, which improves the quality of the labels it carries and, in turn, the performance of the finally trained target detection model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a training method for a target detection model according to an embodiment of the present invention;
FIG. 2 is a flowchart of another training method for a target detection model according to an embodiment of the present invention;
FIG. 3 is a flowchart of another training method for a target detection model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a target detection method according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of a two-stage semi-supervised target detection method based on pseudo-label improvement according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating the improvement of a synthetic image and its pseudo label according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a synthetic image according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a training apparatus for a target detection model according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a target detection apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Deep learning models for target detection currently exhibit strong performance, which benefits from large-scale labeled datasets and sufficient computational resources; however, such supervised target detection algorithms rely heavily on the scale of the labeled dataset, and labeling data carries high economic and time costs. Semi-supervised learning uses unlabeled training data to improve the detector, which can greatly reduce labeling cost. At present, semi-supervised learning is mainly applied to the image classification task. Compared with other important problems in computer vision such as target detection, the labeling cost of image classification is very low, mainly because an image in a classification task usually contains a single target object that does not need to be localized, whereas an image in a detection task usually contains multiple target objects that must be detected and localized. Semi-supervised learning for the target detection task therefore has higher practical application value.
Data enhancement is crucial to semi-supervised learning: it not only improves the generality and robustness of the model, but has also proved very effective in consistency-based semi-supervised training. Most research on data enhancement methods has focused on the image classification field, covering strategies that range from manual combinations of elementary image transformations (e.g., rotation, translation, flipping, or color dithering) to neural image synthesis and enhancement policies learned by reinforcement learning. However, data enhancement suitable for target detection is far more complex than that for image classification; for example, a global geometric transformation of the data may affect the bounding-box annotations. Data enhancement methods designed for image classification are therefore less suitable for target detection algorithms.
In addition, using pseudo labels is also an effective method in semi-supervised learning: pseudo labels are generally generated for unlabeled data by a model trained on a small amount of labeled data, and semi-supervised training then proceeds with the pseudo labels and the unlabeled data. However, pseudo labels obtained from such a model may suffer from missed detections, which causes category imbalance (some categories are detected well while others are not detected), and from inaccurate localization; both degrade pseudo-label quality and can cause large deviations of the model after multiple iterations. Moreover, pseudo-label screening in semi-supervised learning usually filters the obtained pseudo labels by a confidence threshold: too high a threshold causes many targets in the unlabeled data to be missed, while too low a threshold introduces falsely detected targets, so tuning the confidence threshold is difficult. The labels that survive filtering also tend to concentrate on targets that are easy to detect, while hard-to-detect targets are filtered out due to low confidence, which further aggravates category imbalance and affects the performance of the finally trained semi-supervised model.
In the related art, data enhancement methods applied to target detection mainly include the following. Three elements, three-dimensional change, lens distortion, and illumination change, can be added to traditional data enhancement to realize data enhancement for target detection, but this enhancement has limited effect on the detection task and can increase the learning difficulty of the model for complex scenes. Another method adopts infrared-image data enhancement: the required images are generated by image conversion, and a generative adversarial network is built as an infrared image generator that converts an input color image from the color domain to the infrared domain. A further data enhancement method pastes targets using masks, but the paste positions in this method can be unreasonable, and it has not been applied to semi-supervised methods. Other semi-supervised methods use pseudo labels together with flip-and-crop data enhancement, but flip-and-crop enhancement improves the target detection model only to a limited extent.
Based on this, embodiments of the invention provide a training method for a target detection model, a target detection method, and corresponding devices. The technique can be applied to target detection on images, and in particular to the training of a semi-supervised model for target detection.
In order to facilitate understanding of the embodiments, a training method for a target detection model disclosed in an embodiment of the present invention is first described in detail; as shown in FIG. 1, the method comprises the following steps:
step S102, acquiring a first image containing a first target object, a second image containing a second target object and an intermediate model; the first image carries a position label and a category label of a first target object; the second image does not carry the position label and the category label of the second target object; the intermediate model is obtained by pre-training based on the first image.
The first image and the second image may be sample images obtained from a preset training sample set; for example, they may be selected from the Cityscapes target detection dataset, which contains 5000 images of driving scenes in urban environments, each generally including various types of targets such as vehicles, pedestrians, and traffic lights. The training set of the Cityscapes dataset can be randomly split: 10% of it, with position labels and category labels annotated for the first target objects, serves as the set of first images, i.e., the labeled dataset, while the remaining 90% serves as the set of second images, i.e., the unlabeled dataset, which does not carry position labels or category labels for the second target objects. The first image and the second image may also be obtained from different training sample sets. The first target object may be an object of interest in the first image, and the second target object an object of interest in the second image; there may be multiple first target objects and multiple second target objects. The position label indicates the position region of the first target object in the first image and can be expressed in the form of position coordinates; the category label indicates the category of the first target object, such as car, bus, pedestrian, bicycle, or truck. The intermediate model, which may also be referred to as a supervised target detection model, can be trained in advance based on the first images, i.e., the labeled images. In actual implementation, when a target detection model needs to be trained, a first image containing a first target object, a second image containing a second target object, and an intermediate model are generally acquired.
Step S104, inputting the second image into the intermediate model, and outputting a first prediction result of a specified target object in the second image; wherein the first prediction result comprises a category prediction result and a location prediction result of the specified target object.
The specified target object may be at least a part of the second target object in the second image, or may be all of the second target object; the category prediction result may be used to predict a category to which the designated target object belongs, and the position prediction result may be used to predict a position area of the designated target object in the second image; in practical implementation, after an intermediate model is obtained based on training of a first image, namely, a labeled image, a second image, namely, an unlabeled image, may be input to the intermediate model to output a category prediction result and a position prediction result of a specified target object in the second image; the confidence of the category prediction result and the location prediction result for the specified target object is typically relatively high.
Step S106, merging, based on the position label and the category label of the first target object and the first prediction result, a second sub-image region corresponding to the specified target object into the first sub-image region of the first image excluding the first target object, to obtain a synthetic image; the synthetic image carries the category prediction result of the specified target object and the corresponding position label of the specified target object in the synthetic image.
The second sub-image region can be understood as the image region occupied by the specified target object in the second image; the first sub-image region can be understood as the background image region of the first image excluding the first target object. In actual implementation, the second sub-image region corresponding to the specified target object may be merged into the first sub-image region excluding the first target object, based on the position label and category label of the first target object and the first prediction result, to obtain a synthetic image. For example, the position of a first target object belonging to the same category may be selected from the first image according to the category prediction result in the first prediction result of the specified target object, and the specified target object may be filled into the position of the selected first target object. The obtained synthetic image usually carries the category prediction result of each filled specified target object and the corresponding position label of the specified target object in the synthetic image, i.e., the position label of the first target object selected from the first image that belongs to the same category as the specified target object.
Step S108, training the intermediate model based on the first image and the synthetic image to obtain a target detection model.
In actual implementation, after the synthetic image is obtained, the intermediate model may continue to be trained based on the first image and the obtained synthetic image, so as to obtain the finally trained target detection model. For example, after the first image and the synthetic image are subjected to data enhancement processing such as flipping and rotation, the enhanced first image and the enhanced synthetic image are input into the intermediate model to continue training it, obtaining the target detection model, which is a semi-supervised target detection model.
In this training method, a first image containing a first target object, a second image containing a second target object, and an intermediate model are acquired; the second image is input into the intermediate model, which outputs a first prediction result of a specified target object in the second image; based on the position label and the category label of the first target object and the first prediction result, the second sub-image region corresponding to the specified target object is merged into the first sub-image region of the first image excluding the first target object, to obtain a synthetic image; and the intermediate model is trained based on the first image and the synthetic image to obtain the target detection model. Because every specified target object in the synthetic image is a target already detected in the second image, the synthetic image contains no missed-detection targets; this improves the quality of the labels the synthetic image carries and, in turn, the performance of the finally trained target detection model.
The following introduces a training method of the intermediate model, which can be specifically realized by the following steps one to three:
step one, performing data enhancement processing on a first image, inputting the enhanced first image into an initial model, and outputting a second prediction result of a first target object in the enhanced first image through the initial model, wherein the second prediction result comprises: location information and category information of the first target object.
For convenience of description, take the Faster R-CNN algorithm model as an example. The Faster R-CNN model may include a ResNet50 backbone network (ResNet: Residual Network) and two further network modules, an RPN (Region Proposal Network) and an ROI Head (the head network of the Region of Interest, ROI). The ResNet50 backbone is responsible for extracting image features, and its initialization adopts pre-trained weights on the ImageNet dataset. The RPN module is used to screen out candidate boxes, which are boxes that may contain target objects and whose number is usually far larger than the number of real target objects. The ROI Head is used for fine localization and classification of target objects: for example, it can adjust the upper-left and lower-right coordinates of a target object's candidate box to realize fine localization, and it also outputs the category of the target object. Specifically, the first image is input into the ResNet50 backbone of the initial model; the output of the backbone is connected to the inputs of the RPN and the ROI Head respectively, the output of the RPN is also connected to the input of the ROI Head, and the ROI Head outputs, based on the image features from the backbone and the candidate boxes from the RPN, the predicted position region occupied by the first target object in the first image and the category to which the first target object belongs.
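Under the assumption of a recent torchvision, the architecture just described might be built as below; the FPN variant of the ResNet50 backbone, and a class count of nine for eight road-scene categories plus background, are illustrative choices rather than the embodiment's exact configuration.

```python
from torchvision.models import ResNet50_Weights
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# ResNet50 backbone initialized with ImageNet pre-trained weights, as described;
# the RPN and ROI Head are constructed fresh and trained on the labeled images.
model = fasterrcnn_resnet50_fpn(
    weights=None,                                     # no pretrained detector weights
    weights_backbone=ResNet50_Weights.IMAGENET1K_V1,  # ImageNet backbone initialization
    num_classes=9,                                    # assumed: 8 categories + background
)
```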
Step two, calculating a first loss value of the second prediction result of the first target object based on the second prediction result and a preset first loss function.
The first loss function may also be referred to as a supervised loss function, and may be:

$$L_{sup} = \sum_{i} \Big( L_{cls}^{rpn}\big(x_i^{l}, y_i^{l}\big) + L_{reg}^{rpn}\big(x_i^{l}, y_i^{l}\big) + L_{cls}^{roi}\big(x_i^{l}, y_i^{l}\big) + L_{reg}^{roi}\big(x_i^{l}, y_i^{l}\big) \Big)$$

wherein $L_{cls}^{rpn}$ represents the classification loss of the RPN; $L_{reg}^{rpn}$ represents the regression loss of the RPN; $L_{cls}^{roi}$ represents the classification loss of the ROI Head; $L_{reg}^{roi}$ represents the regression loss of the ROI Head; $x_i^{l}$ indicates a labeled image; and $y_i^{l}$ represents the label of the labeled image.
the first loss value can be understood as a difference between the second prediction result of the first target object and the real label of the first target object; in actual implementation, after the second prediction result is obtained, a first loss value corresponding to the second prediction result is calculated according to the second prediction result and a preset first loss function.
Step three, updating the weight parameters of the initial model based on the first loss value; and continuing to perform the step of performing data enhancement processing on the first image and inputting the enhanced first image into the initial model, until the initial model converges, obtaining the intermediate model.
The weight parameters may include all parameters in the initial model, such as convolution kernel parameters. When the initial model is trained, all of its parameters are generally updated based on the second prediction result of the first target object and the true label of the first target object; the step of inputting the enhanced first image into the initial model is then performed again, until the initial model converges, or the first loss value converges, finally yielding the trained intermediate model. For example, the labeled dataset, i.e., the 10% of the Cityscapes training set randomly split off as described above, can be input into the Faster R-CNN network after simple flip-and-rotation data enhancement, and iterative training is performed until the initial model converges, yielding the supervised target detection model, namely the intermediate model.
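A condensed sketch of this supervised pre-training stage follows, assuming `model` is the Faster R-CNN built in the earlier sketch and `labeled_loader` is a hypothetical loader yielding lists of image tensors and target dicts in torchvision's detection format; horizontal flipping stands in for the flip-and-rotation enhancement.

```python
import random
import torch
import torchvision.transforms.functional as TF

def hflip_with_boxes(image, target):
    # Flip the image tensor and mirror the x-coordinates of its boxes so that the
    # position labels stay consistent with the enhanced image.
    width = image.shape[-1]
    boxes = target["boxes"].clone()
    boxes[:, [0, 2]] = width - boxes[:, [2, 0]]
    return TF.hflip(image), {**target, "boxes": boxes}

optimizer = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.9)
model.train()
for epoch in range(12):                        # "until convergence"; count is illustrative
    for images, targets in labeled_loader:     # hypothetical loader of labeled first images
        if random.random() < 0.5:              # simple data enhancement
            images, targets = map(list, zip(*(hflip_with_boxes(i, t)
                                              for i, t in zip(images, targets))))
        loss_dict = model(images, targets)     # RPN/ROI classification and regression losses
        loss = sum(loss_dict.values())         # the supervised (first) loss value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```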
An embodiment of the invention provides another training method for a target detection model, implemented on the basis of the method of the above embodiment. In this method, the second image comprises a plurality of second target objects and a plurality of specified target objects, each specified target object having a corresponding second sub-image region. In actual implementation, the specified target objects are usually at least a part of the second target objects; for example, if the second image contains 10 second target objects, the specified target objects may be 6 of them, or may be all of them. As shown in FIG. 2, the method comprises the following steps:
step S202, acquiring a first image containing a first target object, a second image containing a second target object and an intermediate model; the first image carries a position label and a category label of a first target object; the second image does not carry the position label and the category label of the second target object; the intermediate model is obtained by pre-training based on the first image.
Step S204, inputting the second image into the intermediate model so as to output a third prediction result of each second target object in the second image through the intermediate model; wherein the third prediction result comprises: a location prediction result, a category prediction result, and a confidence for each second target object.
In actual implementation, after the intermediate model is obtained by training on the first image, the second image may be input into it to output the position prediction result, category prediction result, and confidence of each second target object in the second image, that is, to output a pseudo label for each second target object. The position prediction result indicates the predicted position of each second target object in the second image; for example, the upper-left and lower-right coordinates of the candidate box corresponding to each second target object may be output. The category prediction result indicates the category to which each second target object is predicted to belong. The confidence indicates the likelihood that the position and category predictions of a second target object are correct; it is usually a probability value, and the higher the confidence, the more likely the prediction is correct.
Step S206, deleting the prediction results whose confidence is smaller than the preset confidence threshold from the third prediction results, to obtain the first prediction result of the specified target objects in the second image.
The confidence threshold may be set according to actual requirements; for example, it may be set to 0.7. In actual implementation, the confidences of the third prediction results of the second target objects usually differ, some higher and some lower, so the third prediction results whose confidence is smaller than the preset confidence threshold may be deleted, yielding the first prediction result of the specified target objects in the second image; that is, the specified target objects are the target objects in the second image whose confidence is not lower than the preset confidence threshold, and they may be a part of the second target objects or all of them. In addition, because each third prediction result may contain multiple overlapping candidate boxes, the overlapping boxes are usually removed by Non-Maximum Suppression (NMS) after the third prediction results are obtained (other removal methods may also be used), and the predictions with lower confidence, together with their corresponding target objects, are then filtered out by the preset confidence threshold. A target object corresponding to a low-confidence prediction may be a falsely detected target; without this screening of the pseudo labels, pseudo-label quality would be poor.
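The screening just described can be sketched as follows; torchvision's detector already applies NMS internally, so the explicit `batched_nms` pass (per-category suppression) simply mirrors the description, and the 0.7 score threshold is the example value above while the 0.5 IoU threshold is an assumption.

```python
import torch
from torchvision.ops import batched_nms

@torch.no_grad()
def make_pseudo_labels(model, image, score_thresh=0.7, iou_thresh=0.5):
    model.eval()
    pred = model([image])[0]                    # third prediction results for one image
    keep = batched_nms(pred["boxes"], pred["scores"], pred["labels"], iou_thresh)
    boxes, labels, scores = (pred["boxes"][keep], pred["labels"][keep],
                             pred["scores"][keep])
    confident = scores >= score_thresh          # drop low-confidence predictions
    # What survives are the first prediction results of the specified target objects.
    return boxes[confident], labels[confident], scores[confident]
```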
Step S208, for each first target object in the first image, obtaining a pixel value mean of a pixel region corresponding to the first target object.
The pixel region corresponding to each first target object generally includes a plurality of pixel points, and the pixel-value mean may be the average of the pixel values of those pixel points. In practical implementation, for each first target object in the first image, the pixel-value mean of the corresponding pixel region may be determined from the pixel values of the pixel points in that region.
Step S210, replacing the pixel region of the first target object based on the pixel-value mean to obtain a replacement image; wherein the replacement image includes: the first sub-image region of the first image excluding the first target objects, and the position label and the category label of each first target object.
In actual implementation, the image pixels of the pixel regions corresponding to all first target objects in the first image may be removed, and the pixel-value mean obtained for each first target object is filled into the corresponding pixel region, so as to obtain a first image that contains only the first sub-image region excluding the first target objects, together with the position label and category label of each first target object. The first sub-image region may also be referred to as the background pixel region, and each first target object may also be referred to as a foreground object.
In step S212, a second sub-image region corresponding to the designated target object is acquired based on the first prediction result of the designated target object in the second image.
In practical implementation, the second sub-image region corresponding to each designated target object in the second image may be cut out from the second image according to the first prediction result of the designated target object in the second image, that is, according to the category prediction result and the position prediction result of the designated target object, so as to obtain a plurality of second sub-image regions corresponding to the designated target objects.
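Steps S208 to S212 reduce to two small tensor operations, sketched below under the assumptions that images are CHW float tensors and boxes are (x1, y1, x2, y2) rows; reading the "pixel value mean" as a per-channel mean over the box region is an interpretation, not mandated by the description.

```python
import torch

def erase_targets_with_mean(image, boxes):
    # Replace each first target object's pixel region with the mean value of that
    # region, leaving a replacement image of background plus flat filled patches.
    out = image.clone()
    for x1, y1, x2, y2 in boxes.round().long().tolist():
        region = out[:, y1:y2, x1:x2]
        out[:, y1:y2, x1:x2] = region.mean(dim=(1, 2), keepdim=True)
    return out

def crop_specified_targets(image, boxes):
    # Cut out the second sub-image region of each specified target object.
    return [image[:, y1:y2, x1:x2] for x1, y1, x2, y2 in boxes.round().long().tolist()]
```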
Step S214, merging the second sub-image region into the replacement image based on the position label and the category label of the first target object and the first prediction result, so as to obtain a synthetic image.
This step S214 can be specifically realized by the following steps eleven to fourteen:
and eleventh, judging whether a target position of a specified target object corresponding to the current second sub-image area belongs to the same category exists in the replacement image or not according to the category label of the first target object and the category prediction result corresponding to the current second sub-image area.
Since there may be a plurality of specified target objects, each with a corresponding second sub-image region, a same-category target position may be selected from the replacement image for each second sub-image region in turn. Specifically, it may be determined, based on the category label of the first target object and the category prediction result corresponding to the current second sub-image region, whether the replacement image contains a target position of the same category as the specified target object corresponding to the current second sub-image region.
Step twelve, if the replacement image contains a target position of the same category as the specified target object corresponding to the current second sub-image region, placing the current second sub-image region at the target position.
Step twelve can be realized by the following steps A to C:
and step A, if a target position of the specified target object corresponding to the current second sub-image area in the replacement image belongs to the same category, judging whether the area size of the current second sub-image area exceeds the boundary area of the replacement image.
Step B, if the current second sub-image region does not exceed the boundary region of the replacement image, placing the current second sub-image region at the target position.
The boundary region may also be referred to as an image boundary, and may be used to indicate the image range size of the replacement image; the current second sub-image area may be filled in the target position if the size of the current second sub-image area does not exceed the boundary area of the replacement image.
Step C, if the current second sub-image region would exceed the boundary region of the replacement image, continuing the step of judging whether the replacement image contains another same-category target position, until the current second sub-image region is placed at a target position.

That is, if placing the current second sub-image region at the selected position would exceed the image boundary of the replacement image, the next target position of the same category is selected, until a suitable target position is found and the current second sub-image region is filled into it.
Step thirteen, saving the position label of the target position and the category prediction result of the current second sub-image region.

After the current second sub-image region is placed at the target position, the category prediction result of the specified target object corresponding to the current second sub-image region and the position label of the filled target position may be recorded; this position label is the position label of the first target object originally located at the target position.
Step fourteen, if the replacement image contains no target position of the same category as the specified target object corresponding to the current second sub-image region, taking the next second sub-image region as the new current second sub-image region and continuing the judging step of step eleven, until the plurality of second sub-image regions of the second image have been traversed, obtaining the synthetic image.

That is, if no same-category target position exists in the replacement image, filling the current second sub-image region is abandoned, the next second sub-image region is taken as the new current second sub-image region, and steps eleven to fourteen are repeated until all second sub-image regions have been traversed, obtaining the synthetic image. The category prediction result and position label recorded for each second sub-image region in the above steps can be used as the pseudo label of the synthetic image, and the synthetic image is output together with its pseudo label.
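The traversal of steps eleven to fourteen can be condensed into the sketch below, assuming CHW image tensors, `slot_boxes` as a tensor of (x1, y1, x2, y2) rows with `slot_labels` as a plain list of category integers taken from the first image's annotations, and `patches`/`patch_labels` as the cropped regions and predicted categories from the earlier sketches; anchoring each patch at its slot's top-left corner is one possible reading of placing it at the target position.

```python
import torch

def synthesize(replacement_image, slot_boxes, slot_labels, patches, patch_labels):
    """Paste each predicted patch into an unused same-category slot; return the
    synthetic image and its pseudo label (recorded positions and categories)."""
    _, height, width = replacement_image.shape
    synthetic = replacement_image.clone()
    used, kept_boxes, kept_labels = set(), [], []
    for patch, label in zip(patches, patch_labels):
        ph, pw = patch.shape[1:]
        for i, (box, slot_label) in enumerate(zip(slot_boxes.tolist(), slot_labels)):
            if i in used or slot_label != label:
                continue                      # only same-category target positions qualify
            x1, y1 = int(box[0]), int(box[1])
            if x1 + pw > width or y1 + ph > height:
                continue                      # would exceed the boundary region; try the next slot
            synthetic[:, y1:y1 + ph, x1:x1 + pw] = patch
            kept_boxes.append([x1, y1, x1 + pw, y1 + ph])  # saved position label
            kept_labels.append(label)                      # saved category prediction result
            used.add(i)
            break
        # If no same-category slot fits, the patch is skipped (step fourteen).
    pseudo_label = {"boxes": torch.tensor(kept_boxes, dtype=torch.float32),
                    "labels": torch.tensor(kept_labels, dtype=torch.int64)}
    return synthetic, pseudo_label
```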
Step S216, training the intermediate model based on the first image and the synthetic image to obtain a target detection model.
With this training method for a target detection model, a first image containing a first target object, a second image containing a second target object, and an intermediate model are acquired; the second image is input into the intermediate model to output the third prediction result of each second target object in the second image, and the prediction results whose confidence is below the preset confidence threshold are deleted from the third prediction results to obtain the first prediction result of the specified target objects in the second image. For each first target object in the first image, the pixel-value mean of the corresponding pixel region is obtained and used to replace that pixel region, yielding a replacement image; the second sub-image region corresponding to each specified target object is acquired based on its first prediction result and merged into the replacement image based on the position label and category label of the first target object and the first prediction result, yielding a synthetic image; and the intermediate model is trained based on the first image and the synthetic image to obtain the target detection model. Because every specified target object in the synthetic image is a target already detected in the second image, the synthetic image contains no missed-detection targets, which improves the quality of the labels it carries and, in turn, the performance of the finally trained target detection model.
An embodiment of the present invention further provides another training method for a target detection model, implemented on the basis of the method of the above embodiment; as shown in FIG. 3, the method includes the following steps:
step S302, acquiring a first image containing a first target object, a second image containing a second target object and an intermediate model; the first image carries a position label and a category label of a first target object; the second image does not carry the position label and the category label of the second target object; the intermediate model is obtained by pre-training based on the first image.
Step S304, inputting the second image into the intermediate model, and outputting a first prediction result of a specified target object in the second image; wherein the first prediction result comprises a category prediction result and a location prediction result of the specified target object.
Step S306, merging a second sub-image area corresponding to the specified target object into the first image except the first sub-image area of the first target object based on the position label and the category label of the first target object and the first prediction result to obtain a synthesized image; the synthetic image carries a category prediction result of the specified target object and a corresponding position label of the specified target object in the synthetic image.
Step S308, respectively performing data enhancement processing on the first image and the synthetic image, and inputting the enhanced first image and the enhanced synthetic image into the intermediate model, so as to output, through the intermediate model, a fourth prediction result of the first target object in the enhanced first image and a fifth prediction result of the specified target object in the enhanced synthetic image; the fourth prediction result comprises position prediction information and category prediction information of the first target object; the fifth prediction result comprises position prediction information and category prediction information of the specified target object.
Training of the intermediate model continues based on the first image and the synthetic image. Specifically, data enhancement processing may be performed on the first image and the synthetic image respectively, the enhanced first image and synthetic image may be input to the intermediate model, and the intermediate model may output the position prediction information and category prediction information of the first target object in the enhanced first image as well as the position prediction information and category prediction information of the specified target object contained in the enhanced synthetic image.
And step S310, calculating a second loss value based on the fourth prediction result, the fifth prediction result and a preset second loss function.
The second loss function may be obtained by weighting a supervised loss function and an unsupervised loss function, where the unsupervised loss function takes the same expression as the supervised loss function. The second loss function is as follows:

L = L_sup + α · L_unsup

where L_sup is the supervised loss function, i.e., the same expression as the first loss function, and corresponds to the above first image; L_unsup is the unsupervised loss function and corresponds to the above synthetic image; α is the weight of the unsupervised loss function. Experiments show that the model training effect is relatively good when α takes the value 2, so 2 may be taken as a preferred value.
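As a sketch, the weighted combination above can be computed as follows, assuming `loss_fn` is the shared supervised/unsupervised loss expression (the first loss function); all names are illustrative, not the patent's implementation:

```python
ALPHA = 2.0  # weight of the unsupervised term; 2 was found empirically to work well

def second_loss(pred_first, targets_first, pred_synth, pseudo_targets, loss_fn):
    l_sup = loss_fn(pred_first, targets_first)      # L_sup: on the enhanced first image
    l_unsup = loss_fn(pred_synth, pseudo_targets)   # L_unsup: on the enhanced synthetic image
    return l_sup + ALPHA * l_unsup                  # L = L_sup + alpha * L_unsup
```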
Step S312, updating the weight parameter of the intermediate model based on the second loss value; and continuing to perform the step of respectively performing data enhancement processing on the first image and the synthetic image until the intermediate model converges to obtain a target detection model.
The weight parameters may include all parameters in the intermediate model, such as convolution kernel parameters. When the intermediate model is trained, all of its parameters generally need to be updated based on the second loss value; the step of performing data enhancement processing on the first image and the synthetic image respectively is then continued until the intermediate model converges (or the second loss value converges), finally yielding the trained target detection model.
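A hedged sketch of this update loop follows (PyTorch is assumed; `augment`, the data loaders, the optimizer choice and the fixed epoch count standing in for "until convergence" are all illustrative, not prescribed by the patent):

```python
import torch

def augment(images):
    # placeholder enhancement: random horizontal flip (the patent's strategy
    # mentions flipping and rotation)
    return torch.flip(images, dims=[-1]) if torch.rand(1).item() < 0.5 else images

def train_intermediate(model, labeled_loader, synthetic_loader, loss_fn,
                       alpha=2.0, lr=1e-3, epochs=50):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):  # stand-in for "until the model converges"
        for (imgs, targets), (synths, pseudo) in zip(labeled_loader, synthetic_loader):
            loss = loss_fn(model(augment(imgs)), targets) \
                 + alpha * loss_fn(model(augment(synths)), pseudo)  # second loss value
            optimizer.zero_grad()
            loss.backward()          # update all weight parameters of the model
            optimizer.step()
    return model
```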
The target detection problem usually suffers from data imbalance, including imbalance between foreground and background and imbalance between target categories. The foreground-background imbalance can be understood as candidate frames containing little foreground and much background, where the foreground generally refers to the target object in a candidate frame and the background generally refers to the part of the candidate frame other than the foreground. The category imbalance problem can be well alleviated by Focal Loss, which is designed for the severe imbalance between the proportions of positive and negative samples in target detection. The Focal Loss function reduces the weight that a large number of simple negative samples occupy in training; this can also be understood as hard sample mining, since samples with low confidence receive larger loss weights, so the model concentrates on hard samples rather than easy ones. Therefore, in the ROI Head classifier, the multi-class Focal Loss is used to replace the standard cross entropy CE (cross entropy), which can well reduce the bias of the model. The function expression of the standard cross entropy is as follows:
CE(p, y) = CE(p_t) = -log(p_t)
The loss function expression of Focal Loss is as follows:
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)
where p represents the predicted value of the model; y represents the true value; p_t represents the probability of the model's predicted value, i.e., the confidence of the prediction, typically a value between 0 and 1; α_t and γ are parameters specific to Focal Loss that can be used to control the weights of different samples, and their values can be set according to actual requirements.
The intermediate model can be trained with the Focal Loss function on the first image set and the synthetic image set until the intermediate model converges; this improvement of the loss function well alleviates the category imbalance problem in semi-supervised learning.
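A minimal multi-class Focal Loss sketch consistent with the formula above (the α_t and γ values shown are common defaults, not values fixed by the patent):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha_t=0.25, gamma=2.0):
    """Multi-class Focal Loss: logits is (N, C), targets is (N,) class indices."""
    log_pt = F.log_softmax(logits, dim=-1)
    log_pt = log_pt.gather(1, targets.unsqueeze(1)).squeeze(1)  # log(p_t) of true class
    pt = log_pt.exp()
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over the batch
    return (-alpha_t * (1.0 - pt) ** gamma * log_pt).mean()
```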
The training method of the target detection model obtains a first image containing a first target object, a second image containing a second target object and an intermediate model; inputting the second image into the intermediate model, and outputting a first prediction result of a specified target object in the second image; merging a second sub-image area corresponding to the specified target object into a first sub-image area except the first target object in the first image to obtain a composite image based on the position label and the category label of the first target object and the first prediction result; respectively performing data enhancement processing on the first image and the synthetic image, inputting the enhanced first image and the enhanced synthetic image into the intermediate model, outputting a fourth prediction result of the first target object in the enhanced first image through the intermediate model, and outputting a fifth prediction result of the specified target object in the enhanced synthetic image; and calculating a second loss value based on the fourth prediction result, the fifth prediction result and a preset second loss function. Updating the weight parameters of the intermediate model based on the second loss values; and continuing to perform the step of respectively performing data enhancement processing on the first image and the synthetic image until the intermediate model converges to obtain a target detection model. The synthesized image in the method is obtained by combining a second sub-image area corresponding to the specified target object in the second image with a first sub-image area except the first target object in the first image, namely, the specified target object in the synthesized image is the detected target object in the second image, so that no missing detection target exists in the synthesized image, the quality of a label carried by the synthesized image is improved, and the performance of a finally trained target detection model is improved.
An embodiment of the present invention provides a target detection method, as shown in fig. 4, the method includes the following steps:
step S402, acquiring an image containing the target to be detected.
The image may be an image captured by a video camera or a camera, or the image may be a pre-stored image or the like; the target to be detected can be an interested target in the image, such as a vehicle, a pedestrian, a traffic signal lamp and the like; in practical implementation, when the target needs to be detected, an image including the target to be detected is usually acquired first.
Step S404, inputting the image into a pre-trained target detection model, and outputting a detection result of a target to be detected; the detection result comprises the category and the position coordinate of the target to be detected; the pre-trained target detection model is obtained by training through the training method of the target detection model in the embodiment.
In practical implementation, after the image containing the target to be detected is acquired, it can be input into the pre-trained target detection model to output the detection result of the target to be detected. The target detection model is obtained by training the intermediate model based on the first image and the synthesized image; the first image carries the position label and category label of the first target object it contains, and the intermediate model is pre-trained based on the first image. The synthesized image is obtained by merging, based on the position label and category label of the first target object and a first prediction result, the second sub-image region corresponding to the specified target object into the first image except the first sub-image region of the first target object; the first prediction result is the category prediction result and position prediction result of the specified target object output after the second image is input into the intermediate model, where the second image does not carry position labels or category labels for the second target objects it contains.
The synthesized image in the method is obtained by combining a second sub-image area corresponding to the specified target object in the second image with a first sub-image area except the first target object in the first image, namely, the specified target object in the synthesized image is the detected target object in the second image, so that no missing detection target exists in the synthesized image, the quality of the label carried by the synthesized image is improved, and the performance of the finally trained target detection model is improved.
The target detection method comprises the steps of firstly, obtaining an image containing a target to be detected; then inputting the image into a pre-trained target detection model, and outputting a detection result of the target to be detected; the pre-trained target detection model is obtained by training based on the method in the embodiment, so that the performance of the target detection model is better, and the accuracy of the detection result of the target to be detected in the image can be improved.
The embodiment of the invention provides another target detection method, implemented on the basis of the method of the above embodiment, wherein the target detection model comprises a feature extraction module, a region generation module and a positioning classification module; the feature extraction module can be used for extracting image features; the region generation module can be used for screening out candidate frames that may contain the target to be detected; the positioning classification module can be used for precisely positioning the target to be detected and determining its category; the method comprises the following steps:
step 502, an image containing a target to be detected is acquired.
Step 504, inputting the image to the feature extraction module, so as to output the target feature of the target to be detected through the feature extraction module.
The target features may include a color, a shape, a size, and the like of the target to be detected, and in actual implementation, after an image including the target to be detected is obtained, the image is usually input to a feature extraction module in a target detection model to extract the target features of the target to be detected in the image.
Step 506, inputting the target characteristics into the area generation network, so as to output the candidate frame containing the target to be detected through the area generation network.
After the target features are extracted, they may be input into the area generation network to output candidate frames that may contain the target to be detected; there may be multiple candidate frames, and the sizes of the multiple candidate frames are usually different.
And step 508, inputting the target characteristics and the candidate box containing the target to be detected into the positioning and classifying module, so as to output the category and the position coordinates of the target to be detected through the positioning and classifying module.
The target features and the candidate frames are input into the positioning classification module, which outputs the category of the target to be detected based on the target features and the candidate frames, determines the best-matching candidate frame containing the target to be detected from the plurality of candidate frames, and adjusts the position of that candidate frame to realize fine positioning; specifically, the upper-left corner coordinate and the lower-right corner coordinate of the best-matching candidate frame containing the target to be detected are output.
In the above target detection method, an image containing the target to be detected is acquired; the image is input to the feature extraction module to output the target features of the target to be detected; the target features are input into the area generation network to output candidate frames containing the target to be detected; and the target features and the candidate frames are input into the positioning classification module to output the category and position coordinates of the target to be detected. Through the feature extraction module, the area generation network and the positioning classification module, the category and position coordinates of the target to be detected can be accurately output, realizing detection of the target to be detected and improving detection accuracy.
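For illustration, Faster RCNN (one detector the embodiments name) bundles these three modules: a backbone for feature extraction, a region proposal network for candidate frames, and an ROI head for positioning and classification. A minimal inference sketch, assuming torchvision's implementation rather than the patent's own:

```python
import torch
import torchvision

# backbone = feature extraction, RPN = candidate-frame generation,
# ROI head = positioning and classification
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 600, 800)        # stand-in for an acquired image
with torch.no_grad():
    out = model([image])[0]
# boxes are upper-left / lower-right corner coordinates (x1, y1, x2, y2)
print(out["boxes"].shape, out["labels"], out["scores"])
```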
To further understand the above embodiments, a schematic flow chart of a two-stage semi-supervised target detection method based on pseudo-label improvement is provided as shown in fig. 5, including the following steps. In the first stage, the training set of a target detection data set is randomly divided into a 10% labeled data set (corresponding to the set of first images) and a 90% unlabeled data set (corresponding to the set of second images); a supervised model (corresponding to the intermediate model) is trained on the labeled data set using the Faster RCNN target detection algorithm, usually after performing data enhancement processing on the labeled images, until the supervised model converges.
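The 10%/90% split can be sketched as follows (illustrative code, not from the patent):

```python
import random

def split_training_set(samples, labeled_fraction=0.1, seed=0):
    """Stage-one split: a random 10% of the training set keeps its labels
    (first images); the remaining 90% is treated as unlabeled (second images)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * labeled_fraction)
    return shuffled[:cut], shuffled[cut:]  # (labeled set, unlabeled set)
```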
In the second stage, the supervised model is used to perform inference on the unlabeled images to obtain the predicted target position, category and confidence of each target object in the unlabeled images, and a confidence threshold is used to filter out unreliable predictions, yielding pseudo labels for at least a part of the unlabeled data (corresponding to inputting the second image into the intermediate model to output a third prediction result of each second target object in the second image, where the third prediction result comprises the position prediction result, category prediction result and confidence of each second target object, and deleting prediction results whose confidence is smaller than the preset confidence threshold from the third prediction results to obtain the first prediction result of the specified target object in the second image).
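The thresholding step is a one-line filter; a sketch, where `predictions` is assumed to be a list of (box, category, confidence) triples from the supervised model and 0.9 is a purely illustrative threshold:

```python
def filter_pseudo_labels(predictions, confidence_threshold=0.9):
    """Keep only predictions whose confidence meets the threshold."""
    return [(box, cat, conf) for box, cat, conf in predictions
            if conf >= confidence_threshold]
```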
A composite image and new pseudo labels are then generated using the obtained pseudo labels of at least part of the unlabeled data and a background image with the foreground removed from the labeled data (corresponding to merging, based on the position label and category label of the first target object and the first prediction result, the second sub-image region corresponding to the specified target object into the first sub-image region of the first image other than the first target object to obtain the composite image, where the composite image carries the category prediction result of the specified target object and its corresponding position label in the composite image). A semi-supervised model is then trained on the mixed data set of labeled images and composite images, using a flipping-and-rotation data enhancement strategy and a semi-supervised learning loss function, again with the Faster RCNN target detection algorithm, until the model converges (corresponding to training the intermediate model based on the first image and the composite image to obtain the target detection model).
By utilizing the two training stages and splicing the background of the labeled data with the foreground of the unlabeled data, the method can solve the problem of missed targets in the pseudo labels, thereby improving the quality of the labels and the performance of the finally trained semi-supervised model.
Referring to fig. 6, which shows a flow chart of composite image and pseudo-label improvement, the Cityscapes target detection data set may be selected, including picture and label data, with 10 detection categories such as cars, buses, people, bicycles and trucks. 10% of the data in the training set is randomly divided into the labeled data set, and the remaining 90% is used as unlabeled data, simulating a semi-supervised training setting. The 10% labeled data, after simple flipping and rotation data enhancement, is input into a Faster RCNN network and iteratively trained until the network converges, yielding a supervised target detection model. The images in the unlabeled data set are then input into this supervised target detection model to obtain pseudo labels for at least a part of the target objects in the images.
Next, the method for generating a composite image and its pseudo label is introduced. As shown in fig. 6, an image is randomly selected from the labeled data set, the image pixels of all foreground targets of the labeled image are removed, and the mean of the original pixels within each target frame is used to fill the pixel area of that foreground target, obtaining an image that contains only background pixels plus the position information of each foreground target. For each unlabeled image, the image pixel block of each target object in its pseudo label is cut from the unlabeled image according to the obtained pseudo label information, yielding the image pixel blocks inside all the pseudo-label target frames. For each image pixel block, target positions of the same category as the block are then selected in turn from the labeled image with the foreground removed. A composite image made in this way is more realistic: for example, a position where a vehicle originally appeared can now be filled with a vehicle, but not with targets of other categories such as traffic lights.
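A sketch of the mean-value fill, assuming NumPy-style image arrays and (category, box) tuples as an illustrative data layout:

```python
def remove_foreground(image, labeled_boxes):
    """Overwrite each labeled target frame with the mean pixel of its own
    region, leaving a background-only image plus the vacated positions."""
    background = image.copy()
    vacated = []
    for category, (x1, y1, x2, y2) in labeled_boxes:
        region = background[y1:y2, x1:x2]
        mean_pixel = region.mean(axis=(0, 1), keepdims=True)
        background[y1:y2, x1:x2] = mean_pixel.astype(background.dtype)
        vacated.append((category, (x1, y1, x2, y2)))  # kept for same-category filling
    return background, vacated
```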
Before an image pixel block is filled into the labeled image with the foreground removed, it is judged whether the block would exceed the image boundary of the labeled image; if it would, the next same-category target position is selected until a suitable filling position is found, the image pixel block is filled into that target position (i.e., the target object corresponding to the pseudo label is filled into the target position), and its category and position information are recorded as the pseudo label of the composite image. If no suitable position exists, filling of the target object corresponding to that pseudo label is abandoned, and target frames continue to be selected in turn from the unlabeled image. If no target frame remains selectable, all image pixel blocks in the target frames corresponding to the pseudo labels have been traversed, and the composite image and its new pseudo label are output; if a target frame is still selectable, the same-category target positions continue to be selected in turn, and the boundary judgment is repeated.
Referring to the schematic diagram of a composite image shown in fig. 7, an image of a car is attached at target position A and an image of a pedestrian at target position B, where the original target object at target position A also belonged to the car category and the original target object at target position B also belonged to the pedestrian category; the areas at target positions A and B are thus each filled with pixels of the same target type.
In the related art, if a semi-supervised model is trained directly with the pseudo labels of the unlabeled data, a considerable number of missed targets remain in the background of the generated pseudo-label data, because those pseudo labels are generated by the supervised target detection model; a semi-supervised model trained with such pseudo labels treats foreground as background, harming its performance indexes. In addition, threshold-based filtering may itself cause missed detections. Therefore, the invention generates new images and their pseudo labels by the composite image method, which can alleviate the poor semi-supervised performance caused by low-quality pseudo labels. The two-stage semi-supervised target detection method can meet actual detection requirements with only a small amount of labeled data. The semi-supervised target detection method based on data enhancement and pseudo labels can effectively improve model accuracy without increasing labeling cost, and has strong practicability and feasibility. Meanwhile, the cost of data labeling is reduced, and research-and-development efficiency in the automatic driving field is improved; by combining labeled and unlabeled data, the semi-supervised target detection algorithm can make full use of the useful information in unlabeled data, so the semi-supervised model achieves higher prediction accuracy than a supervised model.
The method fuses the pseudo labels of the unlabeled data with the background of the labeled data to generate new composite images and improved pseudo labels for semi-supervised training. In the composite image generation method, only target positions of the same category in the background image are selected for filling, and the vacated areas are filled with mean values, making the manufactured false image more realistic. The improvements to the composite image and pseudo label reduce the parameter-tuning complexity of the confidence threshold and avoid the missed detections (which hurt model training) caused by an overly high threshold, thereby improving pseudo-label quality and model performance. Without the composite image, an overly high threshold causes missed detections while an overly low threshold introduces many false detections, making parameter tuning troublesome. With the composite image, missed detections from a high threshold are no longer a concern, because the targets pasted into the composite image are targets with high confidence; and since the foreground of the labeled image is replaced with the pixel mean, the background of the composite image is effectively a pure background, so the pseudo label of the composite image is essentially equivalent to a manual label.
For example, if the unlabeled image originally has 10 targets of which 6 are detected and 4 are missed, attaching the 6 detected targets to a new background image yields a composite image containing 6 targets with pseudo labels for all 6, which corresponds to no missed detection. In addition, the Focal Loss function is introduced into semi-supervised learning, which can reduce the data imbalance problems in the semi-supervised model, namely foreground-background imbalance and category imbalance, weaken the dominance of high-proportion categories over the model, and greatly reduce the bias of the model.
In addition, as an alternative, the Faster RCNN target detection algorithm may be replaced by any other model, such as SSD (Single Shot MultiBox Detector) or YOLOv3 (a target detection algorithm); the method is not limited to these network models. The background of the composite image may also come from background pictures of other data sets rather than only the background of the same data set, and image pixel blocks of targets from labeled images may also be added as filled targets. The filled target pixel blocks may additionally undergo data enhancement such as size scaling and flipping before filling, which can avoid the problem of a pixel block exceeding the image boundary.
The effectiveness of this approach was verified experimentally on the Cityscapes data set. Compared with a model trained with supervised learning, the average precision (AP) of the semi-supervised model on the same verification set is improved by 2.7%, and the detection rates of traffic lights and traffic signs are obviously improved. In addition, since the labeled data set is smaller than the unlabeled data set, labeled images can be reused; for example, one labeled image can be paired with 10 unlabeled images to obtain the composite images corresponding to those 10 unlabeled images.
An embodiment of the present invention provides a schematic structural diagram of a training apparatus for a target detection model, as shown in fig. 8, the apparatus includes: a first obtaining module 80 for obtaining a first image containing a first target object, a second image containing a second target object, and an intermediate model; the first image carries a position label and a category label of a first target object; the second image does not carry the position label and the category label of the second target object; the intermediate model is obtained by pre-training based on the first image; a first output module 81, configured to input the second image to the intermediate model, and output a first prediction result of the specified target object in the second image; wherein the first prediction result comprises a category prediction result and a position prediction result of the specified target object; a merging module 82, configured to merge a second sub-image region corresponding to the specified target object into the first image except the first sub-image region of the first target object based on the position tag and the category tag of the first target object and the first prediction result to obtain a composite image; the synthetic image carries a category prediction result of the specified target object and a corresponding position label of the specified target object in the synthetic image; and the training module 83 is configured to train the intermediate model based on the first image and the synthetic image to obtain a target detection model.
The training device of the target detection model acquires a first image containing a first target object, a second image containing a second target object and an intermediate model; inputting the second image into the intermediate model, and outputting a first prediction result of a specified target object in the second image; merging a second sub-image area corresponding to the specified target object into a first sub-image area except the first target object in the first image to obtain a composite image based on the position label and the category label of the first target object and the first prediction result; and training the intermediate model based on the first image and the synthetic image to obtain a target detection model. The synthesized image in the method is obtained by combining a second sub-image area corresponding to the specified target object in the second image with a first sub-image area except the first target object in the first image, namely, the specified target object in the synthesized image is the detected target object in the second image, so that no missing detection target exists in the synthesized image, the quality of a label carried by the synthesized image is improved, and the performance of a finally trained target detection model is improved.
Further, the apparatus further comprises an intermediate model training module configured to: perform data enhancement processing on the first image and input the enhanced first image into the initial model, so that the initial model outputs a second prediction result of the first target object in the enhanced first image, the second prediction result including the position information and category information of the first target object; calculate a first loss value based on the second prediction result and a preset first loss function; update the weight parameters of the initial model based on the first loss value; and continue the step of performing data enhancement processing on the first image and inputting the enhanced first image into the initial model until the initial model converges, so as to obtain the intermediate model.
Further, the second image comprises a plurality of second target objects; the first output module is further configured to: inputting the second image into the intermediate model to output a third prediction result of each second target object in the second image through the intermediate model; wherein the third prediction result comprises: a position prediction result, a category prediction result and a confidence of each second target object; and deleting the prediction result with the confidence coefficient smaller than the preset confidence coefficient threshold value from the third prediction result to obtain a first prediction result of the specified target object in the second image.
Further, the merging module is further configured to: for each first target object in the first image, acquire the mean pixel value of the pixel area corresponding to the first target object; replace the pixel area of the first target object based on the mean pixel value to obtain a replacement image, wherein the replacement image includes the first sub-image area of the first image other than the first target objects, and the position label and category label of each first target object; acquire the second sub-image area corresponding to the specified target object based on the first prediction result of the specified target object in the second image; and merge the second sub-image area into the replacement image based on the position label and category label of the first target object and the first prediction result to obtain the composite image.
Further, the second image comprises a plurality of specified target objects, each having a corresponding second sub-image area; the merging module is further configured to: for the current second sub-image area, judge, based on the category label of the first target object and the category prediction result corresponding to the current second sub-image area, whether a target position belonging to the same category as the specified target object corresponding to the current second sub-image area exists in the replacement image; if such a target position exists, place the current second sub-image area at the target position and save the position label of the target position and the category prediction result of the current second sub-image area; if no such target position exists, take the next second sub-image area as the new current second sub-image area and continue the judging step until the plurality of second sub-image areas in the second image have been traversed, so as to obtain the composite image.
Further, the merging module is further configured to: if a target position belonging to the same category as the specified target object corresponding to the current second sub-image area exists in the replacement image, judge whether the area size of the current second sub-image area exceeds the boundary area of the replacement image; if it does not, place the current second sub-image area at the target position; if it does, continue the step of judging whether a same-category target position exists in the replacement image until the current second sub-image area is placed at a target position.
Further, the training module is further configured to: respectively performing data enhancement processing on the first image and the synthetic image, inputting the enhanced first image and the enhanced synthetic image into the intermediate model, outputting a fourth prediction result of the first target object in the enhanced first image through the intermediate model, and outputting a fifth prediction result of the specified target object in the enhanced synthetic image; the fourth prediction result comprises position prediction information and category prediction information of the first target object; the fifth prediction result includes position prediction information and category prediction information of the specified target object; calculating a second loss value based on the fourth prediction result, the fifth prediction result and a preset second loss function; updating the weight parameters of the intermediate model based on the second loss values; and continuing to perform the step of respectively performing data enhancement processing on the first image and the synthetic image until the intermediate model converges to obtain the target detection model.
The implementation principle and the generated technical effect of the training device of the target detection model provided by the embodiment of the invention are the same as those of the embodiment of the training method of the target detection model, and for the sake of brief description, corresponding contents in the embodiment of the training method of the target detection model can be referred to where the embodiment of the training device of the target detection model is not mentioned.
An embodiment of the present invention further provides a schematic structural diagram of a target detection apparatus, as shown in fig. 9, the apparatus includes: a second obtaining module 90, configured to obtain an image including a target to be detected; the second output module 91 is configured to input the image into a pre-trained target detection model, and output a detection result of a target to be detected; the detection result comprises the category and the position coordinate of the target to be detected; the pre-trained target detection model is obtained through training of a training device of the target detection model.
The target detection device firstly acquires an image containing a target to be detected; then inputting the image into a pre-trained target detection model, and outputting a detection result of the target to be detected; the pre-trained target detection model is obtained by training based on the method in the embodiment, so that the performance of the target detection model is better, and the accuracy of the detection result of the target to be detected in the image can be improved.
Further, the target detection model includes a feature extraction module, a region generation module and a positioning classification module; the second output module is further configured to: input the image to the feature extraction module to output the target features of the target to be detected through the feature extraction module; input the target features into the area generation network to output candidate frames containing the target to be detected; and input the target features and the candidate frames into the positioning classification module to output the category and position coordinates of the target to be detected.
The implementation principle and the technical effect of the object detection device provided by the embodiment of the present invention are the same as those of the aforementioned embodiment of the object detection method, and for the sake of brief description, no mention is made in the embodiment of the object detection device, and reference may be made to the corresponding contents in the aforementioned embodiment of the object detection method.
An embodiment of the present invention further provides an electronic device, as shown in fig. 10, the electronic device includes a processor 130 and a memory 131, the memory 131 stores machine executable instructions capable of being executed by the processor 130, and the processor 130 executes the machine executable instructions to implement the above-mentioned training method for the object detection model, or the object detection method.
Further, the electronic device shown in fig. 10 further includes a bus 132 and a communication interface 133, and the processor 130, the communication interface 133, and the memory 131 are connected through the bus 132.
The Memory 131 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 133 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used. The bus 132 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 10, but this does not indicate only one bus or one type of bus.
The processor 130 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 130. The Processor 130 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 131, and the processor 130 reads the information in the memory 131 and completes the steps of the method of the foregoing embodiment in combination with the hardware thereof.
An embodiment of the present invention further provides a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions, and when the machine-executable instructions are called and executed by a processor, the machine-executable instructions cause the processor to implement the above-mentioned training method for the target detection model or the target detection method.
The training method for the target detection model, the target detection method, and the computer program product of the apparatus provided in the embodiments of the present invention include a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute the method described in the foregoing method embodiments, and specific implementations may refer to the method embodiments and are not described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (13)
1. A method for training an object detection model, the method comprising:
acquiring a first image containing a first target object, a second image containing a second target object and an intermediate model; the first image carries a position label and a category label of the first target object; the second image does not carry a position tag and a category tag of the second target object; the intermediate model is obtained by pre-training based on the first image;
inputting the second image into the intermediate model, and outputting a first prediction result of a specified target object in the second image; wherein the first prediction result comprises a category prediction result and a location prediction result of the specified target object;
merging a second sub-image area corresponding to the specified target object into the first image except the first sub-image area of the first target object based on the position label and the category label of the first target object and the first prediction result to obtain a composite image; the synthetic image carries the category prediction result of the specified target object and a corresponding position label of the specified target object in the synthetic image;
and training the intermediate model based on the first image and the composite image to obtain a target detection model.
2. The method of claim 1, wherein the intermediate model is trained by:
performing data enhancement processing on the first image, inputting the enhanced first image into an initial model, and outputting a second prediction result of the first target object in the enhanced first image through the initial model, wherein the second prediction result includes: location information and category information of the first target object;
calculating a first loss value of a second prediction result of the first target object based on the second prediction result and a preset first loss function;
updating a weight parameter of the initial model based on the first loss value; and continuing to perform data enhancement processing on the first image, and inputting the enhanced first image into an initial model until the initial model converges to obtain the intermediate model.
3. The method of claim 1, wherein a plurality of second target objects are included in the second image; the step of inputting the second image into the intermediate model and outputting a first prediction result of a specified target object in the second image comprises:
inputting the second image into the intermediate model to output a third prediction result of each second target object in the second image through the intermediate model; wherein the third prediction result comprises: a location prediction result, a category prediction result, and a confidence for each of the second target objects;
and deleting the prediction result with the confidence coefficient smaller than a preset confidence coefficient threshold value from the third prediction result to obtain a first prediction result of the specified target object in the second image.
4. The method according to claim 1, wherein the step of merging the second sub-image region corresponding to the specified target object into the first image except the first sub-image region of the first target object based on the position label and the category label of the first target object and the first prediction result to obtain the composite image comprises:
for each first target object in the first image, obtaining a pixel value mean value of a pixel area corresponding to the first target object;
replacing the pixel area of the first target object based on the pixel value mean value to obtain a replacement image; wherein the replacement image comprises: a first sub-image area of the first target object in the first image, and a position label and a category label of each first target object;
acquiring a second sub-image area corresponding to a specified target object based on a first prediction result of the specified target object in the second image;
merging the second sub-image region to the replacement image based on the position label and the category label of the first target object and the first prediction result to obtain a composite image.
5. The method of claim 4, wherein the second image comprises a plurality of designated target objects, each designated target object having a corresponding second sub-image region; the merging the second sub-image region into the replacement image based on the position label and the category label of the first target object and the first prediction result to obtain a composite image includes:
for a current second sub-image area, judging whether a target position of a specified target object corresponding to the current second sub-image area belongs to the same category exists in the replacement image or not based on the category label of the first target object and the category prediction result corresponding to the current second sub-image area;
if a target position of a specified target object corresponding to the current second sub-image area belongs to the same category exists in the replacement image, placing the current second sub-image area at the target position;
saving the position label of the target position and the category prediction result of the current second sub-image area;
if the target position of the specified target object corresponding to the current second sub-image area in the replacement image belongs to the same category does not exist, taking the next second sub-image area as a new current second sub-image area, and continuing to execute the step of judging whether the target position of the specified target object corresponding to the current second sub-image area in the replacement image belongs to the same category exists or not based on the category label of the first target object and the category prediction result corresponding to the current second sub-image area until the second sub-image areas in the second image are traversed completely to obtain the synthetic image.
6. The method of claim 5, wherein the step of placing the current second sub-image region at the target location if there is a target location in the replacement image for which a specified target object corresponding to the current second sub-image region belongs to the same category comprises:
if the target position of the designated target object corresponding to the current second sub-image area in the replacement image belongs to the same category, judging whether the area size of the current second sub-image area exceeds the boundary area of the replacement image;
if the area size of the current second sub-image area does not exceed the boundary area of the replacement image, placing the current second sub-image area at the target position;
and if the size of the current second sub-image area exceeds the boundary area of the replacement image, continuing to execute the step of judging whether a specified target object corresponding to the current second sub-image area exists in the replacement image and belongs to a target position of the same category until the current second sub-image area is placed at the target position.
7. The method of claim 1, wherein training the intermediate model based on the first image and the composite image to obtain a target detection model comprises:
respectively performing data enhancement processing on the first image and the synthetic image, inputting the enhanced first image and the synthetic image into the intermediate model, so as to output a fourth prediction result of the first target object in the enhanced first image through the intermediate model, and output a fifth prediction result of the specified target object in the enhanced synthetic image; wherein the fourth prediction result comprises location prediction information and category prediction information of the first target object; the fifth prediction result includes position prediction information and category prediction information of the specified target object;
calculating a second loss value based on the fourth prediction result, the fifth prediction result and a preset second loss function;
updating a weight parameter of the intermediate model based on the second loss value; and continuing to perform the step of respectively performing data enhancement processing on the first image and the synthetic image until the intermediate model converges to obtain the target detection model.
8. A method of object detection, the method comprising:
acquiring an image containing a target to be detected;
inputting the image into a pre-trained target detection model, and outputting a detection result of the target to be detected; the detection result comprises the category and the position coordinate of the target to be detected; the pre-trained object detection model is trained by the method of any one of claims 1-7.
9. The method of claim 8, wherein the object detection model comprises: a feature extraction module, a region generation module and a positioning classification module;
the step of inputting the image into a pre-trained target detection model and outputting the detection result of the target to be detected comprises the following steps:
inputting the image to the feature extraction module so as to output the target feature of the target to be detected through the feature extraction module;
inputting the target characteristics into the area generation network so as to output a candidate frame containing the target to be detected through the area generation network;
and inputting the target feature and the candidate box into the positioning classification module so as to output the category and the position coordinate of the target to be detected through the positioning classification module.
10. An apparatus for training an object detection model, the apparatus comprising:
a first acquisition module for acquiring a first image containing a first target object, a second image containing a second target object, and an intermediate model; the first image carries a position label and a category label of the first target object; the second image does not carry a position tag and a category tag of the second target object; the intermediate model is obtained by pre-training based on the first image;
a first output module, configured to input the second image to the intermediate model, and output a first prediction result of a specified target object in the second image; wherein the first prediction result comprises a category prediction result and a location prediction result of the specified target object;
a merging module, configured to merge a second sub-image region corresponding to the specified target object into a first sub-image region of the first image, except the first target object, based on the position tag and the category tag of the first target object and the first prediction result, to obtain a synthesized image; the synthetic image carries the category prediction result of the specified target object and a corresponding position label of the specified target object in the synthetic image;
and the training module is used for training the intermediate model based on the first image and the synthetic image to obtain a target detection model.
11. An object detection apparatus, characterized in that the apparatus comprises:
the second acquisition module is used for acquiring an image containing a target to be detected;
the second output module is used for inputting the image into a pre-trained target detection model and outputting a detection result of the target to be detected; the detection result comprises the category and the position coordinate of the target to be detected; the pre-trained target detection model is obtained through training of a training device of the target detection model.
12. An electronic device, comprising a processor and a memory, the memory storing machine-executable instructions executable by the processor, wherein the processor executes the machine-executable instructions to implement the method of training a target detection model according to any one of claims 1 to 7, or the target detection method according to any one of claims 8 to 9.
13. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of training a target detection model according to any one of claims 1 to 7, or the target detection method according to any one of claims 8 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110663377.5A CN113269267B (en) | 2021-06-15 | 2021-06-15 | Training method of target detection model, target detection method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110663377.5A CN113269267B (en) | 2021-06-15 | 2021-06-15 | Training method of target detection model, target detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113269267A true CN113269267A (en) | 2021-08-17 |
CN113269267B CN113269267B (en) | 2024-04-26 |
Family
ID=77235056
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110663377.5A Active CN113269267B (en) | 2021-06-15 | 2021-06-15 | Training method of target detection model, target detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113269267B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114186615A (en) * | 2021-11-22 | 2022-03-15 | 浙江华是科技股份有限公司 | Semi-supervised online training method and device for ship detection and computer storage medium |
CN114581350A (en) * | 2022-02-23 | 2022-06-03 | 清华大学 | Semi-supervised learning method suitable for monocular 3D target detection task |
CN114998691A (en) * | 2022-06-24 | 2022-09-02 | 浙江华是科技股份有限公司 | Semi-supervised ship classification model training method and device |
WO2023142452A1 (en) * | 2022-01-26 | 2023-08-03 | 郑州云海信息技术有限公司 | Model training method, railway catenary anomaly detection method, and related apparatus |
CN117132810A (en) * | 2023-08-14 | 2023-11-28 | 华润数字科技有限公司 | Target detection method, model training method, device, equipment and storage medium |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10528812B1 (en) * | 2019-01-29 | 2020-01-07 | Accenture Global Solutions Limited | Distributed and self-validating computer vision for dense object detection in digital images |
CN112734641A (en) * | 2020-12-31 | 2021-04-30 | 百果园技术(新加坡)有限公司 | Training method and device of target detection model, computer equipment and medium |
Non-Patent Citations (1)
Title |
---|
LI Wenbin; HE Ran: "Aircraft Target Detection in Remote Sensing Images Based on Deep Neural Networks", Computer Engineering, No. 07 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114186615A (en) * | 2021-11-22 | 2022-03-15 | 浙江华是科技股份有限公司 | Semi-supervised online training method and device for ship detection and computer storage medium |
CN114186615B (en) * | 2021-11-22 | 2022-07-08 | 浙江华是科技股份有限公司 | Semi-supervised online training method and device for ship detection and computer storage medium |
WO2023142452A1 (en) * | 2022-01-26 | 2023-08-03 | 郑州云海信息技术有限公司 | Model training method, railway catenary anomaly detection method, and related apparatus |
CN114581350A (en) * | 2022-02-23 | 2022-06-03 | 清华大学 | Semi-supervised learning method suitable for monocular 3D target detection task |
CN114581350B (en) * | 2022-02-23 | 2022-11-04 | 清华大学 | Semi-supervised learning method suitable for monocular 3D target detection task |
CN114998691A (en) * | 2022-06-24 | 2022-09-02 | 浙江华是科技股份有限公司 | Semi-supervised ship classification model training method and device |
CN117132810A (en) * | 2023-08-14 | 2023-11-28 | 华润数字科技有限公司 | Target detection method, model training method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113269267B (en) | 2024-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113269267B (en) | Training method of target detection model, target detection method and device | |
CN111368687B (en) | Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation | |
JP2020513133A (en) | Image quality evaluation method and apparatus | |
CN112132156A (en) | Multi-depth feature fusion image saliency target detection method and system | |
CN103116984B (en) | Method for detecting parking offenses | |
CN111563516B (en) | Method, terminal and storage medium for fusion display of pedestrian mask and three-dimensional scene | |
CN112990065B (en) | Vehicle classification detection method based on optimized YOLOv5 model | |
CN110956081B (en) | Method and device for identifying position relationship between vehicle and traffic marking and storage medium | |
CN112528917A (en) | Zebra crossing region identification method and device, electronic equipment and storage medium | |
CN115131797A (en) | Scene text detection method based on feature enhancement pyramid network | |
Shomee et al. | License plate detection and recognition system for all types of Bangladeshi vehicles using multi-step deep learning model | |
CN116129291A (en) | Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device | |
Fernando et al. | Automatic road traffic signs detection and recognition using 'You Only Look Once' version 4 (YOLOv4) | |
Krešo et al. | Robust semantic segmentation with ladder-densenet models | |
CN115171034A (en) | Road foreign matter detection method, and method and device for detecting foreign matters in scene | |
CN114937248A (en) | Vehicle tracking method and device for cross-camera, electronic equipment and storage medium | |
US9633283B1 (en) | Adaptive device and adaptive method for classifying objects with parallel architecture | |
CN113160217B (en) | Method, device, equipment and storage medium for detecting circuit foreign matters | |
CN110969065B (en) | Vehicle detection method and device, front vehicle anti-collision early warning device and storage medium | |
CN114898306A (en) | Method and device for detecting target orientation and electronic equipment | |
Chiu et al. | A Two-stage Learning Approach for Traffic Sign Detection and Recognition. | |
Tripathy et al. | Image-Based Pothole Detection System Using YoloV8 Algorithm | |
CN113139072A (en) | Data labeling method and device and electronic equipment | |
CN114627439A (en) | Moving object detection method based on 360-degree look-around camera | |
Ashan et al. | Recognition of Vehicle License Plates using MATLAB | |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |