CN113869361A - Model training method, target detection method and related device - Google Patents


Info

Publication number
CN113869361A
Authority
CN
China
Prior art keywords
input image
training
image
information corresponding
feature
Prior art date
Legal status
Pending
Application number
CN202110963178.6A
Other languages
Chinese (zh)
Inventor
陈海波
罗志鹏
Current Assignee
Shenyan Technology Beijing Co ltd
Original Assignee
Shenyan Technology Beijing Co ltd
Priority date
Filing date
Publication date
Application filed by Shenyan Technology Beijing Co ltd
Priority to CN202110963178.6A
Publication of CN113869361A
Legal status: Pending

Classifications

    • G06F 18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2415 Pattern recognition: classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045 Neural networks: combinations of networks
    • G06N 3/047 Neural networks: probabilistic or stochastic networks
    • G06N 3/08 Neural networks: learning methods


Abstract

The application provides a model training method, a target detection method and a related device. The model training method is used for training a preset deep neural network that comprises a prediction module; the prediction module uses Cascade RCNN and uses CBNet as the feature extraction network of the Cascade RCNN. The model training method comprises the following steps: acquiring a training data set, wherein each training datum in the training data set comprises a training image and label detection information corresponding to the training image, the label detection information comprising label classification information and label bounding box information corresponding to the training image; and training the preset deep neural network with the training data set to obtain a target detection model. The target detection model obtained by this model training method is more stable, has higher accuracy and is applicable to a wider range of scenarios.

Description

Model training method, target detection method and related device
Technical Field
The present application relates to the field of deep learning technologies, and in particular, to a model training method, a target detection method, and a related apparatus.
Background
Target detection is currently a very active research direction in the field of computer vision and an important link in fields such as unmanned driving.
The prior art CN110942000A discloses a method for detecting unmanned vehicle targets based on deep learning. The method samples a target object by generating a three-dimensional template of the target object and generates candidate frames for an input image by combining the generated three-dimensional template with an object sampling strategy; extracts features from the generated candidate frames to construct an objective function; based on the obtained objective function, trains its weights and preliminarily detects the target object with a structured support vector machine classifier; improves the region candidate network to construct an efficient HRPN network; and trains a Fast RCNN detection model based on the constructed HRPN network, where the preliminary detection results obtained by the structured support vector machine classifier are input into the network for training, and after training the model parameter information and structure information are saved for target detection.
However, the above target detection method still has a critical problem: its stability is poor and its accuracy is low, which limits its range of application.
Disclosure of Invention
The application aims to provide a model training method, a target detection method and a related device, so that a target detection model obtained through training is more stable, higher in accuracy and wide in application range.
The purpose of the application is realized by adopting the following technical scheme:
in a first aspect, the present application provides a model training method for training a preset deep neural network, where the preset deep neural network includes a prediction module, the prediction module uses Cascade RCNN and uses CBNet as the feature extraction network of the Cascade RCNN, and the model training method includes: acquiring a training data set, wherein each training datum in the training data set comprises a training image and label detection information corresponding to the training image, and the label detection information corresponding to the training image comprises label classification information and label bounding box information corresponding to the training image; and training the preset deep neural network with the training data set to obtain a target detection model.
The beneficial effect of this technical scheme is that CBNet is used as the feature extraction network of the Cascade RCNN; CBNet has a stronger feature extraction capability and higher precision, and can therefore be applied to more scenes. A target detection model trained by this method is more stable, more accurate and more widely applicable when executing target detection tasks.
In some optional embodiments, the preset deep neural network further includes a data augmentation module, and the training the preset deep neural network with the training data set to obtain a target detection model includes: inputting at least one training image into the data augmentation module to obtain an augmented image corresponding to the at least one training image; taking at least one training image and its corresponding label detection information as a source domain, taking the augmented image corresponding to at least one training image as an augmented domain, and training the preset deep neural network with the source domain and the augmented domain so as to reduce the data distribution difference between the augmented domain and the source domain; acquiring label detection information of the augmented image corresponding to at least one training image; acquiring a target domain; taking the augmented image corresponding to at least one training image and the label detection information corresponding to the augmented image as a new augmented domain, and training the preset deep neural network with the new augmented domain and the target domain to reduce the data distribution difference between the augmented domain and the target domain; and taking the trained preset deep neural network as the target detection model.
The training images may be training images obtained under specific weather conditions, with occlusions, under road congestion, or the like. The advantage of this technical scheme is that data augmentation of the training images by the data augmentation module diversifies the training data set as much as possible, so that the trained target detection model has strong generalization capability, is also suitable for training images obtained under specific weather, occlusion or road congestion conditions, and has a wide application range; training the preset deep neural network with the source domain and the augmented domain reduces the data distribution difference between the augmented domain and the source domain; and training the preset deep neural network with the new augmented domain and the target domain reduces the data distribution difference between the augmented domain and the target domain.
In some optional embodiments, the data augmentation module is a generator, the preset deep neural network further includes a feature extraction module, a gradient inversion layer, and a domain discriminator, and the training of the preset deep neural network includes: inputting an input image into the feature extraction module to obtain first feature information and second feature information corresponding to the input image, wherein the input image has corresponding label detection information or does not have corresponding label detection information; inputting first feature information corresponding to the input image into the prediction module to obtain prediction detection information corresponding to the input image, wherein the prediction detection information corresponding to the input image comprises prediction classification information and prediction bounding box information corresponding to the input image; when the input image has corresponding label detection information, training the prediction module based on the label detection information and the prediction detection information corresponding to the input image; inputting the first feature information and second feature information corresponding to the input image into the gradient inversion layer to obtain gradient inversion information corresponding to the input image; inputting the gradient inversion information corresponding to the input image into the domain discriminator to obtain domain discrimination information corresponding to the input image; and training the generator and the domain discriminator in an adversarial learning manner based on the domain discrimination information corresponding to the input image.
The advantage of this technical scheme is that the first feature information and the second feature information corresponding to the input image are obtained by the feature extraction module and sent to the gradient inversion layer and the domain discriminator, the generator and the domain discriminator are trained by adversarial learning, domain-invariant features are learned adversarially, and the adversarial robustness of the preset deep neural network is improved.
In some alternative embodiments, the prediction module comprises a feature extraction network and a double-head structure; the inputting the first feature information corresponding to the input image into the prediction module to obtain the prediction detection information corresponding to the input image includes: inputting the first feature information corresponding to the input image into the feature extraction network to obtain feature extraction information corresponding to the input image; and inputting the feature extraction information corresponding to the input image into the double-head structure to obtain the prediction detection information corresponding to the input image.
The technical scheme has the advantages that the first feature information corresponding to the input image is input into the feature extraction network to obtain the feature extraction information corresponding to the input image, the feature extraction information corresponding to the input image obtained by the feature extraction network is input into the double-head structure, and the prediction detection information corresponding to the input image can be obtained.
In some optional embodiments, the inputting the first feature information corresponding to the input image into the feature extraction network to obtain the feature extraction information corresponding to the input image includes: scaling the longer of the width and the height of the input image to a preset length value, and scaling the shorter of the width and the height of the input image to any value within a preset length range; determining a plurality of input images including the input image; taking the maximum value of the short sides of the plurality of input images as a reference value and padding the short sides of the remaining input images to the reference value; and inputting the plurality of input images into the feature extraction network as a batch to obtain feature extraction information corresponding to the plurality of input images, wherein the feature extraction information corresponding to the plurality of input images comprises the feature extraction information corresponding to the input image.
The advantage of this technical scheme is that spatial-level image enhancement can be performed on a plurality of images in the data set in batch form to remove image noise without damaging the structural information of the original images.
In some optional embodiments, the feature extraction network comprises Stage1, Stage2, Stage3, Stage4, Stage1_1, Stage2_2, Stage3_3, Stage4_4 and a first up-sampling unit to a third up-sampling unit, and the inputting the first feature information corresponding to the input image into the feature extraction network to obtain the feature extraction information corresponding to the input image includes: inputting the first feature information corresponding to the input image into Stage1 to obtain a feature map F1 corresponding to the input image; inputting the feature map F1 into Stage1_1 to obtain a feature map F2; inputting the feature map F1 into Stage2 to obtain a feature map F3; adding the feature map F3 and the feature map F2 and inputting the sum into Stage2_2 to obtain a feature map F4; inputting the feature map F3 into Stage3 to obtain a feature map F5; adding the feature map F5 and the feature map F4 and inputting the sum into Stage3_3 to obtain a feature map F6; inputting the feature map F5 into Stage4 to obtain a feature map F7; adding the feature map F7 and the feature map F6 and inputting the sum into Stage4_4 to obtain a feature map F8, and taking the feature map F8 as a fusion feature M3 corresponding to the input image; inputting the feature map F8 into the third up-sampling unit to obtain an up-sampling result of the feature map F8, and adding this up-sampling result and the feature map F6 to obtain a fusion feature M2 corresponding to the input image; inputting the fusion feature M2 into the second up-sampling unit to obtain an up-sampling result of the fusion feature M2, and adding this up-sampling result and the feature map F4 to obtain a fusion feature M1 corresponding to the input image; inputting the fusion feature M1 into the first up-sampling unit to obtain an up-sampling result of the fusion feature M1, and adding this up-sampling result and the feature map F2 to obtain a fusion feature M0 corresponding to the input image; and taking the fusion features M3, M2, M1 and M0 corresponding to the input image as the feature extraction information corresponding to the input image.
The technical scheme has the beneficial effect that a plurality of fusion features corresponding to the input image are used as feature extraction information corresponding to the input image.
In some optional embodiments, the double-head structure includes a convolution layer, a first-stage network and a second-stage network, the first-stage network includes a bounding box extraction unit, a two-class network and a first-stage regression network, the second-stage network includes a first multi-classification network to a third multi-classification network and a first regression network to a third regression network, and the inputting the feature extraction information corresponding to the input image into the double-head structure to obtain the prediction detection information corresponding to the input image includes: inputting the feature extraction information corresponding to the input image into the convolution layer to obtain a convolution result corresponding to the input image; inputting the convolution result corresponding to the input image into the bounding box extraction unit to obtain first-stage bounding box information corresponding to the input image; acquiring second-stage bounding box information corresponding to the input image by using the first-stage bounding box information corresponding to the input image, the two-class network and the first-stage regression network; acquiring first bounding box information corresponding to the input image by using the second-stage bounding box information corresponding to the input image, the first multi-classification network and the first regression network; acquiring second bounding box information corresponding to the input image by using the first bounding box information corresponding to the input image, the second multi-classification network and the second regression network; and acquiring the prediction detection information corresponding to the input image by using the second bounding box information corresponding to the input image, the third multi-classification network and the third regression network.
The advantage of this technical scheme is that the classification task usually needs more image semantic information while the regression task needs more spatial information; the double-head structure takes these different requirements into account, so its effect is more pronounced.
In a second aspect, the present application provides a target detection method, including: acquiring an image to be detected; inputting the image to be detected into a target detection model to obtain the corresponding prediction detection information of the image to be detected; the target detection model is obtained by training by using any one of the model training methods.
The technical scheme has the advantages that the image to be detected is input into the target detection model, and the prediction detection information corresponding to the image to be detected can be accurately and stably obtained.
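As a minimal illustration of this flow (the checkpoint path, image path and loading convention are placeholders, not part of the application), running a trained target detection model on an image to be detected might look like:

```python
import torch
from torchvision.io import read_image

model = torch.load("target_detection_model.pt")   # hypothetical saved target detection model
model.eval()

image = read_image("image_to_be_detected.jpg").float() / 255.0   # image to be detected
with torch.no_grad():
    prediction = model([image])   # prediction detection information: classes + bounding boxes
print(prediction)
```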
In a third aspect, the present application provides a model training apparatus for training a preset deep neural network, where the preset deep neural network includes a prediction module, the prediction module uses Cascade RCNN and uses CBNet as a feature extraction network of the Cascade RCNN, the model training apparatus includes:
the training data set part is used for acquiring a training data set, each piece of training data in the training data set comprises a training image and label detection information corresponding to the training image, and the label detection information corresponding to the training image comprises label classification information and label boundary box information corresponding to the training image;
and the model training part is used for training the preset deep neural network by using the training data set to obtain a target detection model.
In some optional embodiments, the preset deep neural network further includes a data augmentation module, and the model training part includes:
the augmented image module is used for inputting at least one training image into the data augmented module to obtain an augmented image corresponding to the at least one training image;
the first training module is used for taking at least one training image and its corresponding label detection information as a source domain, taking the augmented image corresponding to at least one training image as an augmented domain, and training the preset deep neural network with the source domain and the augmented domain so as to reduce the data distribution difference between the augmented domain and the source domain;
the annotation information acquisition module is used for acquiring annotation detection information of the augmented image corresponding to at least one training image;
the target domain acquiring module is used for acquiring a target domain;
the second training module is used for taking an augmented image corresponding to at least one training image and label detection information corresponding to the augmented image as a new augmented domain, and training the preset deep neural network by using the new augmented domain and the target domain so as to reduce the data distribution difference between the augmented domain and the target domain;
and the target detection module is used for taking the trained preset deep neural network as the target detection model.
In some optional embodiments, the data augmentation module is a generator, the preset deep neural network further includes a feature extraction module, a gradient inversion layer, and a domain discriminator, and the first training module and the second training module each include:
the characteristic extraction submodule is used for inputting an input image into the characteristic extraction module to obtain first characteristic information and second characteristic information corresponding to the input image, wherein the input image has corresponding label detection information or does not have corresponding label detection information;
the first prediction sub-module is used for inputting first feature information corresponding to the input image into the prediction module to obtain prediction detection information corresponding to the input image, wherein the prediction detection information corresponding to the input image comprises prediction classification information and prediction boundary box information corresponding to the input image;
the first training sub-module is used for training the prediction module based on the label detection information and the prediction detection information corresponding to the input image when the input image has the corresponding label detection information;
the gradient inversion submodule is used for inputting first characteristic information and second characteristic information corresponding to the input image into the gradient inversion layer to obtain gradient inversion information corresponding to the input image;
the domain identification submodule is used for inputting the gradient inversion information corresponding to the input image into the domain identifier to obtain domain identification information corresponding to the input image;
and the adversarial learning sub-module is used for training the generator and the domain discriminator by adversarial learning based on the domain discrimination information corresponding to the input image.
In some alternative embodiments, the prediction module comprises a feature extraction network and a dual-headed structure;
the first prediction sub-module includes:
the feature extraction unit is used for inputting first feature information corresponding to the input image into the feature extraction network to obtain feature extraction information corresponding to the input image;
and the double-head structure unit is used for inputting the feature extraction information corresponding to the input image into the double-head structure to obtain the prediction detection information corresponding to the input image.
In some optional embodiments, the feature extraction unit includes:
the image scaling subunit is used for scaling the long sides in the width and the height of the input image to preset length values and scaling the short sides in the width and the height of the input image to any value in a preset length range;
an image determining subunit configured to determine a plurality of input images including the input image;
an image filling subunit, configured to fill the short edges of the remaining input images to a reference value, where the reference value is a maximum value of the short edges in the plurality of input images;
and the batch input subunit is used for inputting the plurality of input images into the feature extraction network in a batch mode to obtain feature extraction information corresponding to the plurality of input images, wherein the feature extraction information corresponding to the plurality of input images comprises the feature extraction information corresponding to the input images.
In some optional embodiments, the feature extraction network comprises Stage1, Stage2, Stage3, Stage4, Stage1_1, Stage2_2, Stage3_3, Stage4_4 and first to third up-sampling units, and the feature extraction unit includes:
a first feature map subunit, configured to input first feature information corresponding to the input image to Stage1, so as to obtain a feature map F1 corresponding to the input image;
the second feature map subunit is configured to input the feature map F1 corresponding to the input image to Stage1_1, so as to obtain a feature map F2 corresponding to the input image;
a third feature map subunit, configured to input a feature map F1 corresponding to the input image into Stage2, so as to obtain a feature map F3 corresponding to the input image;
a fourth feature map subunit, configured to add the feature map F3 and the feature map F2 corresponding to the input image, and input the result to Stage2_2, so as to obtain a feature map F4 corresponding to the input image;
a fifth feature map subunit, configured to input the feature map F3 corresponding to the input image into Stage3, so as to obtain a feature map F5 corresponding to the input image;
a sixth feature map subunit, configured to add the feature map F5 and the feature map F4 corresponding to the input image, and input the result to Stage3_3 to obtain a feature map F6 corresponding to the input image;
a seventh feature map subunit, configured to input the feature map F5 corresponding to the input image into Stage4, so as to obtain a feature map F7 corresponding to the input image;
an eighth feature map subunit, configured to add the feature map F7 and the feature map F6 corresponding to the input image, and input the result to Stage4_4 to obtain a feature map F8 corresponding to the input image, and use the feature map F8 corresponding to the input image as the fusion feature M3 corresponding to the input image;
a third sampling subunit, configured to input the feature map F8 corresponding to the input image into the third upsampling unit, to obtain an upsampling result of the feature map F8 corresponding to the input image, and add the upsampling result of the feature map F8 corresponding to the input image and the feature map F6 corresponding to the input image, to obtain a fused feature M2 corresponding to the input image;
a second sampling subunit, configured to input the fusion feature M2 corresponding to the input image into a second upsampling unit, to obtain an upsampling result of the fusion feature M2 corresponding to the input image, and add the upsampling result of the fusion feature M2 corresponding to the input image and the feature map F4 corresponding to the input image, to obtain a fusion feature M1 corresponding to the input image;
a first sampling sub-unit, configured to input the fusion feature M1 corresponding to the input image into the first upsampling unit, to obtain an upsampling result of the fusion feature M1 corresponding to the input image, and add the upsampling result of the fusion feature M1 corresponding to the input image and the feature map F2 corresponding to the input image, to obtain a fusion feature M0 corresponding to the input image;
and the feature information subunit is used for taking the fusion feature M3 corresponding to the input image, the fusion feature M2 corresponding to the input image, the fusion feature M1 corresponding to the input image and the fusion feature M0 corresponding to the input image as feature extraction information corresponding to the input image.
In some optional embodiments, the double-head structure comprises a convolution layer, a first-stage network and a second-stage network, the first-stage network comprising a bounding box extraction unit, a two-class network and a first-stage regression network, the second-stage network comprising a first multi-classification network to a third multi-classification network and a first regression network to a third regression network, and the double-head structure unit comprises:
a convolution subunit, configured to input feature extraction information corresponding to the input image into the convolution layer, so as to obtain a convolution result corresponding to the input image;
the first bounding box subunit is used for inputting the convolution result corresponding to the input image into the bounding box extraction unit to obtain first-stage bounding box information corresponding to the input image;
the second bounding box subunit is configured to obtain second-stage bounding box information corresponding to the input image by using the first-stage bounding box information corresponding to the input image, the two-class network, and the first-stage regression network;
a first information subunit, configured to obtain first bounding box information corresponding to the input image by using second-stage bounding box information corresponding to the input image, the first multi-classification network, and the first regression network;
a second information subunit, configured to obtain second bounding box information corresponding to the input image by using the first bounding box information corresponding to the input image, the second multi-classification network, and the second regression network;
and the information prediction subunit is configured to acquire prediction detection information corresponding to the input image by using the second bounding box information corresponding to the input image, the third multi-classification network, and the third regression network.
In a fourth aspect, the present application provides an object detection apparatus comprising:
the image module to be detected is used for acquiring an image to be detected;
the image prediction module is used for inputting the image to be detected into a target detection model to obtain prediction detection information corresponding to the image to be detected;
the target detection model is obtained by training by using any one of the model training methods.
In a fifth aspect, the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of any one of the above model training methods or the above target detection method when executing the computer program.
In a sixth aspect, the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program or an object detection model;
the computer program when executed by a processor implementing the steps of any of the above described model training methods or the steps of the above described target detection methods;
the target detection model is obtained by training by any one of the model training methods.
Drawings
The present application is further described below with reference to the drawings and examples.
FIG. 1 is a schematic flow chart diagram illustrating a model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart illustrating a process for obtaining a target detection model according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of training a preset deep neural network according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of adapting to the augmented image through the existing annotation detection information according to an embodiment of the present application;
fig. 5 is a schematic flowchart of obtaining the prediction detection information according to an embodiment of the present application;
fig. 6 is a schematic flowchart of obtaining feature extraction information of a plurality of input images according to an embodiment of the present application;
fig. 7 is a schematic flowchart of obtaining feature extraction information according to an embodiment of the present application;
fig. 8 is a schematic flow chart of upsampling using CARAFE according to an embodiment of the present application;
fig. 9 is a schematic flowchart of obtaining feature extraction information according to an embodiment of the present application;
fig. 10 is a schematic flowchart illustrating a process of obtaining prediction detection information corresponding to an input image according to an embodiment of the present application;
FIG. 11 is a schematic flow chart of a dual head structure provided by an embodiment of the present application;
FIG. 12 is a schematic flow chart of another dual head structure provided by an embodiment of the present application;
FIG. 13 is a schematic flow chart of another dual head structure provided by an embodiment of the present application;
FIG. 14 is a schematic flow chart of another dual head structure provided by an embodiment of the present application;
fig. 15 is a schematic flowchart of a target detection method according to an embodiment of the present application;
FIG. 16 is a schematic flow chart illustrating a further method for detecting an object according to an embodiment of the present disclosure;
FIG. 17 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 18 is a schematic structural diagram of a model training section according to an embodiment of the present disclosure;
FIG. 19 is a schematic structural diagram of an adversarial training module according to an embodiment of the present application;
FIG. 20 is a block diagram of a first prediction sub-module according to an embodiment of the present disclosure;
fig. 21 is a schematic structural diagram of a feature extraction unit provided in an embodiment of the present application;
fig. 22 is a schematic structural diagram of another feature extraction unit provided in an embodiment of the present application;
FIG. 23 is a schematic structural diagram of a dual-head structural unit provided in an embodiment of the present application;
fig. 24 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application;
fig. 25 is a block diagram of an electronic device according to an embodiment of the present application;
fig. 26 is a schematic structural diagram of a program product for implementing a model training method or an object detection method according to an embodiment of the present application.
Detailed Description
The present application is further described with reference to the accompanying drawings and the detailed description, and it should be noted that, in the present application, the embodiments or technical features described below may be arbitrarily combined to form a new embodiment without conflict.
The terms "first," "second," "third," "fourth," "fifth," "sixth," "seventh," "eighth," "ninth," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, an embodiment of the present application provides a model training method, configured to train a preset deep neural network, where the preset deep neural network includes a prediction module, the prediction module uses Cascade RCNN and uses CBNet as a feature extraction network of the Cascade RCNN, and the model training method includes steps S101 to S102.
Step S101: acquiring a training data set, wherein each training data in the training data set comprises a training image and label detection information corresponding to the training image, and the label detection information corresponding to the training image comprises label classification information and label bounding box information corresponding to the training image.
Step S102: and training the preset deep neural network by using the training data set to obtain a target detection model.
The kind of target detection model obtained by training the preset deep neural network is not limited; it may be, for example, a target detection model for a driving scene, a target detection model for an unmanned aerial vehicle, and the like. The preset deep neural network includes a prediction module that uses Cascade RCNN and uses CBNet as the feature extraction network of the Cascade RCNN.
In a specific application scenario, the preset deep neural network is trained and the resulting target detection model is a target detection model for a driving scene. First, a training data set for training the preset deep neural network is acquired; each training datum in the training data set comprises a training driving scene image and label detection information corresponding to it, and the label detection information comprises obstacle classification information, obstacle bounding box information, road marking information and road marking bounding box information. The obstacle classification information marks the type of obstacle in the corresponding training driving scene image, and the obstacle bounding box information marks the bounding box of the obstacle in that image. The road marking information marks the type of road marking in the corresponding training driving scene image, and the road marking bounding box information marks the bounding box of the road marking in that image. The preset deep neural network is then trained with the training data set to obtain a target detection model for the driving scene. Compared with the original feature extraction network of Cascade RCNN, CBNet has a stronger feature extraction capability and higher precision, and can therefore be applied to more scenes.
Therefore, the target detection model obtained by training by the method is used for executing the target detection task, and is more stable, higher in accuracy and wide in application range.
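To make the flow of steps S101 to S102 concrete, the following is a minimal PyTorch-style sketch of the training loop. The constructor build_cascade_rcnn_with_cbnet and the DrivingSceneDataset class are assumed placeholders standing in for the preset deep neural network (Cascade RCNN with a CBNet feature extraction network) and for a data set of training images with label detection information; they are not part of the application.

```python
import torch
from torch.utils.data import DataLoader

# Assumed placeholders: a Cascade RCNN detector whose backbone is CBNet, and a dataset
# yielding (image, {"labels": ..., "boxes": ...}) pairs of training data.
from detector import build_cascade_rcnn_with_cbnet
from data import DrivingSceneDataset

def train_target_detection_model(num_epochs=12, lr=0.02):
    dataset = DrivingSceneDataset("train")                 # step S101: training data set
    loader = DataLoader(dataset, batch_size=2, shuffle=True,
                        collate_fn=lambda batch: tuple(zip(*batch)))
    model = build_cascade_rcnn_with_cbnet(num_classes=dataset.num_classes)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=1e-4)
    model.train()
    for epoch in range(num_epochs):                        # step S102: train the network
        for images, targets in loader:
            losses = model(images, targets)                # classification + bounding box losses
            loss = sum(losses.values())                    # summed over all cascade stages
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model                                           # the target detection model
```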
Referring to fig. 2, in some optional embodiments, the preset deep neural network further includes a data augmentation module, and the step S102 may include steps S201 to S206.
S201: and inputting at least one training image into the data augmentation module to obtain an augmentation image corresponding to at least one training image. The corresponding augmented images are obtained through the training image input data augmentation module, and training images similar to the training images but different from the training images can be generated, so that the scale of the training data set is enlarged.
S202: and taking at least one training image and corresponding label detection information thereof as a source domain, taking at least one augmented image corresponding to the training image as an augmented area, and training the preset deep neural network by using the source domain and the augmented area so as to reduce the data distribution difference between the augmented area and the source domain.
S203: and acquiring label detection information of the augmented image corresponding to at least one training image.
S204: and acquiring the target domain.
S205: and taking at least one augmented image corresponding to the training image and the corresponding label detection information thereof as a new augmented domain, and training the preset deep neural network by using the new augmented domain and the target domain so as to reduce the data distribution difference between the augmented domain and the target domain.
S206: and taking the trained preset deep neural network as the target detection model.
The training images may be training images obtained under specific weather conditions, with occlusions, under road congestion, or the like. Therefore, data augmentation of the training images by the data augmentation module diversifies the training data set as much as possible, so that the trained target detection model has strong generalization capability, is also suitable for training images obtained under specific weather, occlusion or road congestion conditions, and has a wide application range; training the preset deep neural network with the source domain and the augmented domain reduces the data distribution difference between the augmented domain and the source domain; and training the preset deep neural network with the new augmented domain and the target domain reduces the data distribution difference between the augmented domain and the target domain.
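A minimal sketch of this two-stage adaptation schedule is given below. The helpers augment, pseudo_label and adapt are assumptions standing in for the data augmentation module, the acquisition of label detection information for augmented images, and one round of training that reduces the data distribution difference between two domains; only the ordering of steps S201 to S206 is illustrated.

```python
def train_with_domain_adaptation(model, source_images, source_labels, target_images):
    # S201: generate an augmented image for every training image.
    augmented_images = [augment(img) for img in source_images]

    # S202: source domain = training images + label detection information;
    # augmented domain = augmented images. Align the augmented domain with the source domain.
    adapt(model, labeled=(source_images, source_labels), unlabeled=augmented_images)

    # S203: acquire label detection information for the augmented images
    # (hypothetically, e.g. by reusing the source labels or running the current model).
    augmented_labels = [pseudo_label(model, img) for img in augmented_images]

    # S204 + S205: the labelled augmented images form a new augmented domain;
    # align it with the (unlabelled) target domain.
    adapt(model, labeled=(augmented_images, augmented_labels), unlabeled=target_images)

    # S206: the trained network is the target detection model.
    return model
```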
Referring to fig. 3, in some alternative embodiments, the data augmentation module may be a generator, the preset deep neural network may further include a feature extraction module, a gradient inversion layer, and a domain discriminator, and the steps of training the preset deep neural network in steps S202 and S205 may include steps S301 to S306. The gradient inversion layer is a gradient reversal layer, abbreviated as GRL.
Step S301: inputting an input image into the feature extraction module to obtain first feature information and second feature information corresponding to the input image, wherein the input image has corresponding label detection information or does not have corresponding label detection information.
Step S302: and inputting the first characteristic information corresponding to the input image into the prediction module to obtain prediction detection information corresponding to the input image, wherein the prediction detection information corresponding to the input image comprises prediction classification information and prediction bounding box information corresponding to the input image.
Step S303: and when the input image has corresponding label detection information, training the prediction module based on the label detection information and the prediction detection information corresponding to the input image.
Step S304: and inputting the first characteristic information and the second characteristic information corresponding to the input image into the gradient inversion layer to obtain the gradient inversion information corresponding to the input image.
Step S305: and inputting the gradient inversion information corresponding to the input image into the domain discriminator to obtain the domain discrimination information corresponding to the input image.
Step S306: training the generator and the domain discriminator in a counterlearning manner based on domain discrimination information corresponding to the input image.
In a specific application scenario, the generator is a generator G learned through a cycle-consistent generative adversarial network (CycleGAN), and the domain discriminator is the discriminator Dcycle in the CycleGAN. Both the input images with label detection information and the input images without label detection information are passed through the feature extraction module to obtain the first feature information and the second feature information corresponding to each input image.
On one hand, the first feature information corresponding to the input image is input into the prediction module to obtain the prediction detection information corresponding to the input image, and the prediction module is trained through the annotation detection information and the prediction detection information corresponding to the input image. On the other hand, the obtained first feature information and second feature information are input into the gradient inversion layer to obtain the gradient inversion information corresponding to the input image. The gradient inversion information corresponding to the input image is input into the domain discriminator to obtain the domain discrimination information corresponding to the input image; based on the obtained domain discrimination information, the generator and the domain discriminator are trained by adversarial learning, domain-invariant features are learned adversarially, and the adversarial robustness of the preset deep neural network is improved.
Therefore, the first feature information and the second feature information corresponding to the input image are obtained by the feature extraction module and sent to the gradient inversion layer and the domain discriminator, the generator and the domain discriminator are trained by adversarial learning, domain-invariant features are learned adversarially, and the adversarial robustness of the preset deep neural network is improved.
Referring to fig. 4, in a specific application scenario, a source image is first transformed using a generator G learned through CycleGAN to generate a composite image. Thereafter, the labeled source domain is used and a first phase adaptation to the synthesized domain is performed. This is followed by a second stage of adaptation, which takes the labeled synthetic domain and aligns the synthetic domain features with the target distribution. In addition, the weight w is obtained from the discriminator Dcycle in the CycleGAN to balance the quality of the synthesized image in the detection loss, and the purpose of adapting to the augmented image through the existing label detection information is achieved.
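The gradient reversal layer (GRL) used here is small enough to show directly. The sketch below is a generic PyTorch implementation of a GRL together with a toy domain discriminator; the layer sizes, the lambda coefficient and the way the weight w from Dcycle scales the loss are illustrative assumptions rather than the exact configuration of this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reversed gradient flows back to the generator

class DomainDiscriminator(nn.Module):
    """Toy discriminator that predicts which domain a feature map comes from."""
    def __init__(self, in_channels, lam=1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 1))

    def forward(self, features):
        reversed_feats = GradientReversal.apply(features, self.lam)   # gradient reversal layer
        return self.net(reversed_feats)                               # domain discrimination logit

def domain_adversarial_loss(discriminator, features, domain_label, w=1.0):
    # w: per-image weight obtained from the CycleGAN discriminator Dcycle (scalar placeholder).
    logit = discriminator(features)
    target = torch.full_like(logit, float(domain_label))              # 0 = source, 1 = target
    return w * F.binary_cross_entropy_with_logits(logit, target)
```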
Referring to fig. 5, in some alternative embodiments, the prediction module may include a feature extraction network and a dual-headed structure, and the step S302 may include steps S401 to S402.
Step S401: and inputting the first feature information corresponding to the input image into the feature extraction network to obtain the feature extraction information corresponding to the input image.
Step S402: and inputting the feature extraction information corresponding to the input image into the double-head structure to obtain the prediction detection information corresponding to the input image.
Therefore, the first feature information corresponding to the input image is input into the feature extraction network to obtain the feature extraction information corresponding to the input image, and the feature extraction information corresponding to the input image obtained by the feature extraction network is input into the double-head structure to obtain the prediction detection information corresponding to the input image.
Referring to fig. 6, in some alternative embodiments, the step S401 may include steps S501 to S504.
Step S501: and scaling the long sides in the width and the height of the input image to preset length values, and scaling the short sides in the width and the height of the input image to any value in a preset length range.
Step S502: a plurality of input images including the input image is determined.
Step S503: and filling the short sides of the rest input images to the reference value by taking the maximum value of the short sides in the plurality of input images as the reference value.
Step S504: inputting the plurality of input images into the feature extraction network in a batch mode to obtain feature extraction information corresponding to the plurality of input images, wherein the feature extraction information corresponding to the plurality of input images comprises the feature extraction information corresponding to the input images.
In a specific application, taking the maximum value of the short sides of the plurality of input images as the reference value, the short sides of the remaining input images are padded up to the reference value, which can be expressed as:
S_base = Si + padding_i, i.e. padding_i = S_base − Si, where S_base = max(Si)
The plurality of input images are images randomly sampled from the data set. For each sampled image Ii, its own width Ii_w and height Ii_h are compared; the longer side max(Ii_w, Ii_h) is scaled to L and the shorter side min(Ii_w, Ii_h) is scaled to S, where S is chosen randomly between S1 and S2. When the sampled images Ii (i = 1, 2, 3 … n) are fed to the feature extraction network as a batch, the long side of every image in the batch is L and the short sides need to be made uniform in size: taking the maximum value max(Si) of the short sides Si (i = 1, 2, 3 … n) of the images in the whole batch as the reference S_base, every image whose short side is not the maximum is padded until its short side equals the reference S_base. Here L is 2048 and the short-side range S1–S2 is 768–1080.
Therefore, spatial-level image enhancement can be performed on a plurality of images in the data set in batch form, removing image noise without damaging the structural information of the original images.
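Under the assumption of PyTorch-style tensors, the resizing and padding rule of steps S501 to S504 can be written as a small collate routine; the concrete values L = 2048 and S1 to S2 = 768 to 1080 are the ones quoted above, and the helper names are illustrative.

```python
import random
import torch
import torch.nn.functional as F

L, S1, S2 = 2048, 768, 1080          # long-side target and short-side range from the text

def resize_keep_orientation(img):    # img: C x H x W tensor
    h, w = img.shape[-2:]
    s = random.randint(S1, S2)       # random short-side target in [S1, S2]
    new_h, new_w = (L, s) if h >= w else (s, L)   # long side -> L, short side -> s
    return F.interpolate(img[None], size=(new_h, new_w),
                         mode="bilinear", align_corners=False)[0]

def collate_batch(images):
    resized = [resize_keep_orientation(img) for img in images]
    # With a common long side L, padding every image to the per-dimension batch maximum
    # amounts to padding the short sides up to the reference value S_base = max(Si).
    max_h = max(img.shape[-2] for img in resized)
    max_w = max(img.shape[-1] for img in resized)
    padded = [F.pad(img, (0, max_w - img.shape[-1], 0, max_h - img.shape[-2]))
              for img in resized]
    return torch.stack(padded)       # one batch ready for the feature extraction network
```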
Referring to fig. 7-9, in some alternative embodiments, the feature extraction network may include Stage1, Stage2, Stage3, Stage4, Stage1_1, Stage2_2, Stage3_3, Stage4_4 and a first up-sampling unit to a third up-sampling unit, and the step S401 may include steps S601 to S612.
Each of Stage1, Stage2, Stage3, Stage4, Stage1_1, Stage2_2, Stage3_3 and Stage4_4 may have the same structure as the others or a different structure.
S601: inputting the first feature information corresponding to the input image into Stage1 to obtain a feature map F1 corresponding to the input image.
S602: inputting the feature map F1 corresponding to the input image into Stage1_1 to obtain a feature map F2 corresponding to the input image.
S603: inputting the feature map F1 corresponding to the input image into Stage2 to obtain a feature map F3 corresponding to the input image.
S604: adding the feature map F3 and the feature map F2 corresponding to the input image and inputting the sum into Stage2_2 to obtain a feature map F4 corresponding to the input image.
S605: inputting the feature map F3 corresponding to the input image into Stage3 to obtain a feature map F5 corresponding to the input image.
S606: adding the feature map F5 and the feature map F4 corresponding to the input image and inputting the sum into Stage3_3 to obtain a feature map F6 corresponding to the input image.
S607: inputting the feature map F5 corresponding to the input image into Stage4 to obtain a feature map F7 corresponding to the input image.
S608: adding the feature map F7 and the feature map F6 corresponding to the input image and inputting the sum into Stage4_4 to obtain a feature map F8 corresponding to the input image, and taking the feature map F8 as a fusion feature M3 corresponding to the input image.
S609: inputting the feature map F8 corresponding to the input image into the third up-sampling unit to obtain an up-sampling result of the feature map F8, and adding this up-sampling result and the feature map F6 to obtain a fusion feature M2 corresponding to the input image.
S610: inputting the fusion feature M2 corresponding to the input image into the second up-sampling unit to obtain an up-sampling result of the fusion feature M2, and adding this up-sampling result and the feature map F4 to obtain a fusion feature M1 corresponding to the input image.
S611: inputting the fusion feature M1 corresponding to the input image into the first up-sampling unit to obtain an up-sampling result of the fusion feature M1, and adding this up-sampling result and the feature map F2 to obtain a fusion feature M0 corresponding to the input image.
S612: taking the fusion feature M3, the fusion feature M2, the fusion feature M1 and the fusion feature M0 corresponding to the input image as the feature extraction information corresponding to the input image.
In a specific application scenario, any input image Ii (i = 1, 2, 3 … n) of the plurality of images in the data set passes through Stage1 to generate a feature map F1. F1 serves as the input feature of Stage1_1, which lies laterally beside Stage1, and passing F1 through Stage1_1 generates the feature map F2. F1 passes through Stage2 to generate the feature map F3; F3 and F2 are added to obtain the input feature of Stage2_2, which lies laterally beside Stage2, and passing it through Stage2_2 generates the feature map F4. F3 passes through Stage3 to generate the feature map F5; F5 and F4 are added to obtain the input feature of Stage3_3, which lies laterally beside Stage3, and passing it through Stage3_3 generates the feature map F6. F5 passes through Stage4 to generate the feature map F7; F7 and F6 are added to obtain the input feature of Stage4_4, which lies laterally beside Stage4, and passing it through Stage4_4 generates the feature map F8. F2, F4, F6 and F8 produced by the above process are extracted. F8 is up-sampled to form a feature map with the same size and the same number of channels as F6, and the two are added to obtain M2, which fuses the Stage4_4 and Stage3_3 stages; M2 is up-sampled to form a feature map with the same size and the same number of channels as F4, and the two are added to obtain M1, which fuses the Stage3_3 and Stage2_2 stages; M1 is up-sampled to form a feature map with the same size and the same number of channels as F2, and the two are added to obtain M0, which fuses the Stage2_2 and Stage1_1 stages; F8 is directly output as M3.
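The forward pass just described can be sketched as follows. The StageX modules are placeholders for the backbone blocks of the two parallel branches, and upsample stands for the up-sampling units (e.g. the CARAFE operator discussed next); the sketch assumes, as the text does, that each pair of added feature maps has matching size and channel count.

```python
import torch.nn as nn
import torch.nn.functional as F

class FusionBackbone(nn.Module):
    """Minimal sketch of the Stage1..Stage4 / Stage1_1..Stage4_4 composition with
    top-down fusion into M0..M3; the stage modules themselves are assumed given."""
    def __init__(self, stages, parallel_stages, upsample=None):
        super().__init__()
        self.stages = nn.ModuleList(stages)              # Stage1, Stage2, Stage3, Stage4
        self.parallel = nn.ModuleList(parallel_stages)   # Stage1_1, Stage2_2, Stage3_3, Stage4_4
        self.up = upsample or (lambda x: F.interpolate(x, scale_factor=2, mode="nearest"))

    def forward(self, x):
        s1, s2, s3, s4 = self.stages
        p1, p2, p3, p4 = self.parallel
        f1 = s1(x)
        f2 = p1(f1)
        f3 = s2(f1)
        f4 = p2(f3 + f2)
        f5 = s3(f3)
        f6 = p3(f5 + f4)
        f7 = s4(f5)
        f8 = p4(f7 + f6)
        m3 = f8                      # M3 is F8 itself
        m2 = self.up(f8) + f6        # fuse the Stage4_4 and Stage3_3 stages
        m1 = self.up(m2) + f4        # fuse the Stage3_3 and Stage2_2 stages
        m0 = self.up(m1) + f2        # fuse the Stage2_2 and Stage1_1 stages
        return m0, m1, m2, m3        # feature extraction information
```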
In a specific application scenario, as shown in fig. 8, the up-sampling method employs CARAFE. For an input feature map of shape H × W × C, a 1 × 1 convolution is first used to compress the number of channels of the input feature map to H × W × Cm; this channel-compressing convolution reduces the amount of computation in the subsequent steps.
For the compressed input feature map, a kencoder × kencoder convolution layer is used to predict the up-sampling kernel, with the number of input channels set to Cm and the number of output channels set to σ²·kup². The channel dimension is then expanded into the spatial dimension, which yields an up-sampling kernel of shape σH × σW × kup². The resulting up-sampling kernel is normalized with softmax so that the convolution kernel weights sum to 1.
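A minimal sketch of this CARAFE-style kernel prediction and reassembly is given below; the scheme follows the description above, while the numeric defaults for Cm, kencoder, kup and σ are example values chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CarafeUpsample(nn.Module):
    """Minimal CARAFE-style up-sampler sketch (kernel prediction + reassembly)."""

    def __init__(self, channels, c_mid=64, k_encoder=3, k_up=5, sigma=2):
        super().__init__()
        self.k_up, self.sigma = k_up, sigma
        # 1x1 conv compresses channels to Cm to reduce the later computation.
        self.compress = nn.Conv2d(channels, c_mid, kernel_size=1)
        # k_encoder x k_encoder conv predicts sigma^2 * k_up^2 kernel weights per position.
        self.encoder = nn.Conv2d(c_mid, sigma * sigma * k_up * k_up,
                                 kernel_size=k_encoder, padding=k_encoder // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.sigma, self.k_up
        # Predict and normalise the reassembly kernels.
        kernels = self.encoder(self.compress(x))        # B x (s^2 k^2) x H x W
        kernels = F.pixel_shuffle(kernels, s)           # B x k^2 x sH x sW
        kernels = F.softmax(kernels, dim=1)             # weights at each location sum to 1

        # k x k neighbourhood of every source location, replicated over its
        # s x s output block (nearest-neighbour expansion).
        patches = F.unfold(x, kernel_size=k, padding=k // 2)   # B x (C k^2) x (H W)
        patches = patches.view(b, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")
        patches = patches.view(b, c, k * k, h * s, w * s)

        out = (patches * kernels.unsqueeze(1)).sum(dim=2)      # B x C x sH x sW
        return out


y = CarafeUpsample(channels=64)(torch.randn(1, 64, 32, 32))    # -> 1 x 64 x 64 x 64
```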
Thus, a plurality of fusion features corresponding to the input image are used as feature extraction information corresponding to the input image.
Referring to fig. 10 to 14, in some alternative embodiments, the dual-headed structure includes a convolution layer, a first-stage network and a second-stage network; the first-stage network includes a bounding box extraction unit, a two-class classification network and a first-stage regression network, and the second-stage network includes first to third multi-classification networks and first to third regression networks. The step S402 may include steps S701 to S706.
S701: and inputting the feature extraction information corresponding to the input image into the convolution layer to obtain a convolution result corresponding to the input image.
S702: and inputting the convolution result corresponding to the input image into the boundary box extraction unit to obtain the first-stage boundary box information corresponding to the input image.
S703: and acquiring second-stage boundary box information corresponding to the input image by using the first-stage boundary box information corresponding to the input image, the two-classification network and the first-stage regression network.
S704: and acquiring first bounding box information corresponding to the input image by using the second-stage bounding box information corresponding to the input image, the first multi-classification network and the first regression network.
S705: and acquiring second bounding box information corresponding to the input image by using the first bounding box information corresponding to the input image, the second multi-classification network and the second regression network.
S706: and acquiring the prediction detection information corresponding to the input image by using the second bounding box information corresponding to the input image, the third multi-classification network and the third regression network.
In a specific application scenario, as shown in fig. 11 to 14, a 3 × 3 convolution is applied to the fusion features M0, M1, M2 and M3 corresponding to the input image, and the results are then fed into the first-stage network and the second-stage network respectively, where the first-stage network is an RPN (Region Proposal Network) and the second-stage network is a Cascade RCNN. In the first stage, the convolution result corresponding to the input image is input into the bounding box extraction unit to obtain the first-stage bounding box information corresponding to the input image: a number of anchors with fixed sizes and fixed aspect ratios are set manually as predicted bounding boxes, and bounding box information (proposals) with higher confidence is then screened out of these anchors by a classification network and a regression network to serve as the bounding boxes of the second stage. The classification network here is a binary classification network that only predicts the probability that a target is present in an anchor, while the regression network predicts the offset, i.e., the deviation between an anchor that may contain a target and the ground-truth bounding box of that target. Similarly, the second-stage network uses the bounding box information (proposals) as predicted bounding boxes and then screens out the final bounding boxes from the proposals through a classification network and a regression network. The number of classes of the multi-classification network depends on the number of classes to be detected in the data set, and the regression network predicts the offsets between all proposals and the ground-truth bounding boxes.
The second stage uses a three-level cascade network for prediction, in which the first, second and third multi-classification networks are FC-heads and the first, second and third regression networks are Conv-heads. The output of the first-level network is the first bounding box information (proposals1), which serves as the input bounding box information of the second-level network; the output of the second-level network is the second bounding box information (proposals2), which serves as the input bounding box information of the third-level network; and the output of the third-level network is the prediction detection information corresponding to the input image. The FC-head is used as the classification network and the Conv-head is used as the regression network.
Since the classification task usually needs more semantic image information while the regression task needs more spatial information, adopting the double-head structure takes these different requirements into account, and the detection effect is improved accordingly.
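As a purely illustrative sketch of one cascade stage with this double-head pairing (RoI feature extraction, anchor generation, proposal sampling and loss computation are omitted, and all layer sizes and the class count are assumptions, not values from the patent):

```python
import torch
import torch.nn as nn


class DoubleHeadStage(nn.Module):
    """One cascade stage pairing an FC classification head with a Conv regression head.

    Input is assumed to be per-RoI features of shape (num_rois, channels, 7, 7).
    """

    def __init__(self, channels=256, num_classes=80):
        super().__init__()
        # FC-head: used for classification (benefits from global, semantic reasoning).
        self.fc_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * 7 * 7, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        self.cls_score = nn.Linear(1024, num_classes + 1)    # +1 for background

        # Conv-head: used for regression (preserves spatial information).
        self.conv_head = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.bbox_delta = nn.Linear(channels, 4)              # box regression offsets

    def forward(self, roi_feats):
        cls_logits = self.cls_score(self.fc_head(roi_feats))
        box_deltas = self.bbox_delta(self.conv_head(roi_feats))
        return cls_logits, box_deltas


# Three such stages applied in cascade: the boxes refined by one stage are used
# (after RoI feature re-extraction) as the proposals fed to the next stage.
stages = nn.ModuleList(DoubleHeadStage() for _ in range(3))
```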
Referring to fig. 15, an embodiment of the present application further provides an object detection method, which includes steps S11 to S12.
S11: and acquiring an image to be detected. In some embodiments, the image to be detected may comprise any one of the following: a monitoring image; a traffic image stored in a camera storage device; an image transmitted back by an aircraft.
S12: and inputting the image to be detected into a target detection model to obtain the prediction detection information corresponding to the image to be detected. The target detection model is trained by using any one of the model training methods described above.
Therefore, the image to be detected is input into the target detection model, and the prediction detection information corresponding to the image to be detected can be accurately and stably obtained.
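A minimal usage sketch of this step (the file names and the saved-model format here are hypothetical, not part of the disclosure):

```python
import torch
from torchvision.io import read_image

# Hypothetical file names; `detector` is assumed to be the trained target
# detection model produced by the training method described above.
detector = torch.load("target_detection_model.pt", map_location="cpu")
detector.eval()

image = read_image("image_to_detect.jpg").float() / 255.0    # C x H x W, values in [0, 1]
with torch.no_grad():
    prediction = detector(image.unsqueeze(0))                # predicted boxes / scores / classes
```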
In one embodiment, the flow of the target detection method is shown in FIG. 16. First, data are obtained; the data may be images of driving scenes or images to be detected collected by unmanned aerial vehicles. Data augmentation (image enhancement) is performed on the images to be detected to remove their noise. The images to be detected, which may include labeled images and unlabeled images, are input into an encoder to extract the corresponding first feature information (featL) and second feature information (featU) of the images to be detected. The first feature information (featL) is applied to a double-headed Cascade RCNN (Cascade RCNN with Double-Head) to learn supervised object detection using the detector network and to obtain the prediction detection information corresponding to the image to be detected, while both the first feature information (featL) and the second feature information (featU) are forwarded to a gradient inversion layer (GRL) and a Domain Discriminator to learn domain-invariant features in an adversarial manner.
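A minimal sketch of the gradient inversion layer (GRL) and Domain Discriminator mentioned above is shown below; the layer sizes, the single-logit output and the λ weighting are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Gradient inversion layer (GRL): identity in the forward pass, negated gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class DomainDiscriminator(nn.Module):
    """Small classifier predicting which domain a feature vector comes from."""

    def __init__(self, feat_dim=256, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 1),    # one logit: source/augmented domain vs. target domain
        )

    def forward(self, features):
        reversed_feats = GradientReversal.apply(features, self.lambd)
        return self.classifier(reversed_feats)
```

During training, a binary cross-entropy loss on the discriminator output trains the discriminator to tell the feature domains apart, while the reversed gradient pushes the feature extractor toward features the discriminator cannot separate, i.e., domain-invariant features.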
Referring to fig. 17, an embodiment of the present application further provides a model training apparatus, and a specific implementation manner of the model training apparatus is consistent with the implementation manner and the achieved technical effect described in the embodiment of the model training method, and details of a part of the implementation manner and the achieved technical effect are not repeated.
The model training device is used for training a preset deep neural network, the preset deep neural network comprises a prediction module, the prediction module uses Cascade RCNN and uses CBNet as a feature extraction network of the Cascade RCNN, and the model training device comprises: a training data set part 101, configured to obtain a training data set, where each piece of training data in the training data set includes a training image and label detection information corresponding to the training image, and the label detection information corresponding to the training image includes label classification information and label bounding box information corresponding to the training image; and a model training part 102, configured to train the preset deep neural network by using the training data set, so as to obtain a target detection model.
Referring to fig. 18, in some optional embodiments, the preset deep neural network may further include a data augmentation module, and the model training part 102 may include: an augmented image module 201, configured to input at least one training image into the data augmentation module to obtain an augmented image corresponding to the at least one training image; a first training module 202, configured to train the preset deep neural network by using at least one of the training images and corresponding label detection information thereof as a source domain and using an augmented image corresponding to at least one of the training images as an augmented domain, so as to reduce a data distribution difference between the augmented domain and the source domain; the annotation information acquisition module 203 is configured to acquire annotation detection information of an augmented image corresponding to at least one of the training images; a target domain obtaining module 204, configured to obtain a target domain; a second training module 205, configured to use an augmented image corresponding to at least one of the training images and label detection information corresponding to the at least one training image as a new augmented domain, and train the preset deep neural network by using the new augmented domain and the target domain to reduce a data distribution difference between the augmented domain and the target domain; and the target detection module 206 is configured to use the trained preset deep neural network as the target detection model.
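Only as an outline of the two training phases carried out by these modules (every function name below is a hypothetical placeholder rather than an API defined by the patent):

```python
def train_target_detection_model(model, source_images, source_labels, target_images,
                                 augment, label_augmented, train_adversarial):
    """Hypothetical outline of the two-phase training; all helpers are placeholders."""
    # Phase 1: source domain (labelled training images) vs. augmented domain.
    augmented_images = [augment(img) for img in source_images]
    train_adversarial(model,
                      labelled=(source_images, source_labels),
                      unlabelled=augmented_images)

    # Obtain annotation detection information for the augmented images
    # (e.g. by carrying the source labels over to their augmented versions).
    augmented_labels = [label_augmented(lbl) for lbl in source_labels]

    # Phase 2: the now-labelled augmented domain vs. the (unlabelled) target domain.
    train_adversarial(model,
                      labelled=(augmented_images, augmented_labels),
                      unlabelled=target_images)
    return model
```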
Referring to fig. 19, in some alternative embodiments, the data augmentation module may be a generator, and the preset deep neural network may further include a feature extraction module, a gradient inversion layer, and a domain discriminator; the first training module 202 and the second training module 205 may comprise an adversarial training module, and the adversarial training module includes: a feature extraction sub-module 301, configured to input an input image into the feature extraction module to obtain first feature information and second feature information corresponding to the input image, where the input image has corresponding annotation detection information or does not have corresponding annotation detection information; a first prediction sub-module 302, configured to input the first feature information corresponding to the input image into the prediction module to obtain prediction detection information corresponding to the input image, where the prediction detection information corresponding to the input image includes prediction classification information and prediction bounding box information corresponding to the input image; a first training sub-module 303, configured to train the prediction module based on the annotation detection information and the prediction detection information corresponding to the input image when the input image has the corresponding annotation detection information; a gradient inversion sub-module 304, configured to input the first feature information and the second feature information corresponding to the input image into the gradient inversion layer to obtain gradient inversion information corresponding to the input image; a domain discrimination sub-module 305, configured to input the gradient inversion information corresponding to the input image into the domain discriminator to obtain domain discrimination information corresponding to the input image; and an adversarial learning sub-module 306, configured to train the generator and the domain discriminator in an adversarial learning manner based on the domain discrimination information corresponding to the input image.
Referring to fig. 20, in some alternative embodiments, the prediction module may include a feature extraction network and a dual-headed structure; the first prediction sub-module 302 may include: a feature extraction unit 401, configured to input first feature information corresponding to the input image into the feature extraction network, so as to obtain feature extraction information corresponding to the input image; a double-headed structure unit 402, configured to input feature extraction information corresponding to the input image into the double-headed structure, to obtain prediction detection information corresponding to the input image.
Referring to fig. 21, in some alternative embodiments, the feature extraction unit 401 may include: an image scaling subunit 501, configured to scale the longer of the width and the height of the input image to a preset length value, and to scale the shorter of the width and the height of the input image to any value within a preset length range; an image determination subunit 502, configured to determine a plurality of input images including the input image; an image padding subunit 503, configured to pad the short sides of the remaining input images to a reference value, the reference value being the maximum value of the short sides in the plurality of input images; and a batch input subunit 504, configured to input the plurality of input images into the feature extraction network in a batch manner to obtain feature extraction information corresponding to the plurality of input images, where the feature extraction information corresponding to the plurality of input images includes the feature extraction information corresponding to the input image.
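A rough sketch of this multi-scale resizing and batching scheme is given below; the preset length value and the preset length range used here are example numbers, not values specified in the patent.

```python
import random
import torch
import torch.nn.functional as F


def resize_and_batch(images, long_side=1333, short_range=(640, 800)):
    """Scale the longer side to a preset length and the shorter side to a random
    value in a preset range, then pad up to the batch maximum and stack."""
    resized = []
    for img in images:                                    # img: C x H x W tensor
        _, h, w = img.shape
        short = random.randint(*short_range)
        # Note: following the literal description, the two sides are scaled
        # independently, so the aspect ratio is not necessarily preserved.
        new_h, new_w = (long_side, short) if h >= w else (short, long_side)
        resized.append(F.interpolate(img[None], size=(new_h, new_w),
                                     mode="bilinear", align_corners=False)[0])

    # Pad every image up to the largest height/width in the batch.
    max_h = max(img.shape[1] for img in resized)
    max_w = max(img.shape[2] for img in resized)
    padded = [F.pad(img, (0, max_w - img.shape[2], 0, max_h - img.shape[1]))
              for img in resized]
    return torch.stack(padded)                            # B x C x max_h x max_w
```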
Referring to FIG. 22, in some alternative embodiments, the feature extraction network may include Stage1, Stage2, Stage3, Stage4, Stage1_1, Stage2_2, Stage3_3, Stage4_4 and first to third up-sampling units, and the feature extraction unit 401 may include: a first feature map subunit 601, configured to input the first feature information corresponding to the input image into Stage1 to obtain a feature map F1 corresponding to the input image; a second feature map subunit 602, configured to input the feature map F1 corresponding to the input image into Stage1_1 to obtain a feature map F2 corresponding to the input image; a third feature map subunit 603, configured to input the feature map F1 corresponding to the input image into Stage2 to obtain a feature map F3 corresponding to the input image; a fourth feature map subunit 604, configured to add the feature map F3 corresponding to the input image and the feature map F2 and input the result into Stage2_2 to obtain a feature map F4 corresponding to the input image; a fifth feature map subunit 605, configured to input the feature map F3 corresponding to the input image into Stage3 to obtain a feature map F5 corresponding to the input image; a sixth feature map subunit 606, configured to add the feature map F5 corresponding to the input image and the feature map F4 and input the result into Stage3_3 to obtain a feature map F6 corresponding to the input image; a seventh feature map subunit 607, configured to input the feature map F5 corresponding to the input image into Stage4 to obtain a feature map F7 corresponding to the input image; an eighth feature map subunit 608, configured to add the feature map F7 corresponding to the input image and the feature map F6 and input the result into Stage4_4 to obtain a feature map F8 corresponding to the input image, and to take the feature map F8 corresponding to the input image as a fusion feature M3 corresponding to the input image; a third sampling subunit 609, configured to input the feature map F8 corresponding to the input image into the third up-sampling unit to obtain an up-sampled feature map F8 corresponding to the input image, and to add the up-sampled feature map F8 corresponding to the input image and the feature map F6 corresponding to the input image to obtain a fusion feature M2 corresponding to the input image; a second sampling subunit 610, configured to input the fusion feature M2 corresponding to the input image into the second up-sampling unit to obtain an up-sampled fusion feature M2 corresponding to the input image, and to add the up-sampled fusion feature M2 corresponding to the input image and the feature map F4 corresponding to the input image to obtain a fusion feature M1 corresponding to the input image; a first sampling subunit 611, configured to input the fusion feature M1 corresponding to the input image into the first up-sampling unit to obtain an up-sampled fusion feature M1 corresponding to the input image, and to add the up-sampled fusion feature M1 corresponding to the input image and the feature map F2 corresponding to the input image to obtain a fusion feature M0 corresponding to the input image; and a feature information subunit 612, configured to take the fusion feature M3 corresponding to the input image, the fusion feature M2 corresponding to the input image, the fusion feature M1 corresponding to the input image and the fusion feature M0 corresponding to the input image as the feature extraction information corresponding to the input image.
Referring to fig. 23, in some alternative embodiments, the dual-headed structure may include a convolutional layer, a first-stage network and a second-stage network, the first-stage network may include a bounding box extraction unit, a two-class classification network and a first-stage regression network, the second-stage network may include first to third multi-classification networks and first to third regression networks, and the dual-headed structure unit 402 may include: a convolution subunit 701, configured to input feature extraction information corresponding to the input image into the convolution layer, so as to obtain a convolution result corresponding to the input image; a first bounding box subunit 702, configured to input the convolution result corresponding to the input image into the bounding box extraction unit, so as to obtain first-stage bounding box information corresponding to the input image; a second bounding box subunit 703, configured to obtain second-stage bounding box information corresponding to the input image by using the first-stage bounding box information corresponding to the input image, the two-class classification network, and the first-stage regression network; a first information subunit 704, configured to obtain first bounding box information corresponding to the input image by using the second-stage bounding box information corresponding to the input image, the first multi-classification network, and the first regression network; a second information subunit 705, configured to obtain second bounding box information corresponding to the input image by using the first bounding box information corresponding to the input image, the second multi-classification network, and the second regression network; and an information prediction subunit 706, configured to obtain prediction detection information corresponding to the input image by using the second bounding box information corresponding to the input image, the third multi-classification network, and the third regression network.
Referring to fig. 24, an embodiment of the present application further provides a target detection apparatus, and a specific implementation manner of the target detection apparatus is consistent with the implementation manner and the achieved technical effect described in the embodiment of the target detection method, and a part of the content is not described again.
The object detection device includes: the image module to be detected 11 is used for acquiring an image to be detected; the image prediction module 12 is configured to input the image to be detected into a target detection model, so as to obtain prediction detection information corresponding to the image to be detected; wherein, the target detection model is obtained by training by using any one of the model training methods.
Referring to fig. 25, an embodiment of the present application further provides an electronic device 200, where the electronic device 200 includes at least one memory 210, at least one processor 220, and a bus 230 connecting different platform systems.
The memory 210 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)211 and/or cache memory 212, and may further include Read Only Memory (ROM) 213.
The memory 210 further stores a computer program, and the computer program can be executed by the processor 220, so that the processor 220 executes the steps of the model training method or the target detection method in the embodiment of the present application, and a specific implementation manner of the method is consistent with the implementation manner and the achieved technical effect described in the embodiment of the model training method or the target detection method, and details of some of the contents are not repeated.
Memory 210 may also include a utility 214 having at least one program module 215, such program modules 215 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Accordingly, the processor 220 may execute the computer programs described above, and may execute the utility 214.
Bus 230 may be a local bus representing one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or any other type of bus structure.
The electronic device 200 may also communicate with one or more external devices 240, such as a keyboard, pointing device, bluetooth device, etc., and may also communicate with one or more devices capable of interacting with the electronic device 200, and/or with any devices (e.g., routers, modems, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may be through input-output interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.
The embodiment of the present application further provides a computer-readable storage medium, and a specific implementation manner of the computer-readable storage medium is consistent with the implementation manner and the achieved technical effect described in the embodiment of the model training method or the target detection method, and some contents are not repeated.
The computer-readable storage medium is used for storing a computer program or an object detection model; the computer program, when executed, implements the steps of a model training method or a target detection method in embodiments of the present application; the target detection model is obtained by training by any one of the model training methods.
Fig. 26 shows a program product 300 provided by the present embodiment for implementing the above-described model training method or the target detection method, which may employ a portable compact disc read only memory (CD-ROM) and include program codes, and may be executed on a terminal device, such as a personal computer. However, the program product 300 of the present invention is not so limited, and in this application, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Program product 300 may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the C language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
While the present application is described in terms of various aspects, including exemplary embodiments, the principles of the invention should not be limited to the disclosed embodiments, but are also intended to cover various modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

Claims (12)

1. A model training method for training a preset deep neural network, the preset deep neural network including a prediction module using Cascade RCNN and using CBNet as a feature extraction network of Cascade RCNN, the model training method comprising:
acquiring a training data set, wherein each training data in the training data set comprises a training image and label detection information corresponding to the training image, and the label detection information corresponding to the training image comprises label classification information and label bounding box information corresponding to the training image;
and training the preset deep neural network by using the training data set to obtain a target detection model.
2. The model training method of claim 1, wherein the preset deep neural network further comprises a data augmentation module, and the training of the preset deep neural network with the training data set to obtain the target detection model comprises:
inputting at least one training image into the data augmentation module to obtain an augmentation image corresponding to the at least one training image;
taking at least one training image and the corresponding label detection information thereof as a source domain, taking an augmented image corresponding to at least one training image as an augmented domain, and training the preset deep neural network by using the source domain and the augmented domain so as to reduce the data distribution difference between the augmented domain and the source domain;
acquiring label detection information of an augmented image corresponding to at least one training image;
acquiring a target domain;
taking an augmented image corresponding to at least one training image and label detection information corresponding to the augmented image as a new augmented domain, and training the preset deep neural network by using the new augmented domain and the target domain to reduce the data distribution difference between the augmented domain and the target domain;
and taking the trained preset deep neural network as the target detection model.
3. The model training method of claim 2, wherein the data augmentation module is a generator, the preset deep neural network further comprises a feature extraction module, a gradient inversion layer and a domain discriminator, and the training of the preset deep neural network comprises:
inputting an input image into the feature extraction module to obtain first feature information and second feature information corresponding to the input image, wherein the input image has corresponding label detection information or does not have corresponding label detection information;
inputting first feature information corresponding to the input image into the prediction module to obtain prediction detection information corresponding to the input image, wherein the prediction detection information corresponding to the input image comprises prediction classification information and prediction bounding box information corresponding to the input image;
when the input image has corresponding label detection information, training the prediction module based on the label detection information and the prediction detection information corresponding to the input image;
inputting first characteristic information and second characteristic information corresponding to the input image into the gradient inversion layer to obtain gradient inversion information corresponding to the input image;
inputting the gradient inversion information corresponding to the input image into the domain discriminator to obtain domain discrimination information corresponding to the input image;
training the generator and the domain discriminator in a counterlearning manner based on domain discrimination information corresponding to the input image.
4. The model training method of claim 3, wherein the prediction module comprises a feature extraction network and a dual-headed structure;
the inputting the first feature information corresponding to the input image into the prediction module to obtain the prediction detection information corresponding to the input image includes:
inputting first feature information corresponding to the input image into the feature extraction network to obtain feature extraction information corresponding to the input image;
and inputting the feature extraction information corresponding to the input image into the double-head structure to obtain the prediction detection information corresponding to the input image.
5. The model training method according to claim 4, wherein the inputting first feature information corresponding to the input image into the feature extraction network to obtain feature extraction information corresponding to the input image comprises:
the long sides in the width and the height of the input image are zoomed to preset length values, and the short sides in the width and the height of the input image are zoomed to any value in a preset length range;
determining a plurality of input images including the input image;
filling the short sides of the remaining input images to a reference value by taking the maximum value of the short sides in the plurality of input images as the reference value;
inputting the plurality of input images into the feature extraction network in a batch mode to obtain feature extraction information corresponding to the plurality of input images, wherein the feature extraction information corresponding to the plurality of input images comprises the feature extraction information corresponding to the input images.
6. The model training method of claim 4, wherein the feature extraction network comprises Stage1, Stage2, Stage3, Stage4, Stage1_1, Stage2_2, Stage3_3, Stage4_4 and first to third up-sampling units, and wherein the step of inputting the first feature information corresponding to the input image into the feature extraction network to obtain the feature extraction information corresponding to the input image comprises the steps of:
inputting the first feature information corresponding to the input image into Stage1 to obtain a feature map F1 corresponding to the input image;
inputting the feature map F1 corresponding to the input image into Stage1_1 to obtain a feature map F2 corresponding to the input image;
inputting the feature map F1 corresponding to the input image into Stage2 to obtain a feature map F3 corresponding to the input image;
adding the feature map F3 corresponding to the input image and the feature map F2 and inputting the result into Stage2_2 to obtain a feature map F4 corresponding to the input image;
inputting the feature map F3 corresponding to the input image into Stage3 to obtain a feature map F5 corresponding to the input image;
adding the feature map F5 corresponding to the input image and the feature map F4 and inputting the result into Stage3_3 to obtain a feature map F6 corresponding to the input image;
inputting the feature map F5 corresponding to the input image into Stage4 to obtain a feature map F7 corresponding to the input image;
adding the feature map F7 corresponding to the input image and the feature map F6 and inputting the result into Stage4_4 to obtain a feature map F8 corresponding to the input image, and taking the feature map F8 corresponding to the input image as a fusion feature M3 corresponding to the input image;
inputting the feature map F8 corresponding to the input image into the third up-sampling unit to obtain an up-sampled feature map F8 corresponding to the input image, and adding the up-sampled feature map F8 corresponding to the input image and the feature map F6 corresponding to the input image to obtain a fusion feature M2 corresponding to the input image;
inputting the fusion feature M2 corresponding to the input image into the second up-sampling unit to obtain an up-sampled fusion feature M2 corresponding to the input image, and adding the up-sampled fusion feature M2 corresponding to the input image and the feature map F4 corresponding to the input image to obtain a fusion feature M1 corresponding to the input image;
inputting the fusion feature M1 corresponding to the input image into the first up-sampling unit to obtain an up-sampled fusion feature M1 corresponding to the input image, and adding the up-sampled fusion feature M1 corresponding to the input image and the feature map F2 corresponding to the input image to obtain a fusion feature M0 corresponding to the input image;
and taking the fusion feature M3 corresponding to the input image, the fusion feature M2 corresponding to the input image, the fusion feature M1 corresponding to the input image and the fusion feature M0 corresponding to the input image as the feature extraction information corresponding to the input image.
7. The model training method of claim 4, wherein the double-headed structure comprises a convolutional layer, a first-stage network and a second-stage network, the first-stage network comprises a bounding box extraction unit, a two-class network and a first-stage regression network, the second-stage network comprises a first multi-class network to a third multi-class network and a first regression network to a third regression network, and the inputting the feature extraction information corresponding to the input image into the double-headed structure to obtain the prediction detection information corresponding to the input image comprises:
inputting the feature extraction information corresponding to the input image into the convolution layer to obtain a convolution result corresponding to the input image;
inputting the convolution result corresponding to the input image into the bounding box extraction unit to obtain first-stage bounding box information corresponding to the input image;
acquiring second-stage boundary box information corresponding to the input image by using the first-stage boundary box information corresponding to the input image, the two-classification network and the first-stage regression network;
acquiring first bounding box information corresponding to the input image by utilizing second-stage bounding box information corresponding to the input image, the first multi-classification network and the first regression network;
acquiring second bounding box information corresponding to the input image by using the first bounding box information corresponding to the input image, the second multi-classification network and the second regression network;
and acquiring the prediction detection information corresponding to the input image by using the second bounding box information corresponding to the input image, the third multi-classification network and the third regression network.
8. An object detection method, characterized in that the object detection method comprises:
acquiring an image to be detected;
inputting the image to be detected into a target detection model to obtain the corresponding prediction detection information of the image to be detected;
wherein the object detection model is trained by the model training method according to any one of claims 1 to 7.
9. A model training apparatus for training a preset deep neural network including a prediction module using Cascade RCNN and using CBNet as a feature extraction network of Cascade RCNN, the model training apparatus comprising:
the training data set part is used for acquiring a training data set, each piece of training data in the training data set comprises a training image and label detection information corresponding to the training image, and the label detection information corresponding to the training image comprises label classification information and label boundary box information corresponding to the training image;
and the model training part is used for training the preset deep neural network by using the training data set to obtain a target detection model.
10. An object detection apparatus, characterized in that the object detection apparatus comprises:
the image module to be detected is used for acquiring an image to be detected;
the image prediction module is used for inputting the image to be detected into a target detection model to obtain prediction detection information corresponding to the image to be detected;
wherein the object detection model is trained by the model training method according to any one of claims 1 to 7.
11. An electronic device, characterized in that the electronic device comprises a memory storing a computer program and a processor implementing the steps of the model training method according to any one of claims 1 to 7 or the steps of the object detection method according to claim 8 when the computer program is executed.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program or an object detection model;
the computer program when being executed by a processor performs the steps of the model training method of any one of claims 1 to 7 or the steps of the object detection method of claim 8;
the object detection model is trained by using the model training method of any one of claims 1 to 7.
CN202110963178.6A 2021-08-20 2021-08-20 Model training method, target detection method and related device Pending CN113869361A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110963178.6A CN113869361A (en) 2021-08-20 2021-08-20 Model training method, target detection method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110963178.6A CN113869361A (en) 2021-08-20 2021-08-20 Model training method, target detection method and related device

Publications (1)

Publication Number Publication Date
CN113869361A true CN113869361A (en) 2021-12-31

Family

ID=78987994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110963178.6A Pending CN113869361A (en) 2021-08-20 2021-08-20 Model training method, target detection method and related device

Country Status (1)

Country Link
CN (1) CN113869361A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109977918A (en) * 2019-04-09 2019-07-05 华南理工大学 A kind of target detection and localization optimization method adapted to based on unsupervised domain
WO2021244079A1 (en) * 2020-06-02 2021-12-09 苏州科技大学 Method for detecting image target in smart home environment
CN111898668A (en) * 2020-07-24 2020-11-06 佛山市南海区广工大数控装备协同创新研究院 Small target object detection method based on deep learning
CN111814754A (en) * 2020-08-18 2020-10-23 深延科技(北京)有限公司 Single-frame image pedestrian detection method and device for night scene
CN111814755A (en) * 2020-08-18 2020-10-23 深延科技(北京)有限公司 Multi-frame image pedestrian detection method and device for night motion scene
CN112215255A (en) * 2020-09-08 2021-01-12 深圳大学 Training method of target detection model, target detection method and terminal equipment
CN112365497A (en) * 2020-12-02 2021-02-12 上海卓繁信息技术股份有限公司 High-speed target detection method and system based on Trident Net and Cascade-RCNN structures

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAN-KAI HSU ET AL.: "Progressive Domain Adaptation for Object Detection", Workshop on Applications of Computer Vision, IEEE, 31 December 2020 (2020-12-31), pages 749-757 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067370A (en) * 2022-01-17 2022-02-18 北京新氧科技有限公司 Neck shielding detection method and device, electronic equipment and storage medium
CN114764874A (en) * 2022-04-06 2022-07-19 北京百度网讯科技有限公司 Deep learning model training method, object recognition method and device
CN114764874B (en) * 2022-04-06 2023-04-07 北京百度网讯科技有限公司 Deep learning model training method, object recognition method and device
CN117115170A (en) * 2023-10-25 2023-11-24 安徽大学 Self-adaptive SAR ship detection method and system in unsupervised domain
CN117115170B (en) * 2023-10-25 2024-01-12 安徽大学 Self-adaptive SAR ship detection method and system in unsupervised domain

Similar Documents

Publication Publication Date Title
CN113869361A (en) Model training method, target detection method and related device
CN108304835B (en) character detection method and device
CN108229341B (en) Classification method and device, electronic equipment and computer storage medium
CN111931664A (en) Mixed note image processing method and device, computer equipment and storage medium
CN109886330B (en) Text detection method and device, computer readable storage medium and computer equipment
CN115731533B (en) Vehicle-mounted target detection method based on improved YOLOv5
US20180285689A1 (en) Rgb-d scene labeling with multimodal recurrent neural networks
CN113159091B (en) Data processing method, device, electronic equipment and storage medium
CN113095346A (en) Data labeling method and data labeling device
CN112528961B (en) Video analysis method based on Jetson Nano
CN110533046B (en) Image instance segmentation method and device, computer readable storage medium and electronic equipment
KR102497361B1 (en) Object detecting system and method
CN115658955B (en) Cross-media retrieval and model training method, device, equipment and menu retrieval system
CN112749666A (en) Training and motion recognition method of motion recognition model and related device
CN115019314A (en) Commodity price identification method, device, equipment and storage medium
CN115131634A (en) Image recognition method, device, equipment, storage medium and computer program product
CN110796003B (en) Lane line detection method and device and electronic equipment
CN112288702A (en) Road image detection method based on Internet of vehicles
CN113762455A (en) Detection model training method, single character detection method, device, equipment and medium
CN108596068B (en) Method and device for recognizing actions
JP2023036795A (en) Image processing method, model training method, apparatus, electronic device, storage medium, computer program, and self-driving vehicle
CN113191364B (en) Vehicle appearance part identification method, device, electronic equipment and medium
CN113762292B (en) Training data acquisition method and device and model training method and device
CN114666656A (en) Video clipping method, video clipping device, electronic equipment and computer readable medium
CN114708429A (en) Image processing method, image processing device, computer equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination