CN115331097A - Image detection model training method and device, and image detection method


Info

Publication number
CN115331097A
CN115331097A
Authority
CN
China
Prior art keywords
image
domain
instance
target
level
Prior art date
Legal status
Pending
Application number
CN202210806731.XA
Other languages
Chinese (zh)
Inventor
杨刚 (Yang Gang)
王艺莎 (Wang Yisha)
卢昊 (Lu Hao)
Current Assignee
Beijing Forestry University
Original Assignee
Beijing Forestry University
Priority date
Filing date
Publication date
Application filed by Beijing Forestry University filed Critical Beijing Forestry University
Priority to CN202210806731.XA
Publication of CN115331097A
Legal status: Pending

Classifications

    • G06V 20/188 Vegetation (Scenes; Terrestrial scenes)
    • G06N 3/08 Learning methods (Computing arrangements based on biological models; Neural networks)
    • G06T 7/0002 Inspection of images, e.g. flaw detection (Image analysis)
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT] (Extraction of image or video features)
    • G06V 10/764 Recognition or understanding using classification, e.g. of video objects
    • G06V 10/766 Recognition or understanding using regression, e.g. by projecting features on hyperplanes
    • G06V 10/82 Recognition or understanding using neural networks
    • G06T 2207/20081 Training; Learning (Special algorithmic details)
    • G06T 2207/20084 Artificial neural networks [ANN] (Special algorithmic details)
    • G06V 2201/07 Target detection


Abstract

The embodiment of the invention provides an image detection model training method, an image detection model training device, and an image detection method. The image detection model training comprises the following steps: performing feature extraction on a sample image through a feature extraction network to obtain a first feature map; inputting the first feature map into an image-level domain classifier, and updating the first feature map according to the attention weight of each region in the first feature map to obtain a second feature map; inputting the second feature map into an instance detection network to obtain the position area and category classification result of each instance; generating a third feature map according to the second feature map; inputting the third feature map into a target instance-level domain classifier corresponding to a target instance category label to obtain a domain category prediction result output by the target instance-level domain classifier; and training a detection network according to the domain category prediction results, the category classification results, and the regression parameters, so as to obtain a target detection model for performing instance detection on to-be-detected images of the target domain.

Description

Image detection model training method and device and image detection method
Technical Field
The invention relates to the technical field of computers, in particular to an image detection model training method and device and an image detection method.
Background
The crown is an important component of a tree. Accurately identifying crown information can play an active role in monitoring tree growth, preventing tree pests and diseases, and predicting tree biomass and sub-compartment stock volume. Meanwhile, the crown is an important factor in constructing stand models; once the crown width is identified, the stand density can be effectively predicted, the competition among trees can be estimated, and so on.
At present, crowns are usually detected with deep learning methods. However, current deep learning methods impose strict requirements on the shooting conditions of training and test samples: they are only applicable to crown detection in a specific area and perform poorly on sample images from different areas or different imaging sources.
Disclosure of Invention
The embodiment of the invention provides an image detection model training method, an image detection model training device and an image detection method, which are used for improving the detection effect of an image detection model.
In a first aspect, an embodiment of the present invention provides an image detection model training method, which is applied to a detection network, where the detection network includes a target detection model, an image-level domain classifier, and instance-level domain classifiers corresponding to different instance classes, and the method includes:
obtaining a sample image and the target detection model to be trained, wherein the sample image comprises a source domain image and a target domain image, the source domain image has an instance category label, the target domain image does not have the instance category label, the source domain image and the target domain image both have the domain category label, the target detection model is obtained by using the source domain image for training in advance, and the target detection model comprises a feature extraction network and an instance detection network;
performing feature extraction on the sample image through the feature extraction network to obtain a first feature map;
inputting the first feature map into the image-level domain classifier to obtain a domain category prediction result output by the image-level domain classifier, determining attention weights of all regions in the first feature map based on the domain category prediction result, and updating the first feature map according to the attention weights of all regions in the first feature map to obtain a second feature map;
inputting the second feature map into the instance detection network to obtain the position area and category classification result of each instance;
determining the instance category label corresponding to the sample image according to the category classification result of each instance and the domain category label corresponding to the sample image;
generating a third feature map according to the second feature map, wherein the position areas corresponding to the instances of a target instance category label are marked in the third feature map, and the target instance category label is any one of the instance category labels corresponding to the sample image;
inputting the third feature map into the target instance-level domain classifier corresponding to the target instance category label to obtain a domain category prediction result output by the target instance-level domain classifier;
and training the detection network according to the domain category prediction result output by the image-level domain classifier, the domain category prediction result output by the target instance-level domain classifier, and the category classification result and regression parameters of each instance output by the instance detection network, so as to obtain a target detection model for performing instance detection on to-be-detected images of the target domain.
In a second aspect, an embodiment of the present invention provides an image detection model training apparatus, which is applied to a detection network, where the detection network includes a target detection model, an image-level domain classifier, and instance-level domain classifiers corresponding to different instance classes, and the apparatus includes:
an obtaining module, configured to obtain a sample image and the target detection model to be trained, where the sample image includes a source domain image and a target domain image, the source domain image has an instance category label, the target domain image does not have an instance category label, both the source domain image and the target domain image have a domain category label, the target detection model is obtained by using the source domain image for training in advance, and the target detection model includes a feature extraction network and an instance detection network;
the extraction module is used for extracting the characteristics of the sample image through the characteristic extraction network to obtain a first characteristic diagram;
the updating module is used for inputting the first feature map into the image-level domain classifier to obtain a domain category prediction result output by the image-level domain classifier, determining the attention weight of each region in the first feature map based on the domain category prediction result, and updating the first feature map according to the attention weight of each region in the first feature map to obtain a second feature map;
the processing module is used for inputting the second feature map into the instance detection network to obtain the position area and category classification result of each instance;
a determining module, configured to determine the instance category label corresponding to the sample image according to the category classification result of each instance and the domain category label corresponding to the sample image;
a generating module, configured to generate a third feature map according to the second feature map, where the position areas corresponding to the instances of a target instance category label are marked in the third feature map, and the target instance category label is any one of the instance category labels corresponding to the sample image;
the processing module is further configured to input the third feature map into the target instance-level domain classifier corresponding to the target instance category label, so as to obtain a domain category prediction result output by the target instance-level domain classifier;
and the training module is used for training the detection network according to the domain category prediction result output by the image-level domain classifier, the domain category prediction result output by the target instance-level domain classifier, and the category classification result and regression parameters of each instance output by the instance detection network, so as to obtain a target detection model for performing instance detection on to-be-detected images of the target domain.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the image detection model training method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to execute the image detection model training method according to the first aspect.
In a fifth aspect, an embodiment of the present invention provides an image detection method, including:
acquiring an image to be detected and a target detection model;
detecting the image to be detected through the target detection model to obtain the instance areas and category classification results corresponding to the instances in the image to be detected;
wherein the target detection model is obtained by training with the image detection model training method of the first aspect.
In a sixth aspect, an embodiment of the present invention provides an image detection apparatus, including:
the acquisition module is used for acquiring an image to be detected and a target detection model;
the detection module is used for detecting the image to be detected through the target detection model, so as to obtain the instance area and category classification result corresponding to each instance in the image to be detected;
wherein the target detection model is obtained by training with the image detection model training method of the first aspect.
In a seventh aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the image detection method of the fifth aspect.
In an eighth aspect, embodiments of the present invention provide a non-transitory machine-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to perform the image detection method according to the fifth aspect.
In the embodiment of the present invention, suppose a detection network comprising a target detection model, an image-level domain classifier, and instance-level domain classifiers corresponding to different instance classes is to be trained, so as to improve the image detection effect of the target detection model in the detection network. First, after the first feature map of a sample image is extracted through the feature extraction network, image-level alignment between the source domain image and the target domain image is performed through the image-level domain classifier to obtain the second feature map, so that transferable regions in the first feature map are highlighted and negative transfer from each region is suppressed. Then, instance-level features in the second feature map are extracted through the instance detection network to obtain the third feature map, and instance-level alignment is performed on the third feature map through the instance-level domain classifiers, so that incorrect alignment is avoided by means of the instance category labels of the instances, further improving the image alignment effect. Finally, the detection network is trained based on the results output by the target detection model, the image-level domain classifier, and the instance-level domain classifiers, thereby improving the image detection effect of the target detection model.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a flowchart of an image detection model training method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a detection network according to an embodiment of the present invention.
Fig. 3 is a flowchart of an image detection method according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an image detection model training apparatus according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of another electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
The image detection model training method and image detection method provided by the embodiments of the present invention can be executed by an electronic device. The electronic device may be a terminal device such as a PC or a notebook computer, or a server. The server may be a physical server or a virtual server, and may be deployed on the user side or in the cloud.
The scheme provided by the embodiments of the present invention can be used to classify and detect images. In brief, after a user (for example, a researcher who needs to classify and detect images) acquires different images, the images can be classified and detected by executing the scheme provided by the embodiments of the present invention, improving the accuracy and efficiency of image classification.
Taking a crown detection scene as an example, rapid and accurate crown detection is of great significance for forestry management and precision forestry. The crown is the main site of tree photosynthesis and an important part of the tree. Accurately identifying crown information plays a useful role in monitoring tree growth, preventing tree pests and diseases, and predicting tree biomass and sub-compartment stock volume. In addition, the crown is an important factor in constructing stand models; once the crown width is identified, the stand density can be effectively predicted, the competition among trees can be estimated, and so on. Therefore, accurate and real-time crown detection over large areas is of great significance in production and research. However, in the crown detection scenario, the huge spatial scale and the variability of the data make traditional deep learning methods difficult to apply to cross-regional crown detection.
Current deep learning methods impose strict requirements on the shooting conditions of training and test samples, are only applicable to crown detection in a specific area, and perform poorly on sample images from different areas or different imaging sources.
It should be noted that the image detection model training method and the image detection method provided by the invention can be applied to a crown detection scene, and can also be applied to image detection scenes in other fields.
The following embodiment is used to describe the training process of the image detection model training method provided by the present invention in detail.
Fig. 1 is a flowchart of an image detection model training method according to an embodiment of the present invention. The training method can be applied to a detection network, wherein the detection network comprises a target detection model, an image-level domain classifier and instance-level domain classifiers corresponding to different instance classes. The method comprises the following steps:
s101, obtaining a sample image and a target detection model to be trained.
S102, carrying out feature extraction on the sample image through a feature extraction network to obtain a first feature map.
S103, inputting the first feature map into an image-level domain classifier to obtain a domain category prediction result output by the image-level domain classifier, determining the attention weight of each area in the first feature map based on the domain category prediction result, and updating the first feature map according to the attention weight of each area in the first feature map to obtain a second feature map.
S104, inputting the second feature map into the instance detection network to obtain the position area and category classification result of each instance.
S105, determining the instance category label corresponding to the sample image according to the category classification result of each instance and the domain category label corresponding to the sample image.
S106, generating a third feature map according to the second feature map.
S107, inputting the third feature map into the target instance-level domain classifier corresponding to the target instance category label to obtain a domain category prediction result output by the target instance-level domain classifier.
S108, training the detection network according to the domain category prediction result output by the image-level domain classifier, the domain category prediction result output by the target instance-level domain classifier, and the category classification result and regression parameters of each instance output by the instance detection network, so as to obtain a target detection model for performing instance detection on to-be-detected images of the target domain.
In order to improve the detection performance of the target detection model on the target domain, the feature distributions of the source domain and the target domain can be aligned step by step through image-level and instance-level alignment between the source domain image and the target domain image. Specifically, the target detection model can be used as a base model, and two kinds of domain classifier structures for feature alignment, namely the image-level domain classifier and the instance-level domain classifiers, can be added to it to achieve the image-level and instance-level alignment described above.
There may be a plurality of instance-level domain classifiers; specifically, their number may be determined according to the number of instance classes in the sample image. Taking the sample image being a crown image in a crown detection scene as an example, the instance categories in the crown image include a crown instance category and a background instance category, so two instance-level domain classifiers may be provided: one corresponding to the crown instance category and one corresponding to the background instance category.
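To make the classifier-per-class arrangement concrete, the following is a minimal sketch in PyTorch. The module names, the 2048-dimensional instance features, and the dropout setting are illustrative assumptions, not details fixed by this embodiment.

```python
# Illustrative sketch (assumptions noted above): one binary domain classifier
# per instance class, applied to pooled instance features.
import torch
import torch.nn as nn

class InstanceDomainClassifier(nn.Module):
    """Binary domain classifier for a pooled instance feature vector."""
    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(1024, 1),  # single logit per instance
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Probability that the instance feature comes from the source domain.
        return torch.sigmoid(self.net(x))

# One classifier for the crown (foreground) class, one for the background class.
instance_domain_classifiers = nn.ModuleDict({
    "crown": InstanceDomainClassifier(),
    "background": InstanceDomainClassifier(),
})
```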
First, a sample image and a target detection model to be trained are acquired. In the present embodiment, the sample image includes a source domain image and a target domain image.
The source domain image and the target domain image are images that contain the same categories and are similar in shooting angle, image texture, or background. Further, the source domain image has instance category labels, while the target domain image does not. In addition, the source domain image and the target domain image each have a domain category label: the source domain image carries a source domain label, and the target domain image carries a target domain label. In this embodiment, both the source domain image and the target domain image are images containing crown information; correspondingly, the instance category labels may include a foreground (crown) category label and a background category label, together with the area coordinate information corresponding to each category label.
The target detection model is obtained in advance by training with source domain images, and includes a feature extraction network and an instance detection network. The instance detection network may include a Region Proposal Network (RPN), which may be used to generate the area coordinate information of each instance. In this embodiment, the target detection model is Faster R-CNN.
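As an illustration of this structure, the sketch below builds a comparable base model from torchvision's Faster R-CNN implementation. Using torchvision, COCO-pretrained weights, and a two-class head (background plus crown) are assumptions for illustration, not part of this embodiment.

```python
# Hedged sketch: torchvision's Faster R-CNN as a stand-in for the base model.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
# Replace the head with a two-class predictor (class 0 is background).
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

backbone = model.backbone    # plays the role of the feature extraction network
rpn = model.rpn              # region proposal network (area coordinates)
roi_heads = model.roi_heads  # classification and box regression heads
```

Splitting the backbone into the first and second feature extraction networks described later would require cutting it at an intermediate layer, which is omitted from this sketch.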
And then, carrying out feature extraction on the sample image through a feature extraction network to obtain a first feature map. After the first feature map is obtained, the first feature map is input to an image-level domain classifier to obtain a domain class prediction result output by the image-level domain classifier, the attention weight of each region in the first feature map is determined based on the domain class prediction result, and the first feature map is updated according to the attention weight of each region in the first feature map to obtain a second feature map.
According to the domain category prediction result output by the image-level domain classifier, it can be determined, for a given region in the source domain image or the target domain image, whether the image-level domain classifier can successfully distinguish that region. If it can, the region carries features representative of the foreground of the source and target domains, so the region can be given a large attention weight, thereby enhancing regions with high transferability and suppressing regions with poor transferability.
Then, the second feature map is input into the instance detection network to obtain the position area and category classification result of each instance. In this embodiment, the position area of each instance may be determined from the area coordinate information generated by the region proposal network. Meanwhile, the instance category label corresponding to the sample image can be determined according to the category classification result of each instance and the domain category label corresponding to the sample image.
Then, a third feature map is generated according to the second feature map. In this embodiment, the third feature map is marked with the position areas corresponding to the instances of the target instance category label, and the target instance category label is any one of the instance category labels corresponding to the sample image. Specifically, after the second feature map is input to the instance detection network to obtain the position areas of the instances, the obtained position areas may be mapped onto the second feature map to obtain the third feature map.
The source domain image has instance category labels, while the target domain image has none. Thus, the instance category labels are obtained differently for sample images of different domain categories. In this embodiment, if the sample image is a source domain image, the instance category labels of the source domain image are used directly; if the sample image is a target domain image, the category classification results of the instances are used as the instance category labels of the target domain image.
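This label-selection rule can be summarized by a short sketch; the dictionary layout of `sample` is hypothetical.

```python
# Sketch of the label-selection rule described above (field names hypothetical):
# source-domain images keep their ground-truth instance class labels, while
# target-domain images fall back to the detector's own class predictions.
def instance_class_labels(sample, predicted_classes):
    if sample["domain"] == "source":        # domain category label
        return sample["instance_labels"]    # annotated instance labels
    return predicted_classes                # pseudo-labels for the target domain
```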
After the instance class labels corresponding to the instances in the third feature map are determined, the third feature map may be input to a target instance-level domain classifier corresponding to the target instance class labels, so as to obtain a domain class prediction result output by the target instance-level domain classifier.
And finally, training the detection network according to the domain class prediction result output by the image-level domain classifier, the domain class prediction result output by the target instance-level domain classifier and the class classification result of each instance output by the instance detection network to obtain a target detection model for performing instance detection on the to-be-detected image of the target domain.
In addition, in another embodiment, the target detection model may be Strong-Weak Faster R-CNN. In that case, the target detection model adds, on the basis of Faster R-CNN, two domain classifier branches attached to the feature extraction network for achieving image-level alignment between the source domain image and the target domain image, and an instance-level domain classifier is added on the fully connected layer for achieving instance-level alignment between the source domain image and the target domain image. The structure of the detection network is constructed on the basis of the above structures.
It should be noted that, in the case that the target detection model is Strong-Weak Faster R-CNN, since Strong-Weak Faster R-CNN already includes domain classifier branches with the same function as the image-level domain classifier described above, the domain classifiers included in Strong-Weak Faster R-CNN serve as the image-level domain classifier described above.
The following describes in detail the training process of the detection network and the training method of the image detection model provided by the present invention with reference to fig. 2.
Fig. 2 is a schematic structural diagram of a detection network according to an embodiment of the present invention. As shown in Fig. 2, the detection network includes the target detection model, a multi-level transferable attention module, and a multi-adversarial instance-level alignment module. The multi-level transferable attention module is used to determine the attention weight of each area in the sample image based on the output of the image-level domain classifier, and to weight the feature map of the corresponding level of the sample image based on the attention weights to obtain a feature map with attention. It should be noted that, in this embodiment, the target detection model is Strong-Weak Faster R-CNN.
In this embodiment, the feature extraction network of the target detection model includes a first feature extraction network and a second feature extraction network, and a sampling receptive field corresponding to the first feature extraction network is smaller than a sampling receptive field corresponding to the second feature extraction network. Correspondingly, the image-level domain classifier comprises a first image-level domain classifier corresponding to the first feature extraction network and a second image-level domain classifier corresponding to the second feature extraction network.
The detection network further includes a first gradient reversal layer located between the feature extraction network and the image-level domain classifier, which is used to realize the adversarial training of the feature extraction network and the image-level domain classifier. Correspondingly, the detection network further includes a second gradient reversal layer located between the instance detection network and the target instance-level domain classifier, which is used to realize the adversarial training of the instance detection network and the target instance-level domain classifier. In this embodiment, the instance detection network includes the region proposal network and a fully connected layer.
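A gradient reversal layer is commonly implemented as an autograd function that acts as the identity in the forward pass and negates (and optionally scales) gradients in the backward pass. The following PyTorch sketch is one such implementation; the scaling factor `lambda_` is an assumption, since this embodiment does not specify one.

```python
# Minimal gradient reversal layer (a common realization of the adversarial
# training described above; details here are an assumption, not from the patent).
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the features.
        return -ctx.lambda_ * grad_output, None

def grl(x, lambda_=1.0):
    return GradientReversal.apply(x, lambda_)
```

In training, features pass through grl() before entering a domain classifier, so minimizing the classifier's loss drives the feature extractor toward domain confusion.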
When the detection network is trained, first, the source domain image and the target domain image are input in turn into the first feature extraction network, which performs feature extraction on them and outputs the first feature map. The first feature map is input, through the first gradient reversal layer, to the first image-level domain classifier, which performs domain classification on each pixel in the first feature map and outputs a first classification feature map with the same width and height as the first feature map. The value of each pixel in the first classification feature map represents the probability that the corresponding location in the first feature map belongs to the source domain.
Based on the first classification feature map, a local attention value can be determined for each region in the first feature map. The local attention value can be calculated by the following formula (1):

$$a_k^i = 1 - H\!\left(d_k^i\right) \tag{1}$$

where $a_k^i$ denotes the local attention value of region $k$ in image $i$, and $d_k^i$ denotes the probability, given by the first classification feature map, that region $k$ in image $i$ belongs to the source domain. $H(\cdot)$ is the entropy of the two-class domain prediction:

$$H\!\left(d_k^i\right) = -\sum_{j\in\{0,1\}} d_{k,j}^i \log d_{k,j}^i$$

where $j$ indexes the two domain classes: $d_{k,1}^i = d_k^i$ is the probability that region $k$ in image $i$ belongs to the source domain, and $d_{k,0}^i = 1 - d_k^i$ is the probability that it belongs to the target domain.
For a certain area in the sample image, if the first image-level domain classifier can distinguish whether it comes from the source domain or the target domain, the area carries features representative of the foreground of the source and target domains, and accordingly a larger attention value should be given to it. However, a wrong attention value may to some extent negatively affect the domain adaptation task; to reduce this influence, the robustness of the detection network to wrong attention values can be enhanced by a deep residual attention structure.
The weighting of the first feature map can be implemented based on the following formula (2):

$$\tilde{F}_k^i = \left(1 + a_k^i\right) F_k^i \tag{2}$$

where $F_k^i$ denotes region $k$ in the first feature map corresponding to image $i$, and $\tilde{F}_k^i$ denotes the updated feature obtained after weighting $F_k^i$; the residual term $1 + a_k^i$ realizes the deep residual attention structure described above.
After the first feature map is updated, the updated first feature map is input into the second feature extraction network, which performs feature extraction on it and outputs a high-level feature map. The updated first feature map is also input, through the first gradient reversal layer, to the second image-level domain classifier, which performs domain classification on each pixel and outputs a second classification feature map with the same width and height as the updated first feature map. The value of each pixel in the second classification feature map represents the probability that the corresponding location belongs to the source domain.
Based on the second classification feature map, a global attention value can be determined for each pixel in the high-level feature map. The high-level feature map is then updated based on the global attention value of each pixel to obtain the second feature map. It should be noted that the processing used to update the high-level feature map is the same as that used to update the first feature map and is not repeated here.
It should be noted that, since the network structures of the second image-level domain classifier and the first image-level domain classifier are different, the size of the classification feature map output by the second image-level domain classifier may not match that of the high-level feature map, in which case the high-level feature map cannot be weighted directly. Therefore, an upsampling layer can be added to the second image-level domain classifier so that its output matches the size of the high-level feature map for the subsequent weighting operation.
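A minimal sketch of such a size-matching step follows; bilinear interpolation is an assumption, as this embodiment does not specify the interpolation mode.

```python
# Sketch (assumption): upsample the second classifier's output map so it
# matches the high-level feature map's spatial size before weighting.
import torch.nn.functional as F

def match_size(cls_map, feat_map):
    """cls_map: (N, 1, h, w) classifier output; feat_map: (N, C, H, W)."""
    return F.interpolate(cls_map, size=feat_map.shape[-2:],
                         mode="bilinear", align_corners=False)
```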
After the second feature map is obtained, it is input into the region proposal network of the target detection model, which ranks the proposals and screens out a number of candidate areas with the highest scores; after the RoI-Align operation, the third feature map is obtained, in which the position areas of the instance features corresponding to the target instance category label are marked. After the third feature map passes through the fully connected layer of the target detection model, the category classification results corresponding to the instances are output.
During training, the image-level domain classifier can mix the sample image features from the source domain and the target domain, but to further improve the detection performance of the detection model, the feature distribution of local instances needs further attention.
Instances from different domains must not be wrongly aligned across instance classes, which would cause false or missed detections and in turn degrade model performance. In this embodiment, that is, in the crown detection scenario, if foreground crown features in the source domain were aligned with background features in the target domain, false or missed detections would result, further degrading model performance.
Therefore, in this embodiment there may be a plurality of instance-level domain classifiers. Specifically, taking the sample image being a crown image in the crown detection scene as an example, the instance categories in the crown image include a foreground instance category and a background instance category, so two instance-level domain classifiers may be provided: one corresponding to the foreground instance category and one corresponding to the background instance category.
Before the instance-level features are input into the instance-level domain classifiers, the instance category labels corresponding to the sample image need to be determined. Specifically, if the sample image is a source domain image, its instance category labels are used directly; if the sample image is a target domain image, the category classification results of the instances are used as its instance category labels.
Then, the third feature map is input into the target instance-level domain classifier corresponding to the target instance category label to obtain the domain category prediction result output by the target instance-level domain classifier.
And finally, training the detection network according to the domain class prediction result output by the image-level domain classifier, the domain class prediction result output by the target instance-level domain classifier and the class classification result of each instance output by the instance detection network to obtain a target detection model for performing instance detection on the to-be-detected image of the target domain.
In this embodiment, the image-level alignment loss corresponding to the image-level domain classifier is determined according to the domain category prediction results output by the image-level domain classifier and the domain category labels corresponding to the sample images. The instance-level alignment loss corresponding to the target instance-level domain classifier is determined according to the domain category prediction results output by the target instance-level domain classifier and the domain category labels corresponding to the instances. The entropy regularization loss and the detection loss corresponding to the instance detection network are determined according to the category classification results of the instances output by the instance detection network and the instance category labels corresponding to the source domain images. The detection network is trained based on the image-level alignment loss, the instance-level alignment loss, the entropy regularization loss, and the detection loss.
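Combining the four terms can be sketched as below; the trade-off weights `lambda_*` are assumptions, as this embodiment does not fix their values.

```python
# Hedged sketch of the overall training objective (weights are assumptions).
def detection_network_loss(loss_det, loss_loc, loss_global, loss_ins_w,
                           loss_ent, lambda_img=1.0, lambda_ins=1.0,
                           lambda_ent=0.1):
    """Combine detection, alignment, and entropy terms into one objective."""
    return (loss_det
            + lambda_img * (loss_loc + loss_global)  # image-level alignment
            + lambda_ins * loss_ins_w                # instance-level alignment
            + lambda_ent * loss_ent)                 # entropy regularization
```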
Specifically, the image-level alignment loss includes a first image-level alignment loss corresponding to the first image-level-domain classifier and a second image-level alignment loss corresponding to the second image-level-domain classifier.
The first image-level alignment loss can be calculated by the following formula (3):

$$L_{loc}\left(F_l, D_l\right) = \frac{1}{2}\left(L_{locs} + L_{loct}\right) \tag{3}$$

where $L_{loc}(F_l, D_l)$ denotes the first image-level alignment loss, $L_{locs}$ denotes the first image-level alignment loss corresponding to the source domain, $L_{loct}$ denotes the first image-level alignment loss corresponding to the target domain, $F_l$ denotes the first feature extraction network, and $D_l$ denotes the first image-level domain classifier.

The first image-level alignment loss corresponding to the source domain can be calculated by the following formula (4):

$$L_{locs} = \frac{1}{n_s W H} \sum_{i=1}^{n_s} \sum_{w=1}^{W} \sum_{h=1}^{H} D_l\!\left(F_l\left(x_i^s\right)\right)_{wh}^{2} \tag{4}$$

where $n_s$ denotes the number of source domain images, $W$ denotes the width of the first feature map, $H$ denotes the height of the first feature map, and $x_i^s$ denotes image $i$ of the source domain.

The first image-level alignment loss corresponding to the target domain can be calculated by the following formula (5):

$$L_{loct} = \frac{1}{n_t W H} \sum_{i=1}^{n_t} \sum_{w=1}^{W} \sum_{h=1}^{H} \left(1 - D_l\!\left(F_l\left(x_i^t\right)\right)_{wh}\right)^{2} \tag{5}$$

where $n_t$ denotes the number of target domain images and $x_i^t$ denotes image $i$ of the target domain.
The second image-level alignment loss can be calculated by the following formula (6):

$$L_{global}\left(F_g, D_g\right) = \frac{1}{2}\left(L_{globals} + L_{globalt}\right) \tag{6}$$

where $L_{global}(F_g, D_g)$ denotes the second image-level alignment loss, $L_{globals}$ denotes the second image-level alignment loss corresponding to the source domain, $L_{globalt}$ denotes the second image-level alignment loss corresponding to the target domain, $F_g$ denotes the second feature extraction network, and $D_g$ denotes the second image-level domain classifier.

The second image-level alignment loss corresponding to the source domain can be calculated by the following formula (7):

$$L_{globals} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \left(1 - D_g\!\left(F_g\left(\tilde{x}_i^s\right)\right)\right)^{\gamma} \log D_g\!\left(F_g\left(\tilde{x}_i^s\right)\right) \tag{7}$$

where $n_s$ denotes the number of source domain images, $\tilde{x}_i^s$ denotes the updated first feature map corresponding to image $i$ of the source domain, and $\gamma$ ($\gamma \geq 0$) denotes the focusing parameter.

The second image-level alignment loss corresponding to the target domain can be calculated by the following formula (8):

$$L_{globalt} = -\frac{1}{n_t} \sum_{i=1}^{n_t} D_g\!\left(F_g\left(\tilde{x}_i^t\right)\right)^{\gamma} \log\left(1 - D_g\!\left(F_g\left(\tilde{x}_i^t\right)\right)\right) \tag{8}$$

where $n_t$ denotes the number of target domain images and $\tilde{x}_i^t$ denotes the updated first feature map corresponding to image $i$ of the target domain.
In addition, the goal of adversarial domain adaptation is to align the sample features of the source domain and the target domain so that the domain classifier cannot correctly judge the domain category of the current features. To achieve this, the model needs to pay attention to hard-to-confuse samples that lie far from the domain classification boundary and try to pull them close to the boundary. Therefore, the domain category prediction results output by the target instance-level domain classifier are used to capture the samples that are difficult to confuse and to weight them so that they receive larger weights. It should be noted that the larger the domain classification probability, the harder the sample is to confuse; otherwise, the sample is easy to confuse.
The weight value for the instance-level alignment loss can be calculated by the following formula (9):

$$W_{i,j,c} = -p \log\left(1 - p\right) \tag{9}$$

where $W_{i,j,c}$ denotes the weight value corresponding to instance feature $m_{(i,j)}$, the $j$-th instance feature in image $i$; $p$ denotes the probability assigned by the class-$c$ instance-level domain classifier to the true domain of the instance, i.e. $p = d\,p_{i,j,c} + (1 - d)\left(1 - p_{i,j,c}\right)$, where $d$ denotes the domain label of the sample image (1 denotes the source domain label, 0 denotes the target domain label) and $p_{i,j,c}$ denotes the probability that instance feature $m_{(i,j)}$ belongs to the source domain.
The weighted instance-level alignment loss can be calculated by the following formula (10):

$$L_{multi\text{-}ins\text{-}w} = L_{inss\text{-}w} + L_{inst\text{-}w} \tag{10}$$

where $L_{multi\text{-}ins\text{-}w}$ denotes the instance-level alignment loss, $L_{inss\text{-}w}$ denotes the instance-level alignment loss corresponding to the source domain, and $L_{inst\text{-}w}$ denotes the instance-level alignment loss corresponding to the target domain.
The instance-level alignment loss corresponding to the source domain can be calculated by the following formula (11):

$$L_{inss\text{-}w} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \sum_{j} W_{i,j,c} \log p_{i,j,c} \tag{11}$$

where $n_s$ denotes the number of source domain images; $c$ denotes the category classification result corresponding to instance $j$ of source domain image $i$; $p_{i,j,c}$ denotes the probability that instance feature $m_{(i,j)}$ (the $j$-th instance feature in image $i$) belongs to the source domain; and $W_{i,j,c}$ denotes the weight computed from the domain classification probability, output by the instance-level domain classifier corresponding to class $c$, of the $j$-th instance in image $i$.
The instance-level alignment loss corresponding to the target domain can be calculated by the following formula (12):

$$L_{inst\text{-}w} = -\frac{1}{n_t} \sum_{i=1}^{n_t} \sum_{j} W_{i,j,c} \log\left(1 - p_{i,j,c}\right) \tag{12}$$

where $n_t$ denotes the number of target domain images; $c$ denotes the category classification result corresponding to instance $j$ of target domain image $i$; and $1 - p_{i,j,c}$ denotes the probability that instance feature $m_{(i,j)}$ belongs to the target domain.
In this embodiment, the entropy regularization loss may be determined according to the category classification results corresponding to the target domain images output by the instance detection network, and the detection loss may be determined according to the category classification results corresponding to the source domain images output by the instance detection network and the instance category labels corresponding to the source domain images. It should be noted that the detection loss includes the classification loss determined from the classification results corresponding to the source domain images output by the instance detection network, the regression loss determined from the regression parameters and the ground-truth bounding box coordinates of the instance features in the source domain images, and the classification and regression losses produced by the region proposal network.
The entropy regularization loss can be calculated by the following formula (13):

$$L_{ent} = -\frac{1}{n_t} \sum_{i=1}^{n_t} \sum_{j} \sum_{c} \hat{p}_{i,j,c} \log \hat{p}_{i,j,c} \tag{13}$$

where $\hat{p}_{i,j,c}$ denotes the predicted probability that instance $j$ detected in target domain image $i$ belongs to category $c$.
In this embodiment, the detection loss corresponding to the instance detection network may be calculated by referring to detection losses in the related art, which is not described here again.
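A sketch of the entropy term in formula (13) on the target-domain class predictions follows; the softmax-over-logits input and the averaging scheme are assumptions.

```python
# Sketch of the entropy regularization loss (13); PyTorch.
import torch
import torch.nn.functional as F

def entropy_regularization(class_logits, eps=1e-6):
    """class_logits: (M, C) classification logits for instances detected
    in target-domain images."""
    probs = F.softmax(class_logits, dim=1).clamp_min(eps)
    return -(probs * probs.log()).sum(dim=1).mean()
```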
In the embodiment of the present invention, first, after the first feature map of the sample image is extracted through the feature extraction network, image-level alignment between the source domain image and the target domain image is performed through the image-level domain classifier to obtain the second feature map, so that transferable regions in the first feature map are highlighted and negative transfer from each region is suppressed. Then, instance-level features in the second feature map are extracted through the instance detection network to obtain the third feature map, and instance-level alignment is performed on the third feature map through the instance-level domain classifiers, so that incorrect alignment is avoided by means of the instance category labels of the instances, further improving the image alignment effect. Finally, the detection network is trained based on the results output by the target detection model, the image-level domain classifier, and the instance-level domain classifiers, thereby improving the image detection effect of the target detection model.
The following examples are provided to illustrate the image detection method of the present invention.
Fig. 3 is a flowchart of an image detection method according to an embodiment of the present invention. The method comprises the following steps:
s301, obtaining an image to be detected and a target detection model.
S302, detecting the image to be detected through the target detection model to obtain the instance areas and category classification results corresponding to the instances in the image to be detected. The target detection model is obtained by training with the image detection model training method in the foregoing embodiments.
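Assuming the torchvision-style detector from the earlier sketch, inference can look as follows; the image path is hypothetical.

```python
# Hedged usage sketch for the detection step (torchvision-style API assumed;
# `model` is the trained target detection model from the earlier sketch).
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

model.eval()
image = to_tensor(Image.open("crown.jpg").convert("RGB"))
with torch.no_grad():
    prediction = model([image])[0]
boxes = prediction["boxes"]    # instance areas (bounding boxes)
labels = prediction["labels"]  # category classification results
scores = prediction["scores"]  # detection confidences
```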
Fig. 4 is a schematic structural diagram of an image detection model training apparatus according to an embodiment of the present invention, as shown in fig. 4, the apparatus is applied to a detection network, and the detection network includes a target detection model, an image-level domain classifier, and instance-level domain classifiers corresponding to different instance classes. The device includes: an acquisition module 401, an extraction module 402, an update module 403, a processing module 404, a determination module 405, a generation module 406, and a training module 407.
The obtaining module 401 is configured to obtain a sample image and a target detection model to be trained, where the sample image includes a source domain image and a target domain image, the source domain image has an instance category label, the target domain image does not have the instance category label, both the source domain image and the target domain image have a domain category label, the target detection model is obtained by using the source domain image for training in advance, and the target detection model includes a feature extraction network and an instance detection network.
An extracting module 402, configured to perform feature extraction on the sample image through a feature extraction network to obtain a first feature map.
An updating module 403, configured to input the first feature map into the image-level domain classifier to obtain a domain category prediction result output by the image-level domain classifier, determine an attention weight of each region in the first feature map based on the domain category prediction result, and update the first feature map according to the attention weight of each region in the first feature map to obtain a second feature map.
And the processing module 404 is configured to input the second feature map into the instance detection network to obtain a location area and a category classification result of each instance.
The determining module 405 is configured to determine the instance category label corresponding to the sample image according to the category classification result of each instance and the domain category label corresponding to the sample image.
A generating module 406, configured to generate a third feature map according to the second feature map, where a position area corresponding to each instance of the target instance category label is marked in the third feature map, and the target instance category label is any one of the instance category labels corresponding to the sample image.
The processing module 404 is further configured to input the third feature map into the target instance-level domain classifier corresponding to the target instance category label, so as to obtain the domain category prediction result output by the target instance-level domain classifier.
The training module 407 is configured to train the detection network according to the domain category prediction result output by the image-level domain classifier, the domain category prediction result output by the target instance-level domain classifier, and the category classification result and the regression parameter of each instance output by the instance detection network, so as to obtain a target detection model for performing instance detection on the to-be-detected image of the target domain.
According to the embodiment of the present invention, if the sample image is a source domain image, the instance category labels of the source domain image are used directly; if the sample image is a target domain image, the category classification results of the instances are used as the instance category labels of the target domain image.
According to the embodiment of the invention, the feature extraction network comprises a first feature extraction network and a second feature extraction network, wherein the sampling receptive field corresponding to the first feature extraction network is smaller than the sampling receptive field corresponding to the second feature extraction network; the image-level domain classifier includes a first image-level domain classifier corresponding to the first feature extraction network and a second image-level domain classifier corresponding to the second feature extraction network.
According to the embodiment of the invention, the detection network further comprises a first gradient inversion layer located between the feature extraction network and the image-level domain classifier, and the first gradient inversion layer is used for realizing adversarial training of the feature extraction network and the image-level domain classifier; the detection network further comprises a second gradient inversion layer located between the instance detection network and the target instance-level domain classifier, and the second gradient inversion layer is used for realizing adversarial training of the instance detection network and the target instance-level domain classifier.
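A gradient inversion layer (usually called a gradient reversal layer in the domain-adaptation literature) is an identity map in the forward pass that negates and scales gradients in the backward pass, so the upstream network learns to confuse the domain classifier. A standard PyTorch sketch; the scaling coefficient `lambd` is a tunable assumption.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)          # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # negate (and scale) the gradient flowing back to the features
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# usage: domain_logits = domain_classifier(grad_reverse(feature_map))
```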
According to the embodiment of the present invention, the training module 407 is further configured to determine, according to the domain category prediction result output by the image-level domain classifier and the domain category labels respectively corresponding to the sample images, an image-level alignment loss corresponding to the image-level domain classifier; determine an instance-level alignment loss corresponding to the target instance-level domain classifier according to the domain category prediction result output by the target instance-level domain classifier and the domain category labels respectively corresponding to the instances; determine an entropy regularization loss and a detection loss corresponding to the instance detection network according to the category classification result of each instance output by the instance detection network and the instance category label corresponding to the source domain image; and train the detection network based on the image-level alignment loss, the instance-level alignment loss, the entropy regularization loss, and the detection loss.
According to the embodiment of the present invention, the training module 407 is further configured to determine the entropy regularization loss according to the category classification result corresponding to the target domain image output by the instance detection network, and to determine the detection loss according to the category classification result corresponding to the source domain image output by the instance detection network and the instance category label corresponding to the source domain image.
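Putting the four terms together, a hedged sketch of the overall training objective follows. The loss weights, the binary-cross-entropy form of the alignment losses, and the restriction of the detection loss to classification (omitting box regression) are simplifying assumptions for illustration only.

```python
import torch.nn.functional as F

def total_loss(img_dom_logits, img_dom_labels,
               inst_dom_logits, inst_dom_labels,
               tgt_cls_probs, src_cls_logits, src_cls_labels,
               w_img=1.0, w_inst=1.0, w_ent=0.1, eps=1e-6):
    # image-level alignment loss: binary domain classification
    # (labels are float tensors with the same shape as the logits)
    l_img = F.binary_cross_entropy_with_logits(img_dom_logits, img_dom_labels)
    # instance-level alignment loss for the target instance category
    l_inst = F.binary_cross_entropy_with_logits(inst_dom_logits, inst_dom_labels)
    # entropy regularization on target-domain class probabilities (N, K)
    l_ent = -(tgt_cls_probs * (tgt_cls_probs + eps).log()).sum(dim=1).mean()
    # supervised detection (classification) loss on source-domain instances
    l_det = F.cross_entropy(src_cls_logits, src_cls_labels)
    return l_det + w_img * l_img + w_inst * l_inst + w_ent * l_ent
```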
In one possible design, the structure of the image detection model training apparatus shown in fig. 4 may be implemented as an electronic device. As shown in fig. 5, the electronic device 500 may include: a processor 501 and a memory 502. The memory 502 stores executable code, and when the executable code is executed by the processor 501, the processor 501 can at least implement the image detection model training method provided in the embodiment shown in fig. 1.
The electronic device 500 may further include a communication interface 503 for communicating with other devices.
Fig. 6 is a schematic structural diagram of another electronic device according to an embodiment of the present invention, and as shown in fig. 6, the electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls overall operation of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or part of steps S101 to S108 of the method described above. Further, the processing component 602 may include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operations at the electronic device 600. Examples of such data include instructions for any application or method operating on the electronic device 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type or combination of volatile and non-volatile storage devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 606 provides power to the various components of the electronic device 600. The power component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 600.
The multimedia component 608 includes a screen that provides an output interface between the electronic device 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 600 is in an operation mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The input/output interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the electronic device 600. For example, the sensor component 614 may detect an open/closed state of the electronic device 600 and the relative positioning of components, such as the display and keypad of the electronic device 600. The sensor component 614 may also detect a change in position of the electronic device 600 or a component of the electronic device 600, the presence or absence of user contact with the electronic device 600, the orientation or acceleration/deceleration of the electronic device 600, and a change in temperature of the electronic device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate communications between the electronic device 600 and other devices in a wired or wireless manner. The electronic device 600 may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, or 4G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 600 may be implemented by one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 604 comprising instructions, is also provided; the instructions are executable by the processor 620 of the electronic device 600 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disks, or optical disks.
In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium, on which executable code is stored, and when the executable code is executed by a processor of an electronic device, the processor is caused to execute the image detection model training method provided in the foregoing embodiment shown in fig. 1.
The above-described apparatus embodiments are merely illustrative, wherein the various modules illustrated as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments can be implemented by means of a necessary general-purpose hardware platform, or by a combination of hardware and software. Based on this understanding, the above technical solutions, or the parts thereof that contribute to the prior art, may be embodied in the form of a computer program product carried on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An image detection model training method is applied to a detection network, wherein the detection network comprises a target detection model, an image-level domain classifier and instance-level domain classifiers corresponding to different instance classes, and the method comprises the following steps:
obtaining a sample image and the target detection model to be trained, wherein the sample image comprises a source domain image and a target domain image, the source domain image has an instance category label, the target domain image does not have the instance category label, both the source domain image and the target domain image have the domain category label, the target detection model is obtained by using the source domain image for training in advance, and the target detection model comprises a feature extraction network and an instance detection network;
performing feature extraction on the sample image through the feature extraction network to obtain a first feature map;
inputting the first feature map into the image-level domain classifier to obtain a domain category prediction result output by the image-level domain classifier, determining attention weights of regions in the first feature map based on the domain category prediction result, and updating the first feature map according to the attention weights of the regions in the first feature map to obtain a second feature map;
inputting the second feature map into the instance detection network to obtain the position area and the category classification result of each instance;
determining an instance category label corresponding to the sample image according to the category classification result of each instance and the domain category label corresponding to the sample image;
generating a third feature map according to the second feature map, wherein position areas corresponding to the instances of a target instance category label are marked in the third feature map, and the target instance category label is any one of the instance category labels corresponding to the sample image;
inputting the third feature map into a target instance-level domain classifier corresponding to the target instance category label to obtain a domain category prediction result output by the target instance-level domain classifier;
and training the detection network according to the domain category prediction result output by the image-level domain classifier, the domain category prediction result output by the target instance-level domain classifier, and the category classification result and the regression parameter of each instance output by the instance detection network, so as to obtain a target detection model for performing instance detection on the image to be detected in the target domain.
2. The method according to claim 1, wherein the determining the instance category label corresponding to the sample image according to the category classification result of each instance and the domain category label corresponding to the sample image comprises:
if the sample image is a source domain image, directly adopting the instance category label of the source domain image;
and if the sample image is a target domain image, adopting the category classification result of each instance as the instance category label of the target domain image.
3. The method of claim 1, wherein the feature extraction network comprises a first feature extraction network and a second feature extraction network, and wherein a sampling receptive field corresponding to the first feature extraction network is smaller than a sampling receptive field corresponding to the second feature extraction network;
the image-level domain classifier includes a first image-level domain classifier corresponding to the first feature extraction network and a second image-level domain classifier corresponding to the second feature extraction network.
4. The method of claim 1, wherein the detection network further comprises a first gradient inversion layer located between the feature extraction network and the image-level domain classifier, the first gradient inversion layer being used for realizing adversarial training of the feature extraction network and the image-level domain classifier;
the detection network further comprises a second gradient inversion layer located between the instance detection network and the target instance-level domain classifier, the second gradient inversion layer being used for realizing adversarial training of the instance detection network and the target instance-level domain classifier.
5. The method of claim 1, wherein the training the detection network according to the domain category prediction result output by the image-level domain classifier, the domain category prediction result output by the target instance-level domain classifier, and the category classification result of each instance output by the instance detection network comprises:
determining an image-level alignment loss corresponding to the image-level domain classifier according to the domain category prediction result output by the image-level domain classifier and the domain category labels respectively corresponding to the sample images;
determining an instance-level alignment loss corresponding to the target instance-level domain classifier according to the domain category prediction result output by the target instance-level domain classifier and the domain category labels respectively corresponding to the instances; and
determining an entropy regularization loss and a detection loss corresponding to the instance detection network according to the category classification result of each instance output by the instance detection network and the instance category label corresponding to the source domain image;
training the detection network based on the image-level alignment loss, the instance-level alignment loss, the entropy regularization loss, and the detection loss.
6. The method according to claim 5, wherein the determining the entropy regularization loss and the detection loss corresponding to the instance detection network according to the category classification result of each instance output by the instance detection network and the instance category label corresponding to the source domain image comprises:
determining the entropy regularization loss according to the category classification result corresponding to the target domain image output by the instance detection network;
and determining the detection loss according to the category classification result corresponding to the source domain image output by the instance detection network and the instance category label corresponding to the source domain image.
7. An image detection method, comprising:
acquiring an image to be detected and a target detection model;
detecting the image to be detected through the target detection model to obtain instance areas and category classification results corresponding to the instances in the image to be detected;
wherein the target detection model is obtained by training by using the image detection model training method of any one of claims 1 to 6.
8. An image detection model training device, applied to a detection network including a target detection model, an image-level domain classifier and instance-level domain classifiers corresponding to different instance classes, the device comprising:
an obtaining module, configured to obtain a sample image and the target detection model to be trained, where the sample image includes a source domain image and a target domain image, the source domain image has an instance category label, the target domain image does not have an instance category label, both the source domain image and the target domain image have a domain category label, the target detection model is obtained by using the source domain image for training in advance, and the target detection model includes a feature extraction network and an instance detection network;
the extraction module is used for performing feature extraction on the sample image through the feature extraction network to obtain a first feature map;
the updating module is used for inputting the first feature map into the image-level domain classifier to obtain a domain category prediction result output by the image-level domain classifier, determining the attention weight of each region in the first feature map based on the domain category prediction result, and updating the first feature map according to the attention weight of each region in the first feature map to obtain a second feature map;
the processing module is used for inputting the second feature map into the instance detection network to obtain the position area and the category classification result of each instance;
a determining module, configured to determine an instance category label corresponding to the sample image according to the category classification result of each instance and the domain category label corresponding to the sample image;
a generating module, configured to generate a third feature map according to the second feature map, where a position area corresponding to each instance of a target instance category label is marked in the third feature map, and the target instance category label is any one of the instance category labels corresponding to the sample image;
the processing module is further configured to input the third feature map into a target instance-level domain classifier corresponding to the target instance category label, so as to obtain a domain category prediction result output by the target instance-level domain classifier;
and the training module is used for training the detection network according to the domain category prediction result output by the image-level domain classifier, the domain category prediction result output by the target instance-level domain classifier, and the category classification result and the regression parameter of each instance output by the instance detection network so as to obtain a target detection model for performing instance detection on the image to be detected in the target domain.
9. An electronic device, comprising: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the image detection model training method of any one of claims 1 to 6.
10. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the image detection model training method of any one of claims 1 to 6.
CN202210806731.XA 2022-07-08 2022-07-08 Image detection model training method and device and image detection method Pending CN115331097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210806731.XA CN115331097A (en) 2022-07-08 2022-07-08 Image detection model training method and device and image detection method

Publications (1)

Publication Number Publication Date
CN115331097A (en) 2022-11-11

Family

ID=83918066

Country Status (1)

Country Link
CN (1) CN115331097A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444807A (en) * 2020-03-19 2020-07-24 北京迈格威科技有限公司 Target detection method, device, electronic equipment and computer readable medium
CN111444807B (en) * 2020-03-19 2023-09-22 北京迈格威科技有限公司 Target detection method, device, electronic equipment and computer readable medium

Similar Documents

Publication Publication Date Title
TWI766286B (en) Image processing method and image processing device, electronic device and computer-readable storage medium
CN108121952B (en) Face key point positioning method, device, equipment and storage medium
CN108629354B (en) Target detection method and device
US10007841B2 (en) Human face recognition method, apparatus and terminal
WO2023087741A1 (en) Defect detection method and apparatus, and electronic device, storage medium and computer program product
CN109446961B (en) Gesture detection method, device, equipment and storage medium
CN113792207B (en) Cross-modal retrieval method based on multi-level feature representation alignment
CN110443366B (en) Neural network optimization method and device, and target detection method and device
TW202109449A (en) Image processing method and device, electronic equipment and storage medium
CN109961094B (en) Sample acquisition method and device, electronic equipment and readable storage medium
EP3923202A1 (en) Method and device for data processing, and storage medium
CN110532956B (en) Image processing method and device, electronic equipment and storage medium
US11868521B2 (en) Method and device for determining gaze position of user, storage medium, and electronic apparatus
CN112287994A (en) Pseudo label processing method, device, equipment and computer readable storage medium
CN110619350A (en) Image detection method, device and storage medium
CN112906484B (en) Video frame processing method and device, electronic equipment and storage medium
CN111523599B (en) Target detection method and device, electronic equipment and storage medium
CN112508974A (en) Training method and device of image segmentation model, electronic equipment and storage medium
CN111814538A (en) Target object type identification method and device, electronic equipment and storage medium
CN116863286A (en) Double-flow target detection method and model building method thereof
CN111027617A (en) Neural network training and image recognition method, device, equipment and storage medium
CN115909127A (en) Training method of abnormal video recognition model, abnormal video recognition method and device
CN115331097A (en) Image detection model training method and device and image detection method
CN111275089B (en) Classification model training method and device and storage medium
CN112884040A (en) Training sample data optimization method and system, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination