CN115035409A - Weak supervision remote sensing image target detection algorithm based on similarity comparison learning - Google Patents

Weak supervision remote sensing image target detection algorithm based on similarity comparison learning

Info

Publication number
CN115035409A
CN115035409A (application CN202210698556.7A; granted as CN115035409B)
Authority
CN
China
Prior art keywords
candidate frame
similarity
candidate
remote sensing
cluster
Prior art date
Legal status
Granted
Application number
CN202210698556.7A
Other languages
Chinese (zh)
Other versions
CN115035409B (en)
Inventor
Zhang Haopeng (张浩鹏)
Tan Zhiwen (谭智文)
Jiang Zhiguo (姜志国)
Xie Fengying (谢凤英)
Zhao Danpei (赵丹培)
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202210698556.7A
Publication of CN115035409A
Application granted
Publication of CN115035409B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/70 Arrangements using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G06V 10/82 Arrangements using neural networks
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection


Abstract

The invention relates to a weakly supervised remote sensing image target detection algorithm based on similarity contrastive learning. A feature extraction neural network first produces a similarity matrix, which is combined with candidate frame detection scores to form similarity candidate frame clusters, from which merging candidate frames are generated. A contrast loss for the merging candidate frames is then computed from the similarity candidate frame clusters; the merging candidate frames are input into the feature extraction neural network to obtain a multi-example loss and a refining loss, and the network is updated and trained with all three losses. Finally, the trained feature extraction neural network detects the test image: the merging candidate frames of the test image are obtained and input into the weakly supervised detection framework to produce the detection result. The disclosed algorithm generates a new candidate frame set with a more uniform size distribution and a reduced proportion of small candidate frames, and obtains discriminative candidate frame features, thereby improving the algorithm's ability to identify semantically rich candidate frames.

Description

Weak supervision remote sensing image target detection algorithm based on similarity comparison learning
Technical Field
The invention relates to the technical field of digital image processing, and in particular to a weakly supervised remote sensing image target detection algorithm based on similarity contrastive learning.
Background
Target detection is the task of finding targets of interest in an image from image features and determining their positions and categories. With the development of deep learning, target detection with convolutional neural networks trained on annotated image sets has matured. However, mainstream target detection methods such as Faster R-CNN, YOLOv3, and SSD require target-level labels: for every image in the training set, the specific position and size of each target must be given. Today remote sensing data is growing explosively, and targets in remote sensing images tend to be densely distributed and arbitrarily oriented. Obtaining target-level annotations of remote sensing images is therefore extremely time-consuming and labor-intensive.
To address the difficulty of obtaining target-level labels, researchers have proposed and developed target detection algorithms based on weakly supervised learning. Unlike traditional deep learning detectors trained with target-level labels, weakly supervised target detection uses image-level labels: the training set only states which categories of targets exist in an image, without giving their positions or sizes, as shown in fig. 2. At test time, a weakly supervised detector can still predict the location and size of targets of the categories of interest. For ever-growing remote sensing image datasets, weakly supervised target detection, which avoids fine-grained target-level labeling, therefore has strong application prospects.
However, most existing weakly supervised target detection algorithms are developed and tested on natural image datasets, and when transferred to remote sensing datasets they suffer from a severe problem: the small-frame domination phenomenon, as in fig. 3. There are two main reasons. First, mainstream weakly supervised detectors begin by extracting a large number of candidate frames from the input image with a candidate frame extraction algorithm; compared with natural images, remote sensing images have more complex backgrounds, and remote sensing targets have sharp textures and complex structures, so the extraction algorithm produces many small candidate frames. These small frames dominate the parameter updates during training, so at test time many small frames are detected as targets instead of large, information-rich detection frames that contain the whole target. Second, because of the complex background of remote sensing images, the feature extraction part of the weakly supervised detection framework struggles to learn discriminative feature representations, and under heavy background noise the algorithm has difficulty correctly predicting large, semantically rich candidate frames.
Therefore, a weakly supervised target detection algorithm that overcomes the above drawbacks is urgently needed by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a weakly supervised remote sensing image target detection algorithm based on similarity contrastive learning, which updates and trains the original feature extraction part through a similarity merging candidate frame generation step and a contrastive learning step, thereby generating a new candidate frame set in which the proportion of small candidate frames is greatly reduced, so that more complete targets are detected.
The concrete scheme is as follows:
a weakly supervised remote sensing image target detection algorithm based on similarity contrastive learning comprises the following steps.
S11, extracting candidate frame features of the initial candidate frames of a training image with the feature extraction neural network, and calculating similarity by the cosine similarity criterion to obtain a similarity matrix;
S12, obtaining candidate frame detection scores from the MIL network branch;
S13, determining the center candidate frame indexes from the similarity matrix and the candidate frame detection scores to obtain similarity candidate frame clusters;
S14, generating merging candidate frames from the similarity candidate frame clusters;
S15, obtaining positive and negative sample sets from the similarity candidate frame clusters, and calculating the loss of each similarity candidate frame cluster with the candidate-frame-based contrast loss function according to the similarity scores between the candidate frames in the positive and negative sample sets and the center candidate frame;
S16, calculating the contrast loss of the merging candidate frames from the losses of the similarity candidate frame clusters;
S17, inputting the merging candidate frames into the feature extraction neural network, and obtaining their multi-example loss and refining loss through the MIL network branch and the refining network branch;
S18, combining the contrast loss, the multi-example loss, and the refining loss, and updating and training the feature extraction neural network by gradient back-propagation;
and S19, detecting the test image with the trained feature extraction neural network: executing steps S11 to S14 to obtain the merging candidate frames of the test image, inputting them into the feature extraction neural network, and passing them sequentially through the MIL network branch and the refining network branch to obtain the detection result.
Preferably, in S11, obtaining the similarity matrix includes:
calculating the similarity between candidate frames according to the cosine similarity formula

    sim(p_i, p_j) = (p_i · p_j) / (||p_i|| ||p_j||)

where p_i and p_j are the features of the i-th and j-th candidate frames;
obtaining the similarity matrix M_F, where M_F ∈ R^{m×m}, m is the total number of candidate frames, and the element in row i, column j of M_F is

    M_F^{ij} = sim(p_i, p_j).
Preferably, in S13, obtaining the similarity candidate frame clusters includes the following steps:
step one, setting every candidate frame to the available state;
step two, according to the candidate frame detection scores obtained by the MIL model, finding the available candidate frame with the highest score, defining it as the center candidate frame, and recording its index as Center_j,

    Center_j = argmax_{i ∈ available} q_i^c

step three, extracting column Center_j of the similarity matrix and recording it as F_j;
step four, finding the position indexes of the elements of F_j that are higher than a threshold, which together with the index of the center candidate frame form the similarity candidate frame cluster C_j;
step five, setting every candidate frame involved in the cluster to the unavailable state;
step six, repeating steps two to five on the candidate frames that remain available to obtain new similarity candidate frame clusters, until all candidate frames are unavailable or the upper limit on the number of iterations is reached.
Preferably, in S14, according to the position and size information of all candidate frames in the similarity candidate frame cluster, the minimum bounding rectangle is calculated as the new merging candidate frame, with coordinates recorded as [x_1^new, y_1^new, x_2^new, y_2^new].
Preferably, in S15, obtaining the positive and negative sample sets includes:
selecting the candidate frame corresponding to any index in the similarity candidate frame cluster as the positive sample, denoted pos_j;
the position indexes of the elements of F_j that are lower than the threshold form the negative sample index set; N_j indexes are selected from it as the negative sample set, denoted

    C_j^neg = {neg_j^1, ..., neg_j^{N_j}}.
preferably, the loss of the similarity candidate frame cluster is calculated by the following formula:
Figure BDA0003703059930000042
wherein, delta is a hyper-parameter,
Figure BDA0003703059930000043
the xth element of the Fj vector represents the cosine similarity score between the jth candidate frame feature and the xth candidate frame feature, where x is pos j Or neg j i
Preferably, the contrast loss is obtained by the following formula:

    L_contrast = (1/K) Σ_{j=1}^{K} L_j

where K is the number of similarity candidate frame clusters.
According to the technical scheme, compared with the prior art, in the disclosed algorithm the similarity candidate frame generation network uses the similarity criterion and a specially designed candidate frame generation procedure to obtain new candidate frames with a more uniform size distribution; the candidate-frame-based contrastive learning network constructs a contrast sample set and, through the candidate-frame-based contrast loss function, strengthens the feature expression ability of the feature extraction part of the weakly supervised detection framework. This improves the algorithm's ability to identify semantically rich candidate frames and in turn improves the generation quality of the similarity candidate frame generation module. With the disclosed algorithm, a new candidate frame set with a more uniform size distribution and a greatly reduced proportion of small candidate frames can be generated, so that more complete targets are detected and the detection effect is greatly improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a diagram of an overall framework of a weak supervised remote sensing image target detection algorithm based on similarity comparison learning, provided by the invention;
FIG. 2 is a diagram illustrating the difference between image-level labeling and target-level labeling provided by the present invention;
FIG. 3 is a diagram illustrating an example of a small frame dominance phenomenon provided by the present invention;
fig. 4 is a diagram showing a comparison between a detection result of the detection algorithm of the present invention and a detection result of the existing weak supervision detection algorithm.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a weakly supervised remote sensing image target detection algorithm based on similarity contrastive learning, which builds a similarity merging candidate frame generation network and a contrastive learning network on top of the original feature extraction part and the MIL branch, effectively solving the small-frame domination problem that arises when weakly supervised target detection methods are applied to remote sensing images.
Specifically, the proposed similarity candidate frame generation network obtains new candidate frames with a more balanced size distribution and richer semantics, effectively mitigating the excess of small noise frames produced when candidate frames are generated for remote sensing images;
the candidate-frame-based contrastive learning network improves the feature expression ability of the feature extraction part of the weakly supervised target detection framework, further improving the detection performance of the algorithm.
Meanwhile, the algorithm disclosed by the invention can be applied to the existing weak supervision target detection algorithm as a plug-in, and the detection performance of the algorithm can be improved without destroying the original frame. In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Specifically, as shown in fig. 1, the training phase, i.e., the similarity contrastive learning process, proceeds as follows.
First, the candidate frame features of the initial candidate frames of a training image are extracted by the feature extraction neural network, and similarity is calculated with the cosine similarity criterion to obtain a similarity matrix.
The similarity matrix is generated by computing the similarity between candidate frame features: for each candidate frame, the feature extraction part produces a feature p, and the cosine similarity criterion is applied as follows. The similarity score between the i-th and j-th candidate frames is the cosine of the angle between their feature vectors:
    sim(p_i, p_j) = (p_i · p_j) / (||p_i|| ||p_j||)

where p_i and p_j are the features of the i-th and j-th candidate frames;
further, the similarity matrix, denoted M_F, is obtained, where M_F ∈ R^{m×m}, m is the total number of candidate frames, and the element in row i, column j of M_F is

    M_F^{ij} = sim(p_i, p_j);
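The similarity-matrix step can be sketched with a few lines of NumPy; `similarity_matrix` is an illustrative name (the patent does not name its implementation):

```python
import numpy as np

def similarity_matrix(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pairwise cosine similarity M_F between candidate frame features.

    features: (m, d) array, one d-dimensional feature p_i per candidate frame.
    Returns M_F in R^{m x m} with M_F[i, j] = cos(p_i, p_j).
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)  # unit-normalize each row
    return unit @ unit.T                      # dot products of unit vectors are cosines
```

Normalizing once and taking a single matrix product gives all m × m cosine scores at once, which matters because m (the number of candidate frames) is typically in the thousands.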
secondly, obtaining a candidate frame detection score according to the MIL network branch, determining a center candidate frame index according to the similarity matrix and the candidate frame detection score to obtain a similarity candidate frame cluster,
the part designs a similarity candidate frame cluster generating algorithm (SPC) to obtain a series of similarity candidate frame clusters by using the similarity matrix obtained in the last part and the detection score of each candidate frame obtained by the MIL model.
Let the candidate frame be b, first all the candidate frames are taken from b 0 To b m And sequentially indexing. Each candidate frame b i A corresponding score vector q can be obtained in the MIL branch i . For the category c (indicating that the category target exists in the image) of which the value is 1 in the true value image level label corresponding to the input image, executing the following steps to obtain a plurality of final similarity candidate frame clusters;
the method comprises the following specific steps:
step one, setting every candidate frame to the available state;
step two, according to the candidate frame detection scores obtained by the MIL branch, finding the available candidate frame with the highest score for category c, defining it as the center candidate frame, and recording its index as Center_j,

    Center_j = argmax_{i ∈ available} q_i^c

step three, for the center candidate frame, extracting column Center_j of the similarity matrix and recording it as F_j;
step four, finding the position indexes of all elements of F_j that are higher than a threshold, which together with the index of the center candidate frame form the similarity candidate frame cluster C_j;
step five, setting all candidate frames involved in the cluster to the unavailable state;
step six, repeating steps two to five on the candidate frames that remain available to obtain new similarity candidate frame clusters, until all candidate frames are unavailable or the upper limit on the number of iterations is reached;
candidate frame clusters are generated according to the above steps for every category whose ground-truth label is 1, and combined to obtain the required similarity candidate frame cluster set C = {C_1, ..., C_K}, where K is the total number of iterations, i.e., the number of candidate frame clusters.
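The six steps for one category can be sketched as below; `threshold` and `max_iters` stand for the patent's unstated threshold and iteration limit, and their values here are illustrative assumptions:

```python
import numpy as np

def spc_clusters(M_F: np.ndarray, scores: np.ndarray,
                 threshold: float = 0.8, max_iters: int = 20) -> list:
    """Similarity candidate frame cluster generation (SPC) for one category c.

    M_F: (m, m) cosine similarity matrix; scores: (m,) MIL detection scores
    q_i^c for category c. Returns a list of clusters, each a sorted list of
    candidate frame indexes.
    """
    m = len(scores)
    available = np.ones(m, dtype=bool)            # step one: all frames available
    clusters = []
    for _ in range(max_iters):                    # step six: iteration upper limit
        if not available.any():
            break
        masked = np.where(available, scores, -np.inf)
        center = int(np.argmax(masked))           # step two: center candidate frame
        F_j = M_F[:, center]                      # step three: its similarity column
        members = np.where(available & (F_j > threshold))[0]  # step four
        cluster = sorted(set(members.tolist()) | {center})
        clusters.append(cluster)
        available[cluster] = False                # step five: mark unavailable
    return clusters
```

Running this once per positive category and concatenating the outputs yields the cluster set C = {C_1, ..., C_K} described above.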
Further, the merging candidate frames are generated from the similarity candidate frame clusters. Concretely, the minimum bounding rectangle is calculated from the position and size information of all candidate frames in a similarity candidate frame cluster and taken as the new candidate frame, with coordinates [x_1^new, y_1^new, x_2^new, y_2^new].
Specifically, for each similarity candidate frame cluster C_j, the position and size information of the candidate frames corresponding to its indexes is retrieved. The candidate frame extraction algorithm yields the position and size of every candidate frame of each image, namely the top-left and bottom-right corner coordinates of each frame, in the form [x_1, y_1, x_2, y_2], where x and y are the coordinate values of the point on the respective axes.
For cluster C_j, the new merging candidate frame is simply the minimum bounding rectangle of the candidate frames corresponding to all indexes in the cluster, and its coordinates are recorded as the information of the j-th new candidate frame.
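The minimum-bounding-rectangle merge for one cluster then reduces to coordinate-wise min/max; `merge_cluster` is an illustrative name:

```python
def merge_cluster(boxes: list, cluster: list) -> list:
    """Minimum bounding rectangle of the candidate frames in one cluster C_j.

    boxes: per-frame corner coordinates [x1, y1, x2, y2] (top-left, bottom-right);
    cluster: list of frame indexes belonging to C_j. Returns the merged frame
    [x1_new, y1_new, x2_new, y2_new].
    """
    xs1, ys1, xs2, ys2 = zip(*(boxes[i] for i in cluster))
    return [min(xs1), min(ys1), max(xs2), max(ys2)]
```

Because the merged frame encloses every member of the cluster, it is never smaller than any of the small frames it replaces, which is what shifts the candidate set toward larger, more complete frames.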
Positive and negative sample sets are obtained from the similarity candidate frame clusters. As is clear from the generation of the similarity merging candidate frames, each similarity candidate frame cluster C_j contains the index Center_j of its center candidate frame and the indexes of several candidate frames whose similarity with the center exceeds the threshold.
The candidate frame corresponding to any index in the cluster, i.e., a randomly chosen index in C_j, is selected as the positive sample and denoted pos_j.
The negative sample set is obtained similarly: for the center candidate frame, column Center_j of M_F is extracted and denoted F_j.
The position indexes of the elements of F_j that are lower than the threshold form the negative sample index set; N_j indexes are selected from it, and the set they comprise is recorded as the negative sample set

    C_j^neg = {neg_j^1, ..., neg_j^{N_j}}

where |C_j^neg| = N_j denotes the number of elements in the set.
then, according to the similarity score between the candidate frames in the positive and negative sample sets and the center candidate frame, calculating the loss of each similarity candidate frame cluster by using a comparison loss function based on the candidate frames;
the corresponding calculation formula is:
Figure BDA0003703059930000073
wherein, delta is a hyper-parameter,
Figure BDA0003703059930000081
the xth element of the Fj vector represents the cosine similarity score between the jth candidate box feature and the xth candidate box feature, where x is pos j Or neg j i . The right part of the above expression is an expression of the comparison loss function based on the candidate frame, and each candidate is obtained through the function expressionContrast loss for boxed clusters.
The contrast loss is then calculated from the losses of the similarity candidate frame clusters as

    L_contrast = (1/K) Σ_{j=1}^{K} L_j

where K is the number of similarity candidate frame clusters.
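The per-cluster and total contrast losses can be sketched as below; since the loss formula is not legible in this copy of the patent, the margin-based form used here is an assumption consistent with the surrounding description (hyper-parameter δ, positive pulled toward the center, negatives pushed away):

```python
import numpy as np

def cluster_contrast_loss(F_j: np.ndarray, pos_idx: int,
                          neg_idxs: list, delta: float = 0.2) -> float:
    """Contrast loss L_j for one cluster (assumed margin form).

    F_j[x] is the cosine similarity between the cluster's center frame and
    frame x; delta is the loss hyper-parameter. Each sampled negative is
    pushed at least `delta` below the positive's similarity score.
    """
    return float(np.mean([max(0.0, delta - F_j[pos_idx] + F_j[n])
                          for n in neg_idxs]))

def contrast_loss(cluster_losses: list) -> float:
    """Total contrast loss: the mean of L_j over the K clusters."""
    return float(np.mean(cluster_losses))
```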
The merging candidate frames are input into the feature extraction neural network, and their multi-example loss and refining loss are obtained through the MIL network branch and the refining network branch; the feature extraction neural network is then updated and trained by gradient back-propagation together with the contrast loss.
Specifically, the merging candidate frames are fed into the weakly supervised target detection framework in the same manner as the original candidate frames, yielding a loss for the merging candidate frames; the feature extraction part of the neural network is then gradient-updated with this loss together with the original candidate frame loss and the contrast loss, so that the network attends more to candidate frame regions with rich semantic features and the small-frame domination phenomenon is reduced.
And finally, entering a testing stage:
The test image is detected with the trained feature extraction neural network: steps S11 to S14 are executed to obtain the merging candidate frames of the test image, which are input into the feature extraction neural network and passed sequentially through the MIL network branch and the refining network branch to obtain the detection result.
Guided by this loss function, the feature extraction ability of the feature extraction part of the weakly supervised target detection framework is effectively improved, the gap between the features of target-related candidate frames and those of background noise becomes larger, and the generation quality of the similarity merging candidate frames is in turn further improved.
Further, the present application applies the weakly supervised remote sensing image target detection algorithm based on similarity contrastive learning to detect targets of the categories of interest in remote sensing images. In the experimental part, two public remote sensing image datasets were used: HRSC2016 and NWPU VHR-10. HRSC2016 contains 1061 remote sensing ship images with sizes varying from 300 × 300 to 1500 × 900; the dataset covers four categories (aircraft carriers, commercial ships, attack ships, and civil ships) and comprises 436 training images, 181 validation images, and 444 test images. NWPU VHR-10 comprises 650 remote sensing images of various sizes covering 10 target categories (airplanes, ships, storage tanks, basketball courts, tennis courts, baseball diamonds, ground track fields, harbors, bridges, and vehicles), with 455 training images and 195 test images.
The evaluation indexes used in the experiments were mAP and CorLoc, both following the PASCAL VOC standard. mAP is measured on the test set: the higher the value, the better the detection result. CorLoc is measured on the training set: higher values indicate better localization during training.
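For reference, the PASCAL-VOC-style average precision underlying the mAP columns can be sketched as below (an all-point interpolation sketch; the patent does not state which VOC variant it uses):

```python
import numpy as np

def voc_ap(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """PASCAL-VOC-style average precision (all-point interpolation).

    recalls / precisions: recall values and matching precisions computed from
    a score-ranked detection list. mAP is the mean of this AP over classes.
    """
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    for i in range(len(p) - 2, -1, -1):       # make precision non-increasing
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]        # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```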
Table 1 compares the performance of the disclosed method with other weakly supervised target detection algorithms on the HRSC2016 dataset.

TABLE 1
Method    mAP      CorLoc
WSDDN     1.46     8.35
PCL       6.14     17.98
OICR      15.67    19.31
Ours      39.79    55.29
As can be seen from the table, the algorithm greatly improves detection performance and transfers well to the remote sensing image domain. Meanwhile, fig. 4 shows a comparison between the detection results of the disclosed algorithm and those of existing weakly supervised detection algorithms, where (a) is WSDDN, (b) is OICR, (c) is PCL, and (d) is the result of the disclosed algorithm; the disclosed algorithm effectively resolves the small-frame domination problem, detects more complete targets, and greatly improves the detection effect.
Table 2 compares the results of the disclosed algorithm with other algorithms on the NWPU VHR-10 dataset.

TABLE 2
Method    mAP      CorLoc
OICR      10.83    14.63
PCL       12.42    18.80
Ours      33.80    52.32
Clearly, the algorithm also outperforms other weakly supervised target detection algorithms on a multi-class (10-class) remote sensing dataset.
The above results show that the algorithm effectively solves the small-frame domination problem that traditional weakly supervised target detection algorithms exhibit on remote sensing images, greatly improves the detection effect and metrics, and fully demonstrates both the effectiveness of the algorithm and the application value of weakly supervised target detection in the remote sensing field.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A weak supervision remote sensing image target detection algorithm based on similarity comparison learning, characterized by comprising the following steps:
S11, extracting candidate frame features of the initial candidate frames of the training image with a feature extraction neural network, and computing similarities using the cosine similarity criterion to obtain a similarity matrix;
S12, obtaining candidate frame detection scores from the MIL network branch;
S13, determining a center candidate frame index from the similarity matrix and the candidate frame detection scores to obtain similarity candidate frame clusters;
S14, generating merged candidate frames from the similarity candidate frame clusters;
S15, obtaining positive and negative sample sets from the similarity candidate frame clusters, and calculating the loss of each similarity candidate frame cluster with a candidate-frame-based contrastive loss function, according to the similarity scores between the candidate frames in the positive and negative sample sets and the center candidate frame;
S16, calculating the contrastive loss of the merged candidate frames from the losses of the similarity candidate frame clusters;
S17, inputting the merged candidate frames into the feature extraction neural network, and obtaining the multiple-instance loss and the refinement loss of the merged candidate frames through the MIL network branch and the refinement network branch;
S18, combining the contrastive loss, the multiple-instance loss and the refinement loss, and updating and training the feature extraction neural network through gradient back-propagation;
and S19, detecting the test image with the trained feature extraction neural network: executing steps S11 to S14 to obtain the merged candidate frames of the test image, inputting them into the feature extraction neural network, and passing them through the MIL network branch and the refinement network branch in turn to obtain the detection result.
2. The similarity comparison learning-based weakly supervised remote sensing image target detection algorithm according to claim 1, wherein in S11, obtaining the similarity matrix includes:
calculating the similarity between candidate frames from the candidate frame features according to the cosine similarity formula
M_F^{ij} = (p_i · p_j) / (‖p_i‖ ‖p_j‖),
where p_i and p_j denote the features of the i-th and j-th candidate frames,
thereby obtaining the similarity matrix M_F ∈ R^{m×m}, in which m is the total number of candidate frames and M_F^{ij} is the element in row i and column j of M_F.
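As a non-authoritative sketch, the cosine-similarity matrix construction described above can be written in plain Python (function and variable names are illustrative, not taken from the patent):

```python
import math

def cosine_similarity(p_i, p_j):
    # Cosine similarity between two candidate-frame feature vectors.
    dot = sum(a * b for a, b in zip(p_i, p_j))
    norm_i = math.sqrt(sum(a * a for a in p_i))
    norm_j = math.sqrt(sum(b * b for b in p_j))
    return dot / (norm_i * norm_j)

def similarity_matrix(features):
    # features: list of m candidate-frame feature vectors.
    # Returns the m x m matrix M_F with M_F[i][j] = cos(p_i, p_j).
    m = len(features)
    return [[cosine_similarity(features[i], features[j]) for j in range(m)]
            for i in range(m)]
```

For m candidate frames with d-dimensional features this costs O(m²·d); in practice the same matrix would be computed in one batched tensor operation.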
3. The similarity comparison learning-based weakly supervised remote sensing image target detection algorithm according to claim 1, wherein in S13, obtaining the similarity candidate frame clusters includes the following steps:
step one, setting all candidate frames to the available state;
step two, according to the candidate frame detection scores produced by the MIL model, finding the available candidate frame with the highest score, defining it as the center candidate frame, and recording its index as
Center_j = argmax_{i ∈ available} s_i,
where s_i is the detection score of the i-th candidate frame;
step three, extracting the Center_j-th column vector of the similarity matrix, denoted F_j;
step four, finding the indices of the elements of F_j higher than the set threshold τ, which together with the index of the center candidate frame form the similarity candidate frame cluster C_j;
step five, setting the candidate frames involved in these indices to the unavailable state;
and step six, repeating steps two to five on the candidate frames still in the available state to obtain new similarity candidate frame clusters, until all candidate frames are set unavailable or the upper limit on the number of iterations is reached.
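The greedy clustering loop of steps one to six can be sketched as follows (a hedged illustration: `tau`, `max_clusters`, and all names are assumed, and `scores` stands in for the MIL detection scores):

```python
def build_clusters(M_F, scores, tau=0.7, max_clusters=10):
    # Greedy clustering: repeatedly take the highest-scoring available
    # box as the cluster centre, then group every available box whose
    # similarity to the centre exceeds the threshold tau.
    m = len(scores)
    available = [True] * m
    clusters = []
    for _ in range(max_clusters):          # upper limit on iterations
        candidates = [i for i in range(m) if available[i]]
        if not candidates:                 # all boxes used up
            break
        center = max(candidates, key=lambda i: scores[i])
        F_j = [M_F[i][center] for i in range(m)]   # centre's column
        cluster = [i for i in candidates if i == center or F_j[i] > tau]
        for i in cluster:                  # mark cluster members unavailable
            available[i] = False
        clusters.append((center, cluster))
    return clusters
```

Each returned pair holds the centre index and the member indices of one similarity candidate frame cluster C_j.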
4. The weak supervised remote sensing image target detection algorithm based on similarity comparison learning according to claim 1, wherein in S14, according to the position and size information of all candidate frames in a similarity candidate frame cluster, the minimum enclosing rectangle is calculated as a new merged candidate frame, whose coordinates are recorded as [x_1^new, y_1^new, x_2^new, y_2^new].
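Because candidate frames are axis-aligned, the minimum enclosing rectangle of claim 4 reduces to coordinate-wise min/max; a minimal sketch with illustrative names:

```python
def merge_cluster_boxes(boxes):
    # Minimum enclosing rectangle of all boxes in one cluster.
    # Each box is (x1, y1, x2, y2): (x1, y1) top-left, (x2, y2) bottom-right.
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    return (x1, y1, x2, y2)
```

The merged box covers every member of the cluster, which is what counteracts the small-frame domination discussed above.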
5. The similarity comparison learning-based weakly supervised remote sensing image target detection algorithm according to claim 3, wherein in S15, obtaining the positive and negative sample sets includes:
selecting the candidate frame corresponding to any index in the similarity candidate frame cluster as the positive sample, recorded as pos_j;
the indices of the elements of F_j lower than the threshold form the negative sample index set, from which N_j indices are selected as the negative sample set, recorded as {neg_j^i}, i = 1, …, N_j.
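A hedged sketch of the positive/negative sampling of claim 5 (the names, the fixed seed, and the exclusion of cluster members from the negative pool are illustrative assumptions):

```python
import random

def sample_pos_neg(F_j, cluster, tau=0.7, n_neg=3, seed=0):
    # F_j: similarity of every box to the cluster centre (centre's column
    # of M_F). cluster: member indices of the similarity cluster.
    rng = random.Random(seed)
    pos = rng.choice(cluster)                      # any cluster member is positive
    neg_pool = [i for i, s in enumerate(F_j)       # boxes dissimilar to the centre
                if s < tau and i not in cluster]
    negs = rng.sample(neg_pool, min(n_neg, len(neg_pool)))
    return pos, negs
```

Here `n_neg` plays the role of N_j, the number of negatives drawn per cluster.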
6. The weak supervised remote sensing image target detection algorithm based on similarity comparison learning according to claim 5, wherein the loss of a similarity candidate frame cluster is calculated by the following formula:
L_j = −log[ exp(F_j^{pos_j}/δ) / ( exp(F_j^{pos_j}/δ) + Σ_{i=1..N_j} exp(F_j^{neg_j^i}/δ) ) ],
where δ is a hyper-parameter and F_j^x denotes the x-th element of the vector F_j, i.e. the cosine similarity score between the center candidate frame feature of the j-th cluster and the x-th candidate frame feature, with x being pos_j or neg_j^i.
7. The similarity comparison learning-based weakly supervised remote sensing image target detection algorithm according to claim 6, wherein the contrastive loss function is obtained by the following formula:
L_contrast = (1/K) Σ_{j=1..K} L_j,
where K is the number of similarity candidate frame clusters.
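One plausible reading of the per-cluster loss of claim 6 and the averaging of claim 7 is an InfoNCE-style contrastive loss with δ acting as a temperature; the exact formula appears only as an image in the original publication, so this sketch is an assumption, not the patent's definitive form:

```python
import math

def cluster_contrastive_loss(pos_sim, neg_sims, delta=0.2):
    # pos_sim: cosine similarity between centre and positive sample.
    # neg_sims: similarities between centre and the N_j negative samples.
    # delta: temperature hyper-parameter (assumed role).
    pos = math.exp(pos_sim / delta)
    neg = sum(math.exp(s / delta) for s in neg_sims)
    return -math.log(pos / (pos + neg))

def total_contrastive_loss(per_cluster_losses):
    # Claim 7: average the per-cluster losses over the K clusters.
    return sum(per_cluster_losses) / len(per_cluster_losses)
```

With this form, the loss shrinks toward zero as the positive similarity dominates the negatives, pulling cluster members together in feature space.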
CN202210698556.7A 2022-06-20 2022-06-20 Weak supervision remote sensing image target detection algorithm based on similarity comparison learning Active CN115035409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210698556.7A CN115035409B (en) 2022-06-20 2022-06-20 Weak supervision remote sensing image target detection algorithm based on similarity comparison learning


Publications (2)

Publication Number Publication Date
CN115035409A true CN115035409A (en) 2022-09-09
CN115035409B CN115035409B (en) 2024-05-28

Family

ID=83124106


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018137357A1 (en) * 2017-01-24 2018-08-02 北京大学 Target detection performance optimization method
CN109190636A (en) * 2018-07-30 2019-01-11 北京航空航天大学 A kind of remote sensing images Ship Target information extracting method
CN111275044A (en) * 2020-02-21 2020-06-12 西北工业大学 Weak supervision target detection method based on sample selection and self-adaptive hard case mining
CN112183414A (en) * 2020-09-29 2021-01-05 南京信息工程大学 Weak supervision remote sensing target detection method based on mixed hole convolution
WO2022062543A1 (en) * 2020-09-27 2022-03-31 上海商汤智能科技有限公司 Image processing method and apparatus, device and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant