CN115311502A - Remote sensing image small sample scene classification method based on multi-scale double-flow architecture


Info

Publication number
CN115311502A
Authority
CN
China
Prior art keywords
training
images
flow network
attention
double
Prior art date
Legal status
Pending
Application number
CN202211128397.3A
Other languages
Chinese (zh)
Inventor
李阳阳
陈茜
毛鹤亭
焦李成
尚荣华
李玲玲
马文萍
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202211128397.3A priority Critical patent/CN115311502A/en
Publication of CN115311502A publication Critical patent/CN115311502A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N 3/084 - Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent
    • G06V 10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774 - Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks

Abstract

The invention discloses a remote sensing image small sample scene classification method based on a multi-scale double-flow architecture, which mainly addresses two problems of the prior art: the loss of discriminative image information, and susceptibility to the complex backgrounds and severe object-scale variation of remote sensing images. The scheme is as follows: acquire a data set and preprocess it; randomly sample the preprocessed data set to generate support and query sets for training, verification and testing; construct a whole double-flow network consisting of a global flow network, a local flow network and a key area positioning module; define loss functions for the global and local flows, then train and verify the whole double-flow network to obtain the optimal network model; classify the test samples with the optimal network model to obtain the scene classification result. The method reduces the loss of discriminative information in remote sensing images, avoids the influence of complex backgrounds and severe object-scale changes on scene classification, and improves classification accuracy; it can be used for natural disaster detection, urban planning, environment monitoring and vegetation investigation.

Description

Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
Technical Field
The invention belongs to the technical field of remote sensing image recognition, and particularly relates to a remote sensing image small sample scene classification method that can be used for natural disaster detection, urban planning, environment monitoring, vegetation mapping and land cover analysis.
Background
Remote sensing is a detection technology for obtaining information about targets from a distance, and with its rapid development, remote sensing images play an increasingly important role in the military and civil fields. Scene classification assigns each remote sensing image to a scene category according to its content; it is an important means of understanding remote sensing images and has broad application prospects in natural disaster detection, urban planning, environment monitoring, vegetation mapping, land cover analysis and other fields.
Before deep learning attracted wide attention, scene classification models mainly relied on hand-crafted low-level and mid-level features extracted from images, together with encoding operations built on these features. In recent years, deep learning models have shown powerful learning capabilities, driven by the advent of large available data sets, advances in machine learning theory, and the growth of available computing resources. The convolutional neural network is one of the most mainstream deep learning models in image processing, and it is also the most widely used and best-performing network model in the field of scene classification.
However, remote sensing image scene classification based on deep learning faces two basic problems. First, such models depend on a large number of labeled training samples, yet acquiring manually labeled high-resolution remote sensing images is difficult and time-consuming; when the available labeled data are insufficient, a deep learning model risks overfitting, which degrades performance. Second, a deep neural network can classify test samples from scene classes seen during training with high accuracy, but it struggles to classify samples from classes unseen at training time.
In this context, research on small sample scene classification of remote sensing images is receiving wide attention. Small sample learning is a research direction inspired by humans' ability to learn quickly; it enables machine vision systems to learn new tasks rapidly from limited annotated data. Many existing small sample learning models focus on designing different architectures: metric-based methods seek an optimal metric space by designing different structures and metrics, while meta-learning-based methods guide the learning algorithm with a meta-learner so that the model can generalize quickly to new tasks.
Li et al. propose a deep small sample learning method for remote sensing scene classification in the paper "DLA-MatchNet for few-shot remote sensing scene classification". By designing an adaptive discriminative-learning matching network with an attention mechanism and feature-fusion scheme, the method combines channel and spatial attention modules with the feature network, improving feature representation capability. However, because each image is compressed into a compact image-level representation, most of the discriminative information is lost; especially when the number of training samples is small, this loss is difficult to recover and affects the final classification result.
Patent document 202111495585.5 proposes a remote sensing image small sample scene classification method based on a double prototype network. It designs two operations, prototype self-calibration and prototype mutual calibration, which make the prototypes more representative during training and more useful for subsequent prototype-based classification prediction. However, the method does not deeply extract the hierarchical features of remote sensing images and is easily disturbed by complex backgrounds, scale changes and similar information, which lowers classification accuracy.
Disclosure of Invention
The invention aims to provide a remote sensing image small sample scene classification method based on a multi-scale double-flow architecture aiming at the defects in the prior art, so as to reduce the loss of image discrimination information, avoid the influence of severe changes of complex background and object scale on scene classification and improve the classification precision.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) Acquiring three different remote sensing image data sets from public websites, and sequentially applying cropping, random horizontal flipping, random brightness enhancement, random color enhancement and random contrast enhancement to the images in the data sets;
(2) Randomly sampling the preprocessed data set to obtain a training support set $S_1$ and training query set $T_1$, a verification support set $S_2$ and verification query set $T_2$, and a test support set $S_3$ and test query set $T_3$;
(3) Constructing an integral double-flow network:
3a) Establishing a global flow network formed by connecting an attention depth embedding module A, a category-related attention module B and a measurement module C;
3b) Selecting an existing prototype network as the local flow network;
3c) Establishing a key area positioning module consisting of vector construction operation and greedy boundary search;
3d) The global flow network and the local flow network are connected through a key area positioning module to obtain an integral double-flow network;
(4) Using the training support set $S_1$ and training query set $T_1$ to train the whole double-flow network by the small sample scene training method, obtaining the trained double-flow network;
(5) Inputting the verification support set $S_2$ and verification query set $T_2$ into the trained double-flow network to fine-tune the network parameters, and saving the network with the highest index as the optimal double-flow network model;
(6) Inputting the test support set $S_3$ and test query set $T_3$ into the optimal double-flow network model to obtain the final classification result.
Compared with the prior art, the invention has one or more of the following technical effects:
1. By constructing a double-flow network, the invention computes the probability of the class a sample belongs to from both the whole image and its most important region, reducing the loss of image discrimination information and the influence of object-scale changes.
2. By designing the category-related attention module, the invention obtains an attention feature map related to the scene category, which increases the weight of discriminative descriptors during measurement and reduces the interference of background information on scene classification.
3. By designing the key area positioning module, the invention obtains the key region of an image, can quickly locate the most informative region in the global image, and connects the global flow and local flow so as to highlight the important objects that benefit scene classification.
The experimental result shows that compared with the existing other methods, the method has better scene classification precision.
Drawings
FIG. 1 is a general flow chart of an implementation of the present invention;
FIG. 2 is a sub-flowchart of the category-related attention module of the present invention computing the output attention feature map M;
FIG. 3 is a sub-flowchart of establishing a key zone location module according to the present invention.
Detailed Description
The following describes in detail specific embodiments and effects of the present invention with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of the present invention are as follows:
Step 1, acquiring a remote sensing image data set and carrying out data preprocessing on it:
Three different remote sensing image data sets are obtained from public websites, the images in the data sets are cropped to 224 × 224, and the cropped images are then sequentially preprocessed with random horizontal flipping, random brightness enhancement, random color enhancement and random contrast enhancement to obtain the preprocessed remote sensing images.
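As an illustration of this preprocessing pipeline, the following is a minimal sketch using torchvision-style transforms; whether the cutting is a random crop or a resize is not specified in the text, and the enhancement magnitudes shown are assumed placeholders.

```python
from torchvision import transforms

# A sketch of the step-1 preprocessing; RandomCrop and the jitter
# magnitudes (0.4) are assumptions, not values given in the patent.
preprocess = transforms.Compose([
    transforms.RandomCrop(224),               # crop images to 224 x 224
    transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flipping
    transforms.ColorJitter(brightness=0.4,    # random brightness enhancement
                           saturation=0.4,    # random color enhancement
                           contrast=0.4),     # random contrast enhancement
    transforms.ToTensor(),
])
```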
Step 2, randomly sampling the preprocessed data set to obtain support sets and query sets.
Randomly sample the preprocessed data set to obtain the training support set $S_1$ and training query set $T_1$, verification support set $S_2$ and verification query set $T_2$, and test support set $S_3$ and test query set $T_3$, concretely realized as follows (a sampling sketch is given after the steps):
2.1) The three preprocessed remote sensing image data sets are divided respectively: the NWPU-RESISC45 data set is divided into a training set, a verification set and a test set in the ratio 25:10:10; the WHU-RS19 data set in the ratio 9:5:5; and the UC-Merced data set in the ratio 10:6:5;
2.2) C categories are randomly selected from the training set of each data set and K images are randomly sampled from each category; these C × K images form the training support set $S_1$; at the same time, N images are randomly selected from each of the C categories' remaining images to form the training query set $T_1$;
2.3) C categories are randomly selected from the verification set of each data set and K images are randomly sampled from each category; these C × K images form the verification support set $S_2$; then N images are randomly selected from each of the C categories' remaining images to form the verification query set $T_2$;
2.4) C categories are randomly selected from the test set of each data set and K images are randomly sampled from each category; these C × K images form the test support set $S_3$; N images are randomly selected from each of the C categories' remaining images to form the test query set $T_3$.
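The sampling procedure above can be sketched as follows, assuming each split's images are grouped by class in a dictionary; `sample_episode` and its argument names are illustrative, not from the patent.

```python
import random

def sample_episode(images_by_class, C=5, K=1, N=15):
    """Sample one C-way K-shot episode: a support set of C*K images
    and a query set of N images per class from the remaining images."""
    classes = random.sample(list(images_by_class), C)
    support, query = [], []
    for label, cls in enumerate(classes):
        imgs = random.sample(images_by_class[cls], K + N)
        support += [(img, label) for img in imgs[:K]]   # K support images
        query += [(img, label) for img in imgs[K:]]     # N query images per class
    return support, query
```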
Step 3, constructing the whole double-flow network.
3.1) Establish the global flow network:
3.1.1) Build the attention depth embedding module A, formed by cascading a convolution layer, four convolution blocks, an average pooling layer and a 1 × 1 convolution layer with 128 channels. The convolution layer consists of 7 × 7 convolution filters with 64 channels and a 3 × 3 max-pooling operation; each convolution block consists of four 3 × 3 convolution filters, with a skip connection added after every two convolution filters; the numbers of channels of the first to fourth convolution blocks are 64, 128, 256 and 512 respectively (a structural sketch follows);
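Under the structure just described (a ResNet-18-style trunk whose second block output is later tapped for the category-related attention module B, per step 3.1.4), a minimal PyTorch sketch might look as follows; strides, padding and batch normalization are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection after every pair."""
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(cout)
        self.conv2 = nn.Conv2d(cout, cout, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(cout)
        self.skip = (nn.Identity() if stride == 1 and cin == cout else
                     nn.Sequential(nn.Conv2d(cin, cout, 1, stride, bias=False),
                                   nn.BatchNorm2d(cout)))

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.skip(x))

class EmbeddingA(nn.Module):
    """Attention depth embedding module A: 7x7 conv stem with max pooling,
    four blocks (64/128/256/512 channels), average pooling, 128-channel 1x1 conv."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, 2, 3, bias=False), nn.BatchNorm2d(64),
            nn.ReLU(), nn.MaxPool2d(3, 2, 1))
        self.block1 = nn.Sequential(BasicBlock(64, 64), BasicBlock(64, 64))
        self.block2 = nn.Sequential(BasicBlock(64, 128, 2), BasicBlock(128, 128))
        self.block3 = nn.Sequential(BasicBlock(128, 256, 2), BasicBlock(256, 256))
        self.block4 = nn.Sequential(BasicBlock(256, 512, 2), BasicBlock(512, 512))
        self.head = nn.Sequential(nn.AvgPool2d(2), nn.Conv2d(512, 128, 1))

    def forward(self, x):
        f2 = self.block2(self.block1(self.stem(x)))  # tap for module B (3.1.4)
        return f2, self.head(self.block4(self.block3(f2)))
```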
3.1.2) Create the category-related attention module B for computing the output attention feature map M:
Referring to fig. 2, the specific calculation of the category-related attention module B is as follows:
the input features are weighted by W g Get the compression characteristic f g (x j ) Pass weight of W k The global attention feature map f is obtained by the fully connected layer and the softmax function k (x j );
For compression characteristic f g (x j ) And a global attention feature map f k (x j ) Weighted summation is carried out, and the result is sequentially weighted as
Figure BDA0003849086020000041
The full connection layer, the ReLU activation function, a weight of
Figure BDA0003849086020000042
The full connection layer and the Sigmoid function to obtain a weight vector d i
Figure BDA0003849086020000043
Where δ denotes the ReLU activation function, σ denotes the Sigmoid activation function,
Figure BDA0003849086020000044
and
Figure BDA0003849086020000045
weights for both fully connected networks for scaling down and expanding the feature map dimensions, respectively, N representing the total number of feature map pixels, f g (x j )=W g ·x j Is a compression characteristic, f k (x j )=softmax(W k ·x j ) Is to compute an attention feature map along pixel point j,
Figure BDA0003849086020000046
representing a matrix multiplication;
vector d of weights i And after the sum of the product and the input feature point is added, nonlinear activation is carried out through a Sigmoid function, and the final output M (x) of the module is obtained:
M(x)=Sigmoid(∑d i f i )
wherein, f i Representing the characteristics of the input ith channel.
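The following is a minimal sketch of module B under the formulas above, treating $W_g$ and $W_k$ as 1 × 1 convolutions and assuming a channel-reduction ratio r for the two fully connected layers; both are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CategoryAttentionB(nn.Module):
    """Category-related attention module B: d = sigmoid(W2 relu(W1 z)) with
    z aggregating f_k(x_j) * f_g(x_j) over pixels; M(x) = sigmoid(sum_i d_i f_i)."""
    def __init__(self, channels=128, r=4):
        super().__init__()
        self.w_g = nn.Conv2d(channels, channels, 1, bias=False)  # f_g = W_g x
        self.w_k = nn.Conv2d(channels, 1, 1, bias=False)         # f_k = softmax(W_k x)
        self.fc1 = nn.Linear(channels, channels // r)            # W_1 reduces dims
        self.fc2 = nn.Linear(channels // r, channels)            # W_2 expands dims

    def forward(self, f):                                # f: (B, C, H, W)
        B, C, H, W = f.shape
        g = self.w_g(f).flatten(2)                       # (B, C, N) compression features
        k = torch.softmax(self.w_k(f).flatten(2), -1)    # (B, 1, N) pixel attention
        z = (g * k).sum(-1)                              # aggregate over N pixels -> (B, C)
        d = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # weight vector d_i
        return torch.sigmoid((d.view(B, C, 1, 1) * f).sum(1, keepdim=True))  # M: (B, 1, H, W)
```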
3.1.3) Build a measurement module C for computing the output similarity $\mathrm{sim}(q,c)$ between a given query image $q$ and class $c$, implemented as follows:
For each descriptor $x_i$, first find its $k$ nearest neighbors $\hat{x}_i^{1},\dots,\hat{x}_i^{k}$ in class $c$, then compute the similarity between $x_i$ and each $\hat{x}_i^{j}$;
The descriptor similarities are weighted and summed using the attention map to obtain the similarity between the given query image $q$ and class $c$:

$$\mathrm{sim}(q,c)=\sum_{i=1}^{m}M(x_i)\sum_{j=1}^{k}\cos\!\left(x_i,\hat{x}_i^{j}\right)$$

where $M(x_i)$ is the response value of the attention feature map at location $x_i$, $x_i$ is the $i$-th descriptor of $q$, $m$ is the total number of descriptors, $\hat{x}_i^{j}$ denotes the $j$-th nearest neighbor of $x_i$ in class $c$, and $\cos(\cdot,\cdot)$ is the cosine similarity between two vectors (other distance functions may also be used);
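A sketch of this measurement, assuming the descriptors are rows of feature matrices and cosine similarity is used:

```python
import torch
import torch.nn.functional as F

def class_similarity(q_desc, c_desc, attn, k=3):
    """sim(q, c): q_desc (m, d) query descriptors, c_desc (n_c, d) support
    descriptors of class c, attn (m,) attention responses M(x_i)."""
    q = F.normalize(q_desc, dim=1)
    s = F.normalize(c_desc, dim=1)
    cos = q @ s.t()                        # (m, n_c) cosine similarities
    topk = cos.topk(k, dim=1).values       # k nearest neighbors of each x_i
    return (attn * topk.sum(dim=1)).sum()  # attention-weighted sum over descriptors
```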
3.1.4) The input end of the category-related attention module B is connected to the output end of the second convolution block of the attention depth embedding module A; the output end of the category-related attention module B and the output end of the last 1 × 1 convolution layer of the attention depth embedding module A are then simultaneously connected to the measurement module C, whose output is the output of the global flow network.
3.2) Select the existing prototype network as the local flow network; it takes the key regions of the query and support images as input and outputs the probability of the class of the query image's key region, realized as follows:
3.2.1) The input support-set image key regions and query-set image key regions are fed into a ResNet-18 network for feature extraction, yielding the support sample features $f_\phi(x_i)$ and query sample features $f_\phi(x_q)$; the support sample features are used to compute the prototype representation $c_k$ of class $k$ in the support set:

$$c_k=\frac{1}{|S_k|}\sum_{(x,y)\in S_k}f_\phi(x)$$

where $S_k$ denotes the set of class-$k$ samples in the support set, $x$ denotes a sample in $S_k$ and $y$ its corresponding class; the prototype representation $c_k$ of class $k$ is thus the mean of all embedded features of that class in the support set;
3.2.2) Compute the distance between the embedded query feature $f_\phi(x_q)$ and each class-$k$ prototype representation $c_k$ to obtain the probability that the query image $x_q$ belongs to class $k$:

$$p\!\left(y=k\mid x_q\right)=\frac{\exp\!\big(-d\big(f_\phi(x_q),\,c_k\big)\big)}{\sum_{k'}\exp\!\big(-d\big(f_\phi(x_q),\,c_{k'}\big)\big)}$$

where $d(\cdot)$ represents a distance function and $c_{k'}$ the prototype representation of class $k'$.
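A minimal sketch of this prototype classification, assuming squared Euclidean distance for $d(\cdot)$ (the text leaves the distance function open):

```python
import torch

def prototype_probs(support_feat, support_lab, query_feat, num_classes):
    """support_feat (n_s, d), support_lab (n_s,), query_feat (n_q, d);
    returns (n_q, C) probabilities p(y = k | x_q)."""
    protos = torch.stack([support_feat[support_lab == c].mean(0)
                          for c in range(num_classes)])   # prototypes c_k
    dist = torch.cdist(query_feat, protos) ** 2           # d(f(x_q), c_k)
    return torch.softmax(-dist, dim=1)                    # softmax over negative distance
```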
3.3) Establish the key area positioning module to obtain the key region coordinates $B=[x_a,x_b,y_a,y_b]$ of the attention feature map M. Referring to fig. 3, the key area positioning module is constructed as follows:
3.3.1) Perform the vector construction operation: along the spatial height and width directions, the attention feature map is aggregated into two one-dimensional structure energy vectors:

$$V_w(j)=\sum_{h=1}^{H}M(h,j),\qquad V_h(i)=\sum_{w=1}^{W}M(i,w)$$

where $V_w$ is the energy vector aggregated along the width direction, $V_h$ is the energy vector aggregated along the height direction, $M$ represents the obtained attention feature map, $M(i,w)$ represents the value of the feature map $M$ at position $(i,w)$, $M(h,j)$ represents its value at position $(h,j)$, $H$ represents the total height, and $W$ represents the total width;
3.3.2) Perform a greedy boundary search on the one-dimensional energy vectors to locate the most important one-dimensional region, obtaining the coordinate points $B=[x_a,x_b,y_a,y_b]$ of the key region bounding box.
Taking the computation of the width boundary $[x_a,x_b]$ of the key region as an example, the greedy boundary search is implemented as follows:
First, initialize the width coordinates $x_1$ and $x_2$ of the feature map, and define the key region as the region of smallest area whose contained energy is not less than the fraction $E_{Tr}$ of the total energy, i.e. $E_{[x_1,x_2]}/E_{[0:W]}>E_{Tr}$, where $E_{Tr}$ is a hyper-parameter representing the energy ratio, $E_{[0:W]}=\sum_{j=0}^{W}V_w(j)$ represents the energy sum of all elements of the width vector $V_w$, and $E_{[x_1,x_2]}=\sum_{j=x_1}^{x_2}V_w(j)$ represents the energy sum of the region from width $x_1$ to $x_2$;
Next, iteratively adjust the boundary of $[x_1,x_2]$ so that its energy ratio $E_{[x_1,x_2]}/E_{[0:W]}$ converges near $E_{Tr}$: when the ratio is higher than $E_{Tr}$, the region $[x_1,x_2]$ is shrunk in the direction of slowest energy decrease until the ratio is no longer higher than $E_{Tr}$; when the ratio is lower than $E_{Tr}$, the region is enlarged in the direction of fastest energy increase until the ratio is no longer lower than $E_{Tr}$;
Then, map the boundary coordinates from the feature map to the input picture to obtain the width boundary coordinates $[x_a,x_b]$ of the key region on the input picture:

$$x_a=I_w\,x_1/W,\qquad x_b=I_w\,x_2/W$$

where $I_w$ represents the width of the input picture and $W$ the width of the feature map;
Finally, the height boundary coordinates $[y_a,y_b]$ of the key region of the input picture are obtained by the same calculation method used for the width boundary coordinates $[x_a,x_b]$.
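The width-boundary search can be sketched as follows; the initialization at the highest-energy column and the expand-only loop are simplifying assumptions (the patent's procedure also shrinks the window when the energy ratio overshoots $E_{Tr}$):

```python
import torch

def width_boundary(M, e_tr=0.7, img_w=224):
    """M: (H, W) attention feature map; returns [x_a, x_b] on the input image."""
    v = M.sum(dim=0)                  # width energy vector V_w
    total = v.sum()
    x1 = x2 = int(v.argmax())         # start at the highest-energy column (assumed)
    while v[x1:x2 + 1].sum() / total < e_tr:
        left = v[x1 - 1] if x1 > 0 else float("-inf")
        right = v[x2 + 1] if x2 < v.numel() - 1 else float("-inf")
        if left >= right:             # grow toward the faster energy increase
            x1 -= 1
        else:
            x2 += 1
    W = v.numel()                     # map feature-map coords to the input picture
    return [img_w * x1 / W, img_w * (x2 + 1) / W]
```

The height boundary $[y_a, y_b]$ follows by applying the same function to the transpose of M.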
3.4) The output end of the category-related attention module B of the global flow network is connected with the local flow network through the key area positioning module, obtaining the whole double-flow network.
Step 4, training the whole double-flow network with the training support set $S_1$ and training query set $T_1$ by the small sample scene training method to obtain the trained double-flow network.
4.1) Set the maximum number of training iterations to 300000 and the initial learning rate to 0.0001, decaying the learning rate every 100000 iterations;
4.2) Add an extra margin to the existing cosine loss function to construct the improved cosine loss function $L_s$:

$$L_s=-\frac{1}{N}\sum_{q=1}^{N}\log\frac{\exp\!\big(\mathrm{sim}(q,\hat{y}_q)/km-M\big)}{\exp\!\big(\mathrm{sim}(q,\hat{y}_q)/km-M\big)+\sum_{c_j\neq\hat{y}_q}\exp\!\big(\mathrm{sim}(q,c_j)/km\big)}$$

where $N$ represents the total number of samples in the query set, $\mathrm{sim}(q,\hat{y}_q)$ is the similarity computed by the global flow network between query image $q$ and its true class $\hat{y}_q$, $\mathrm{sim}(q,c_j)$ is the similarity computed by the global flow network between query image $q$ and class $c_j$, $M$ is the added extra margin hyper-parameter, $k$ is the number of nearest neighbors, and $m$ is the number of descriptors of query image $q$;
4.3) Add the improved cosine loss function $L_s$ and the existing center loss function $L_c$ to form the loss function $L_g$ of the global flow network:

$$L_g=L_s+L_c,\qquad L_c=\frac{1}{2}\sum_{i=1}^{m}\big\|f(s_i)-c_{y_i}\big\|_2^2$$

where $c_{y_i}$ represents the class center of the global features of support sample $s_i$, and $m$ is the size of each scene episode; within each episode, the class centers are computed by averaging the global features of the corresponding support class;
4.4) Compute the negative log-probability loss function $L_l$ from the true labels and predicted probability distributions of the images, and set it as the local flow network loss function:

$$L_l=-\frac{1}{N}\sum_{q=1}^{N}\sum_{k=1}^{C}y_{q,k}\log p\!\left(y=k\mid x_q\right)$$

where $N$ represents the total number of samples in the query set and $C$ represents the total number of categories in the query set.
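A sketch of the three loss terms under the reconstructions above; the normalization of the similarities by k·m and the way the margin is applied to the true-class logit are assumptions consistent with the listed variables, and the descriptor count m_desc=196 is an illustrative placeholder.

```python
import torch
import torch.nn.functional as F

def cosine_margin_loss(sims, labels, k=3, m_desc=196, margin=0.01):
    """Improved cosine loss L_s; sims: (N, C) similarities from module C."""
    logits = sims / (k * m_desc)                # scale sums of k*m cosines to [-1, 1]
    logits = logits - margin * F.one_hot(labels, sims.size(1)).float()
    return F.cross_entropy(logits, labels)      # softmax cross-entropy with margin

def center_loss(global_feats, labels, centers):
    """Center loss L_c; centers: per-class means of support global features."""
    return 0.5 * ((global_feats - centers[labels]) ** 2).sum(1).mean()

def local_loss(probs, labels):
    """Negative log-probability loss L_l for the local flow network."""
    return F.nll_loss(torch.log(probs + 1e-8), labels)
```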
4.5) Input the training support set $S_1$ images and training query set $T_1$ images into the double-flow network in batches, and compute the value of the loss function from the image prediction class probabilities output by the global flow network and the local flow network;
4.6) Back-propagate the loss value using the Adam algorithm and adjust the network parameters;
4.7) Repeat steps 4.5)-4.6) until the preset maximum number of training iterations is reached, obtaining the trained double-flow network.
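Putting the pieces together, an episodic training loop might look as follows; `dual_stream`, its `centers` attribute, `train_images_by_class` and the decay factor 0.5 are hypothetical names and assumptions for illustration, built on the sketches given earlier.

```python
import torch

optimizer = torch.optim.Adam(dual_stream.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100000, gamma=0.5)

for it in range(300000):
    support, query = sample_episode(train_images_by_class, C=5, K=1, N=15)
    sims, global_feats, local_probs, labels = dual_stream(support, query)
    loss = (cosine_margin_loss(sims, labels)
            + center_loss(global_feats, labels, dual_stream.centers)
            + local_loss(local_probs, labels))
    optimizer.zero_grad()
    loss.backward()       # back-propagate the summed global and local losses
    optimizer.step()
    scheduler.step()      # decay the learning rate every 100000 iterations
```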
Step 5, verifying the double-flow network:
Input the verification support set $S_2$ and verification query set $T_2$ into the trained double-flow network for fine-tuning of the network parameters, and save the network with the highest index as the optimal double-flow network model; this operation is repeated 600 times.
Step 6, testing the double-flow network:
Input the test support set $S_3$ and test query set $T_3$ into the optimal double-flow network model, output the probabilities that each test sample belongs to the different categories, and take the category with the highest probability as the final scene classification result, completing the classification task.
The effects of the present invention are further illustrated by the following simulation experiments:
1. simulation experiment conditions are as follows:
1. operating platform configuration
The simulation platform is a desktop computer with an Intel(R) Core(TM) i7-7800X CPU and 32 GB of memory running Ubuntu 18.04; the neural networks are built with Python 3.6 and PyTorch 1.4 and accelerated with an NVIDIA RTX 2080Ti GPU and CUDA 10.0.
2. Simulation data set
The NWPU-RESISC45 dataset contains 45 scene categories, each category having 700 RGB images of 256 × 256;
the WHU-RS19 dataset contains 19 scene categories, 1005 RGB images of 600 × 600 in total;
The UC-Merced dataset contains 21 scene classes, each with 100 RGB images of 256 × 256.
3. Simulation parameter setting
The simulation experiments use the Adam optimizer with an initial learning rate of 0.0001 and 300000 training iterations, decaying the learning rate every 100000 iterations; the number k of nearest neighbors searched in the measurement module is set to 3, the hyper-parameter M in the improved cosine loss function to 0.01, and the hyper-parameter $E_{Tr}$ of the key area positioning module to 70%.
A small sample scenario is usually expressed as a C-way K-shot problem according to the numbers of categories and samples in the support and query sets; this embodiment uses the two most common settings, 5-way 1-shot with N = 15 and 5-way 5-shot with N = 10.
2. Emulated content
Simulation 1: scene classification under the 5-way 1-shot small sample setting on the large-scale public remote sensing data sets NWPU-RESISC45, WHU-RS19 and UC Merced, using the proposed method and the existing MatchingNet, DLA-MatchNet and DN4 methods; the classification accuracy of each method was computed, with the results shown in Table 1.
TABLE 1 Classification precision comparison of the present invention and the existing method in a 5-way 1-shot small sample scene of three datasets
Method                    NWPU-RESISC45      UC Merced          WHU-RS19
Existing MatchingNet      54.46% ± 0.77%     46.16% ± 0.71%     60.60% ± 0.68%
Existing DLA-MatchNet     68.80% ± 0.70%     53.76% ± 0.62%     68.27% ± 1.83%
Existing DN4              66.39% ± 0.86%     57.25% ± 1.01%     82.14% ± 0.80%
Method of the invention   73.84% ± 0.80%     68.12% ± 0.81%     87.34% ± 0.62%
Simulation 2: scene classification under the 5-way 5-shot small sample setting on the large-scale public remote sensing data sets NWPU-RESISC45, WHU-RS19 and UC Merced, using the proposed method and the existing MatchingNet, DLA-MatchNet and DN4 methods; the classification accuracy of each method was computed, with the results shown in Table 2.
TABLE 2 Classification accuracy comparison of the present invention and the prior methods in the 5-way 5-shot small sample setting on three datasets
Method                    NWPU-RESISC45      UC Merced          WHU-RS19
Existing MatchingNet      67.87% ± 0.59%     66.73% ± 0.56%     82.99% ± 0.40%
Existing DLA-MatchNet     81.63% ± 0.46%     63.01% ± 0.51%     79.89% ± 0.33%
Existing DN4              83.24% ± 0.87%     79.74% ± 0.78%     96.02% ± 0.33%
Method of the invention   87.86% ± 0.51%     88.57% ± 0.52%     98.25% ± 0.15%
From the experimental results in Tables 1 and 2, the accuracy of the proposed method is the highest on all three data sets for both the 5-way 1-shot and 5-way 5-shot tasks. This indicates that its classification performance is the best: it reduces the loss of image discrimination information, avoids the influence of complex backgrounds and severe object-scale changes on scene classification, and effectively improves the classification accuracy of remote sensing image scenes in small sample settings, which has important practical value in real scenarios with few labeled samples.

Claims (7)

1. A remote sensing image small sample scene classification method based on a multi-scale double-flow architecture is characterized by comprising the following steps:
(1) Acquiring three different remote sensing image data sets from public websites, and sequentially applying cropping, random horizontal flipping, random brightness enhancement, random color enhancement and random contrast enhancement to the images in the data sets;
(2) Randomly sampling the preprocessed data set to obtain a training support set $S_1$ and training query set $T_1$, a verification support set $S_2$ and verification query set $T_2$, and a test support set $S_3$ and test query set $T_3$;
(3) Constructing an integral double-flow network:
3a) Establishing a global flow network formed by connecting an attention depth embedding module A, a category-related attention module B and a measurement module C;
3b) Selecting an existing prototype network as the local flow network;
3c) Establishing a key area positioning module consisting of vector construction operation and greedy boundary search;
3d) The global flow network and the local flow network are connected through a key area positioning module to obtain an integral double-flow network;
(4) Using the training support set $S_1$ and training query set $T_1$ to train the whole double-flow network by the small sample scene training method, obtaining the trained double-flow network;
(5) Inputting the verification support set $S_2$ and verification query set $T_2$ into the trained double-flow network to fine-tune the network parameters, and saving the network with the highest index as the optimal double-flow network model;
(6) Inputting the test support set $S_3$ and test query set $T_3$ into the optimal double-flow network model to obtain the final classification result.
2. The method of claim 1, wherein the step (2) is implemented as follows:
2a) The three preprocessed remote sensing image data sets are divided respectively: the NWPU-RESISC45 data set is divided into a training set, a verification set and a test set in the ratio 25:10:10; the WHU-RS19 data set in the ratio 9:5:5; and the UC-Merced data set in the ratio 10:6:5;
2b) Randomly selecting C categories from the training set of each data set and randomly sampling K images from each category, the C × K images forming the training support set $S_1$; at the same time, randomly selecting N images from each of the C categories' remaining images to form the training query set $T_1$;
2c) Randomly selecting C categories from the verification set of each data set and randomly sampling K images from each category, the C × K images forming the verification support set $S_2$; then randomly selecting N images from each of the C categories' remaining images to form the verification query set $T_2$;
2d) Randomly selecting C categories from the test set of each data set and randomly sampling K images from each category, the C × K images forming the test support set $S_3$; randomly selecting N images from each of the C categories' remaining images to form the test query set $T_3$.
3. The method of claim 1, wherein the global flow network structure constructed in step (3 a) is as follows:
the attention depth embedding module A is formed by cascading a convolution layer, four convolution blocks, an average pooling layer and a 1 × 1 convolution layer with 128 channels; the convolution layer consists of 7 × 7 convolution filters with 64 channels and a 3 × 3 max-pooling operation; each convolution block consists of four 3 × 3 convolution filters, with a skip connection added after every two convolution filters; the numbers of channels of the first to fourth convolution blocks are 64, 128, 256 and 512 respectively;
the category-related attention module B is used for outputting the attention feature map M;
the measurement module C is used for outputting the similarity $\mathrm{sim}(q,c)$ between a given query image $q$ and class $c$;
the input end of the category-related attention module B is connected with the output end of the second convolution block of the attention depth embedding module A, the output end of the category-related attention module B and the output end of the last 1 × 1 convolution layer of the attention depth embedding module A are simultaneously connected to the measurement module C, and the output of the measurement module C is the output of the global flow network.
4. The method of claim 3, wherein the category-related attention module B computes the output attention feature map M according to the following formula:

$$M(x)=\mathrm{Sigmoid}\Big(\sum_i d_i f_i\Big)$$

where $f_i$ represents the features of the $i$-th input channel and $d_i$ represents the weight vector, calculated as follows:

$$d_i=\sigma\!\left(W_2\,\delta\!\left(W_1\cdot\frac{1}{N}\sum_{j=1}^{N}f_k(x_j)\otimes f_g(x_j)\right)\right)$$

where $\delta$ denotes the ReLU activation function, $\sigma$ denotes the Sigmoid activation function, $W_1$ and $W_2$ are the weights of the two fully connected layers, used respectively to reduce and expand the feature map dimensions, $N$ represents the total number of feature map pixels, $f_g(x_j)=W_g\cdot x_j$ is the compression feature, $f_k(x_j)=\mathrm{softmax}(W_k\cdot x_j)$ computes the attention feature map along pixel point $j$, and $\otimes$ represents matrix multiplication.
5. The method of claim 3, wherein the measurement module C computes the output similarity $\mathrm{sim}(q,c)$ between a given query image $q$ and class $c$ as follows: for each descriptor $x_i$, first find its $k$ nearest neighbors $\hat{x}_i^{1},\dots,\hat{x}_i^{k}$ in class $c$, then compute the similarity between $x_i$ and each $\hat{x}_i^{j}$; the descriptor similarities are weighted and summed using the attention map to obtain the similarity between the given query image $q$ and class $c$:

$$\mathrm{sim}(q,c)=\sum_{i=1}^{m}M(x_i)\sum_{j=1}^{k}\cos\!\left(x_i,\hat{x}_i^{j}\right)$$

where $M(x_i)$ is the response value of the attention feature map at location $x_i$, $x_i$ denotes the $i$-th descriptor of $q$, $m$ the total number of descriptors, $\hat{x}_i^{j}$ the $j$-th nearest neighbor of $x_i$ in class $c$, and $\cos(\cdot,\cdot)$ the cosine similarity between two vectors.
6. The method of claim 1, wherein step (3c) establishes the key area positioning module consisting of the vector construction operation and the greedy boundary search, as follows:
3c1) The vector construction operation aggregates the attention feature map into two one-dimensional structure energy vectors along the spatial height and width directions:

$$V_w(j)=\sum_{h=1}^{H}M(h,j),\qquad V_h(i)=\sum_{w=1}^{W}M(i,w)$$

where $V_h$ is the energy vector aggregated along the height direction, $V_w$ is the energy vector aggregated along the width direction, $M$ represents the obtained attention feature map, $M(i,w)$ represents the value of the feature map $M$ at position $(i,w)$, $M(h,j)$ represents its value at position $(h,j)$, $H$ represents the total height, and $W$ represents the total width;
3c2) The greedy boundary search quickly and accurately locates the most important one-dimensional region in each one-dimensional energy vector, obtaining the coordinate points $B=[x_a,x_b,y_a,y_b]$ of the region bounding box.
7. The method of claim 1, wherein step (4) utilizes the training support set $S_1$ and training query set $T_1$ to train the whole double-flow network by the small sample scene training method, implemented as follows:
4a) Set the maximum number of training iterations to 300000 and the initial learning rate to 0.0001, decaying the learning rate every 100000 iterations;
4b) Set the global flow network loss function as the sum of the center loss function and the improved cosine loss function, and set the local flow network loss function as the negative log-probability loss function;
4c) Input the training support set $S_1$ images and training query set $T_1$ images into the double-flow network in batches, and compute the value of the loss function from the image prediction class probabilities output by the global flow network and the local flow network;
4d) Back-propagate the loss value using the Adam algorithm and adjust the network parameters;
4e) Repeat steps (4c)-(4d) until the preset maximum number of training iterations is reached, obtaining the trained double-flow network.
CN202211128397.3A 2022-09-16 2022-09-16 Remote sensing image small sample scene classification method based on multi-scale double-flow architecture Pending CN115311502A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211128397.3A CN115311502A (en) 2022-09-16 2022-09-16 Remote sensing image small sample scene classification method based on multi-scale double-flow architecture


Publications (1)

Publication Number Publication Date
CN115311502A true CN115311502A (en) 2022-11-08

Family

ID=83866731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211128397.3A Pending CN115311502A (en) 2022-09-16 2022-09-16 Remote sensing image small sample scene classification method based on multi-scale double-flow architecture

Country Status (1)

Country Link
CN (1) CN115311502A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115828638A (en) * 2023-01-09 2023-03-21 西安深信科创信息技术有限公司 Automatic driving test scene script generation method and device and electronic equipment
CN115828638B (en) * 2023-01-09 2023-05-23 西安深信科创信息技术有限公司 Automatic driving test scene script generation method and device and electronic equipment
CN116597419A (en) * 2023-05-22 2023-08-15 宁波弗浪科技有限公司 Vehicle height limiting scene identification method based on parameterized mutual neighbors
CN116597419B (en) * 2023-05-22 2024-02-02 宁波弗浪科技有限公司 Vehicle height limiting scene identification method based on parameterized mutual neighbors
CN117468085A (en) * 2023-12-27 2024-01-30 浙江晶盛机电股份有限公司 Crystal bar growth control method and device, crystal growth furnace system and computer equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination