CN113705532B - Target detection method, device and equipment based on medium-low resolution remote sensing image - Google Patents

Target detection method, device and equipment based on medium-low resolution remote sensing image Download PDF

Info

Publication number
CN113705532B
Authority
CN
China
Prior art keywords
resolution image
network
low resolution
training
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111064462.6A
Other languages
Chinese (zh)
Other versions
CN113705532A (en)
Inventor
邹焕新
贺诗甜
李润林
曹旭
李美霖
成飞
魏娟
孙丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202111064462.6A priority Critical patent/CN113705532B/en
Publication of CN113705532A publication Critical patent/CN113705532A/en
Application granted granted Critical
Publication of CN113705532B publication Critical patent/CN113705532B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 - Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046 - Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 - Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a target detection method, device and equipment based on medium-low resolution remote sensing images. The method constructs a deep learning framework for medium-low resolution images based on the concept of knowledge distillation. The framework comprises a teacher network and a student network and is trained with original high-resolution images and the corresponding medium-low resolution images. During training, the original high-resolution image is used as a truth label to supervise the training of the super-resolution unit in the student network, and at the same time a high-resolution feature representation is provided as a feature truth value to supervise the training of the detection unit in the student network, thereby improving the detection performance of the trained target detection model on medium-low resolution remote sensing images.

Description

Target detection method, device and equipment based on medium-low resolution remote sensing image
Technical Field
The present disclosure relates to the field of remote sensing image resolution technologies, and in particular, to a method, an apparatus, and a device for detecting a target based on a middle-low resolution remote sensing image.
Background
Ship target detection in medium-low resolution remote sensing images (GSD > 10 m/pixel) is a valuable and challenging task, because medium-low resolution images do not provide the detection network with sufficient detail compared with high-resolution images. To address this challenge, some methods use image super-resolution as a preprocessing step to recover the detail information missing from the medium-low resolution image and thereby improve detection performance. However, these methods only use the high-resolution image as a truth label to supervise the training of the super-resolution module and do not use it in the detection module, which limits the performance of ship detection in medium-low resolution images.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, apparatus and device for detecting a target in a medium-low resolution image, which can fully utilize features of an original high resolution image.
A target detection method based on a medium-low resolution remote sensing image, the method comprising:
acquiring a training sample set, wherein the training sample set comprises a plurality of original high-resolution images and medium-low-resolution images;
the original high-resolution images and the corresponding middle-low resolution images form a training group, and the training group is input into a target detection model for iterative training, so that a trained target detection model is obtained;
the target detection model comprises a student network and a teacher network, the student network comprises a super-resolution unit and a detection unit, and the teacher network comprises a detection unit with the same structure as that in the student network; during iterative training, the original high-resolution image serves as a truth image to supervise training of the super-resolution unit, and high-resolution image features are provided to supervise training of the detection unit;
and acquiring a middle-low resolution remote sensing image to be detected, inputting the middle-low resolution remote sensing image into a trained target detection model, and detecting the target position in the middle-low resolution remote sensing image.
In one embodiment, the middle-low resolution image for training the target detection model is obtained by performing eight-times downsampling on an original high resolution image, and middle-low resolution images generated corresponding to the original high resolution images are combined into the training set.
In one embodiment, forming each of the original high-resolution images and the corresponding medium-low resolution image into a training group, inputting the training group into the target detection model and performing iterative training comprises:
the detection units in the student network and the teacher network comprise backbone networks and regional suggestion frame networks which are connected in sequence;
inputting the medium-low resolution image into the student network, wherein a super-resolution image corresponding to the medium-low resolution image is obtained through the super-resolution unit, corresponding medium-low resolution image features are extracted from the super-resolution image through the backbone network, and the medium-low resolution image features are predicted through the region proposal network and the detection head module to obtain a prediction result of the target position in the medium-low resolution image, the prediction result being the output of the target detection model;
inputting the original high-resolution image into the teacher network, wherein corresponding high-resolution image features are extracted from the original high-resolution image through the backbone network, and the high-resolution image features are then predicted through the region proposal network and the detection head module to obtain a prediction result of the target position in the original high-resolution image.
In one embodiment, when each of the original high-resolution images and the corresponding medium-low resolution image are formed into a training group and input into the target detection model for iterative training, the method further includes: calculating a loss function at each training iteration, and adjusting the parameters of the target detection model according to the loss function until the loss function converges;
the loss function includes: detecting a loss function, a super-resolution loss function and a distillation loss function;
the super-resolution loss function is obtained by calculation according to the super-resolution image and the corresponding high-resolution image;
and the distillation loss function is obtained by calculation according to the middle-low resolution image characteristics and the corresponding high resolution characteristics.
In one embodiment, before the distillation loss function is obtained by calculating the middle-low resolution image features and the corresponding high resolution features, the middle-low resolution image features and the corresponding high resolution features are reordered in the channel dimension, and then the reordered middle-low resolution image features and the corresponding high resolution features are subjected to instance regularization to reduce the domain difference between the two features.
In one embodiment, when the backbone networks in the student network and the teacher network respectively perform feature extraction on the middle-low resolution image and the original high resolution image, the four output feature layers of the feature map pyramid network of the backbone network output the middle-low resolution image features and the high resolution image features.
In one embodiment, the detection unit comprises a Faster-RCNN neural network or an SSD neural network;
the super-resolution unit adopts an RDN network.
The application also provides a target detection device based on the medium-low resolution remote sensing image, which comprises:
the training sample set acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a plurality of original high-resolution images and medium-low resolution images;
the target detection model training module is used for forming a training group from each of the original high-resolution images and the corresponding medium-low resolution images, inputting the training group into a target detection model for iterative training, and obtaining a trained target detection model;
the target detection model comprises a student network and a teacher network, the student network comprises a super-resolution unit and a detection unit, and the teacher network comprises a detection unit with the same structure as that in the student network; during iterative training, the original high-resolution image serves as a truth image to supervise training of the super-resolution unit, and high-resolution image features are provided to supervise training of the detection unit;
the target position detection module is used for acquiring a middle-low resolution remote sensing image to be detected, inputting the middle-low resolution remote sensing image into a trained target detection model, and detecting the target position in the middle-low resolution remote sensing image.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a training sample set, wherein the training sample set comprises a plurality of original high-resolution images and medium-low-resolution images;
the original high-resolution images and the corresponding middle-low resolution images form a training group, and the training group is input into a target detection model for iterative training, so that a trained target detection model is obtained;
the target detection model comprises a student network and a teacher network, the student network comprises a super-resolution unit and a detection unit, and the teacher network comprises a detection unit with the same structure as that in the student network; during iterative training, the original high-resolution image serves as a truth image to supervise training of the super-resolution unit, and high-resolution image features are provided to supervise training of the detection unit;
and acquiring a medium-low resolution remote sensing image to be detected, inputting the medium-low resolution remote sensing image into the trained target detection model, and detecting the target position in the medium-low resolution remote sensing image.

A computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a training sample set, wherein the training sample set comprises a plurality of original high-resolution images and medium-low-resolution images;
the original high-resolution images and the corresponding middle-low resolution images form a training group, and the training group is input into a target detection model for iterative training, so that a trained target detection model is obtained;
the target detection model comprises a student network and a teacher network, the student network comprises a super-resolution unit and a detection unit, and the teacher network comprises a detection unit with the same structure as that in the student network; during iterative training, the original high-resolution image serves as a truth image to supervise training of the super-resolution unit, and high-resolution image features are provided to supervise training of the detection unit;
and acquiring a middle-low resolution remote sensing image to be detected, inputting the middle-low resolution remote sensing image into a trained target detection model, and detecting the target position in the middle-low resolution remote sensing image.
According to the target detection method, device and equipment based on medium-low resolution remote sensing images, a deep learning framework for target detection in medium-low resolution images is constructed based on the concept of knowledge distillation. The framework comprises a teacher network and a student network and is trained with original high-resolution images and the corresponding medium-low resolution images. During training, the original high-resolution image is used as a truth label to supervise the training of the super-resolution unit in the student network, and at the same time a high-resolution feature representation is provided as a feature truth value to supervise the training of the detection unit in the student network, thereby improving the detection performance of the trained target detection model on medium-low resolution remote sensing images.
Drawings
FIG. 1 is a flow chart of a method of detecting targets in one embodiment;
FIG. 2 is a schematic diagram of the structure of a target detection model in one embodiment;
FIG. 3 is a schematic diagram of a super-resolution network in one embodiment;
FIG. 4 is a visual comparison of image features in a teacher network and a student network before and after re-ordering and regularization in one embodiment;
FIG. 5 is a corresponding distribution of image features in a teacher network and a student network before and after ordering in one embodiment;
FIG. 6 is a graph of the detection results of different algorithms on three scenarios;
FIG. 7 is a schematic diagram of a structure of an object detection device according to an embodiment;
fig. 8 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
As shown in fig. 1, a target detection method based on a middle-low resolution remote sensing image is provided, which comprises the following steps:
step S100, a training sample set is obtained, wherein the training sample set comprises a plurality of original high-resolution images and medium-low-resolution images.
Step S110, forming a training group by each original high-resolution image and a corresponding middle-low resolution image, inputting the training group into a target detection model, and performing iterative training on the training group to obtain a trained target detection model;
the target detection model comprises a student network and a teacher network, wherein the student network comprises a super-resolution unit and a detection unit, the teacher network comprises the detection unit with the same structure as that in the student network, and when iterative training is carried out, the original high-resolution image is used as a true image to supervise the super-resolution unit for training, and meanwhile, the high-resolution image characteristic supervision detection unit is provided for training.
Step S120, a middle-low resolution remote sensing image to be detected is obtained, and is input into a trained target detection model to detect the target position in the middle-low resolution remote sensing image.
In this embodiment, a feature distillation framework for target detection in medium-low resolution images, i.e., the target detection model, is built so that the original high-resolution image can be utilized simultaneously by the super-resolution unit and the detection unit of the student network in the target detection model. The features of the original high-resolution image are thus fully utilized and the performance of the target detection model is improved.
This application uses the concept of knowledge distillation. In its original form, knowledge distillation uses a larger model as the teacher network and a smaller model as the student network, with the same inputs fed to both. The knowledge learned by the teacher network (including response-based knowledge, feature-based knowledge, relation-based knowledge, etc.) is passed to the student through a distillation loss.
In this embodiment, the original high-resolution images and the medium-low resolution images may be remote sensing images whose targets are ships or vehicles; the method is described here taking ships as an example. The medium-low resolution images include medium-resolution images and low-resolution images.
In this embodiment, the middle-low resolution image for training the target detection model is obtained by performing eight-fold downsampling on the original high resolution image, and the middle-low resolution images generated corresponding to the original high resolution images form a training set. That is to say, the training set includes an original high-resolution image and a middle-low resolution image obtained by eight times downsampling the original high-resolution image.
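For illustration, the following sketch shows how such a training pair could be generated. The use of bicubic interpolation for the eight-fold downsampling and the tensor layout are assumptions; the text only specifies the downsampling factor.

```python
import torch
import torch.nn.functional as F

def make_training_pair(hr_image: torch.Tensor):
    """Build an (HR truth, LR input) pair by 8x downsampling.
    hr_image: (3, H, W) tensor with values in [0, 1]; H and W are assumed
    to be divisible by 8 (e.g. the 800 x 512 crops used in the experiments)."""
    hr = hr_image.unsqueeze(0)                       # add a batch dimension
    lr = F.interpolate(hr, scale_factor=0.125,       # eight-fold downsampling
                       mode="bicubic", align_corners=False)
    return hr.squeeze(0), lr.clamp(0, 1).squeeze(0)
```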
As shown in fig. 2, the target detection model includes a student network and a teacher network. The student network consists of a super-resolution unit and a detection unit, and the teacher network consists of a detection unit with the same structure as that in the student network. The detection unit comprises a backbone network and a region proposal network which are connected in sequence.
The step of forming each original high-resolution image and the corresponding middle-low resolution image into a training group and inputting the training group into a target detection model to carry out iterative training comprises the following steps:
The medium-low resolution image is input into the student network, where a super-resolution image corresponding to the medium-low resolution image is obtained through the super-resolution unit, the corresponding medium-low resolution image features are extracted from the super-resolution image through the backbone network, and the medium-low resolution image features are predicted through the region proposal network and the detection head module to obtain a prediction result of the target position in the medium-low resolution image; this prediction result is the output of the target detection model.
Specifically, in the student network, the medium-low resolution image $I_{LR} \in \mathbb{R}^{H \times W \times 3}$ is input into the super-resolution unit, which outputs the super-resolved image $I_{SR} \in \mathbb{R}^{\alpha H \times \alpha W \times 3}$, where $H$ and $W$ denote the height and width of the input image and $\alpha$ denotes the super-resolution factor. The structure of the super-resolution unit is shown in FIG. 3. The super-resolution module first extracts residual features $F_{res}$ using 8 residual dense blocks (RDBs) and fuses them with the original shallow feature $F_{0}$ by addition. The fused features are passed through a sub-pixel layer to generate a residual image prediction $I_{res}$, and the residual prediction is added to an upsampled copy of the input image to obtain the final super-resolution image $I_{SR}$. In the super-resolution unit, the high-resolution image is used as a truth label to penalize erroneous super-resolution results and supervise the training of the super-resolution module, as further described later.
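The following PyTorch sketch illustrates a super-resolution unit of the kind described above: a shallow feature $F_0$, 8 residual dense blocks producing $F_{res}$, additive fusion, a sub-pixel upsampling layer, and a global residual over an upsampled copy of the input. The layer widths, the number of convolutions per block, and the bicubic upsampling of the input are assumptions; the patent does not give the exact RDN configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RDB(nn.Module):
    """Simplified residual dense block (hypothetical layer sizes)."""
    def __init__(self, channels=64, growth=32, layers=4):
        super().__init__()
        self.convs = nn.ModuleList()
        c = channels
        for _ in range(layers):
            self.convs.append(nn.Conv2d(c, growth, 3, padding=1))
            c += growth
        self.fuse = nn.Conv2d(c, channels, 1)        # local feature fusion

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return x + self.fuse(torch.cat(feats, dim=1))  # local residual learning

class SuperResolutionUnit(nn.Module):
    """Sketch of the super-resolution unit (alpha = 8): shallow feature F0,
    8 RDBs producing Fres, additive fusion, sub-pixel upsampling, and a
    global residual over a bicubically upsampled copy of the input."""
    def __init__(self, channels=64, num_blocks=8, scale=8):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)            # F0
        self.blocks = nn.ModuleList([RDB(channels) for _ in range(num_blocks)])
        self.gff = nn.Conv2d(channels * num_blocks, channels, 1)    # global fusion -> Fres
        self.upsample = nn.Sequential(                               # sub-pixel layer
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.scale = scale

    def forward(self, lr):                            # lr: (B, 3, H, W)
        f0 = self.head(lr)
        feats, x = [], f0
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        fres = self.gff(torch.cat(feats, dim=1))
        fused = f0 + fres                             # add-fuse F0 and Fres
        i_res = self.upsample(fused)                  # residual image prediction
        i_up = F.interpolate(lr, scale_factor=self.scale,
                             mode="bicubic", align_corners=False)
        return i_up + i_res                           # I_SR = upsampled input + residual
```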
The image generated by the super-resolution unit is sent to the detection unit for target detection. Since the target detection model in this application can work with different detectors (including the Faster-RCNN neural network or the SSD neural network), Faster-RCNN is chosen as an example for description here. In the detection unit, the super-resolution image is first sent to a backbone network (e.g., ResNet101, ResNet50, HRNet, ResNeXt101) for feature extraction, and the extracted features are sent to a region proposal network (RPN) to extract high-quality proposal boxes. The proposal boxes, together with the features extracted by the backbone network, are fed into the subsequent layers for final detection box prediction, i.e., for predicting the target position in the medium-low resolution image.
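A possible composition of the student network is sketched below, where torchvision's Faster R-CNN (ResNet50-FPN) stands in for the backbone, RPN and detection head described above. The class name, the two-class (ship plus background) setting, and the way the super-resolution stage is injected are illustrative assumptions rather than the patent's exact implementation.

```python
import torch.nn as nn
import torchvision

class StudentNetwork(nn.Module):
    """Hypothetical student network: a super-resolution stage followed by a
    Faster R-CNN detector (backbone + RPN + detection head)."""
    def __init__(self, sr_unit: nn.Module, num_classes: int = 2):  # ship + background
        super().__init__()
        self.sr_unit = sr_unit
        self.detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
            num_classes=num_classes)

    def forward(self, lr_images, targets=None):
        # lr_images: list of (3, H, W) tensors; the detector expects a list of images
        sr_images = [self.sr_unit(img.unsqueeze(0)).squeeze(0) for img in lr_images]
        # returns a loss dict in training mode (targets required) or boxes in eval mode
        return self.detector(sr_images, targets)
```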
The original high-resolution image is input into the teacher network, corresponding high-resolution image features are extracted from the original high-resolution image through the backbone network, and the high-resolution image features are predicted through the region proposal network and the detection head module to obtain a prediction result of the target position in the original high-resolution image.
Specifically, in the target detection model the teacher network contains only a detection unit identical in structure to that of the student network; unlike the student network, its input is the original high-resolution image. The high-resolution feature representation generated by the teacher network provides informative knowledge, which is passed to the student network through feature distillation.
In this embodiment, the detection head modules in the teacher network and the student network are the detection head modules of the original Faster-RCNN: the features first pass through an RPN to extract high-quality proposal boxes, and the proposal boxes, together with the features extracted by the backbone network, are sent to the subsequent layers for final detection box prediction.
In this embodiment, knowledge is transferred from the teacher network to the student network by means of feature distillation, so that the student network can generate more informative features. The direct idea of feature distillation is to minimize the distance between the features generated by the teacher network and the student network using an L1 loss (the sum of absolute differences between corresponding pixels of the two feature maps) or an L2 loss (the sum of squared differences between corresponding pixels of the two feature maps). Here the features generated by the teacher network and the student network are, respectively, the high-resolution image features and the medium-low resolution image features extracted by the backbone network.
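As a baseline, this direct pixel-level feature distillation can be written as in the following minimal sketch; the function name and the detaching of the teacher features are assumptions made for illustration.

```python
import torch.nn.functional as F

def naive_feature_distillation(student_feat, teacher_feat, mode="l2"):
    """Direct pixel-wise distance between student and teacher feature maps
    (both of shape (B, C, H, W)); the baseline described above, before the
    channel sorting introduced below."""
    if mode == "l1":
        return F.l1_loss(student_feat, teacher_feat.detach())
    return F.mse_loss(student_feat, teacher_feat.detach())
```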
As shown in fig. 4(a) and fig. 5(a) (where fig. 4(a) shows the original features of the teacher and student networks for different channels, fig. 4(b) the channel-sorted features, and fig. 4(c) the instance-regularized features; fig. 5(a) shows the channel responses of the original features and fig. 5(b) those of the sorted features), the image features generated by the teacher network and the student network differ significantly within the same channel, and their distributions along the channel dimension also differ significantly. Therefore, directly constraining the image features with a pixel-level loss would make the student network focus on the differences in channel distribution and neglect learning the truly useful knowledge (the discriminative spatial distribution), thereby limiting the improvement in detection performance.
In view of the above, in this embodiment, a channel sorting method is provided to solve the channel difference problem, that is, before the distillation loss function is obtained by calculating according to the middle-low resolution image features and the corresponding high resolution features, the middle-low resolution image features and the corresponding high resolution features are reordered in the channel dimension, and then the re-ordered middle-low resolution image features and the corresponding high resolution features are subjected to instance regularization to reduce the domain difference between the two features.
Specifically, spatial global average pooling is first applied to obtain the response of each channel; fig. 4(a) shows the corresponding channel distributions of the teacher and student networks. The features are then sorted along the channel dimension according to the channel response values; the sorted channel distributions are shown in fig. 4(b). As can be seen from fig. 4(a) and fig. 4(b), the sorted features of the teacher and student networks have much more similar channel distributions. Instance regularization is then applied to further reduce the domain difference between the features generated by the teacher and student networks, thereby further improving the detection performance of the target detection model.
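A minimal sketch of this channel sorting and instance regularization, followed by an L2 feature-distillation loss over the FPN levels, is given below. The sorting direction, the use of F.instance_norm, and the function names are assumptions made for illustration; the patent may differ in these details.

```python
import torch
import torch.nn.functional as F

def sort_and_normalize(feat):
    """Sort channels by their spatially global-average-pooled response, then
    apply instance normalization. feat: (B, C, H, W). A sketch of the
    operation H(.) described above."""
    response = feat.mean(dim=(2, 3))                       # (B, C) channel responses
    order = response.argsort(dim=1, descending=True)       # per-sample channel order
    idx = order[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
    sorted_feat = torch.gather(feat, 1, idx)               # reorder channels
    return F.instance_norm(sorted_feat)                    # reduce the domain gap

def distill_loss(student_feats, teacher_feats):
    """L2 distance between sorted and normalized student/teacher FPN features
    (lists over the four FPN output levels)."""
    loss = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        loss = loss + F.mse_loss(sort_and_normalize(fs),
                                 sort_and_normalize(ft.detach()))
    return loss
```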
In this embodiment, when each original high-resolution image and the corresponding medium-low resolution image are formed into a training group and input into the target detection model for iterative training, the following is also performed: a loss function is calculated at each training iteration, and the parameters of the target detection model are adjusted according to the loss function until the loss function converges.
Specifically, the overall loss function of the target detection model may be defined as:

$$L_{overall} = L_{Det} + \lambda_1 L_{SR} + \lambda_2 L_{Distill} \qquad (1)$$

In formula (1), $L_{Det}$ is the detection loss, $L_{SR}$ the super-resolution loss, and $L_{Distill}$ the distillation loss. The detection loss $L_{Det}$ is the loss of the original Faster-RCNN. The super-resolution loss $L_{SR}$ is the L1 distance between the super-resolved image and the high-resolution truth image, i.e. the original high-resolution image:

$$L_{SR} = \left\| I_{SR} - I_{HR} \right\|_1 \qquad (2)$$

The distillation loss $L_{Distill}$ is the feature distillation loss described above:

$$L_{Distill} = \sum_{i=1}^{4} \left\| H\left(F_T^{i}\right) - H\left(F_S^{i}\right) \right\|_2 \qquad (3)$$

In formula (3), the medium-low resolution image features and the high-resolution image features output by the four output feature layers of the feature pyramid network (Feature Pyramid Network, FPN) of the backbone network are selected for feature distillation; $F_T^{i}$ and $F_S^{i}$ denote the features of the i-th FPN output layer of the teacher and student networks, respectively, and $H$ denotes the channel sorting and instance regularization operations described above.
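For reference, the total loss of formula (1) can be combined as in the following sketch; the default weights follow the best setting reported later in Table 4 and are otherwise assumptions.

```python
def overall_loss(det_loss, sr_loss, distill_loss, lambda1=1.0, lambda2=0.001):
    """Total training loss of equation (1): detection loss plus weighted
    super-resolution and distillation losses."""
    return det_loss + lambda1 * sr_loss + lambda2 * distill_loss
```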
In the final step S120, the medium-low resolution remote sensing image to be detected, obtained from actual measurement, is input into the student network of the trained target detection model, and the target position in the medium-low resolution remote sensing image is detected. The output of the target detection model is an image with detection boxes, and the detection boxes mark the positions of the targets.
Next, ablation experiments are performed to verify the effectiveness of the target detection model in this embodiment, and finally the target detection model is compared with several mainstream algorithms on public datasets.
(I) Datasets and implementation details
1 data set: experiments were performed on three data sets: HRSC2016, DOTA, and NWPUVHR-10.
HRSC2016 is a high-resolution remote sensing ship target detection dataset comprising 617 training images and 438 validation images, with image sizes ranging from 300 x 300 to 1500 x 900. In the experiments, all images were resampled to a size of 800 x 512 and used as the high-resolution truth images.
DOTA is a large-scale remote sensing target detection dataset comprising 15 target categories, including ships. In the experiments, the images were cropped into 512 x 512 slices, and the slices containing ship targets were selected as the experimental dataset. The final dataset includes 4164 training images and 1411 test images.
NWPUVHR-10 is a target detection dataset containing 10 categories. Likewise, the images were cropped to 512 x 512, and the slices containing ship targets were selected as experimental data, giving 249 training images and 52 validation images. In the experiments, the above images (slices) were used as high-resolution truth images and were downsampled by a factor of 8 to obtain the input medium-low resolution images. Data augmentation during training includes random flipping, rotation, color transformation, brightness transformation, and contrast transformation.
2. Training procedure
The teacher network and the student network are first pretrained, and feature distillation is then performed on the whole framework. During distillation, the teacher network parameters are frozen and only the student network is updated. The training stage runs for 24 epochs in total; the initial learning rate is set to 0.005 and is decayed to 0.1 times its current value every 10 epochs.
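The distillation stage described above might be organized as in the following sketch, in which the teacher parameters are frozen and only the student is updated. The optimizer choice, the momentum value, and the training_step helper are assumptions not specified in the text.

```python
import torch

def train_distillation(student, teacher, loader, epochs=24):
    """Sketch of the distillation stage: the teacher is frozen, the student
    is updated with SGD, and the learning rate decays by 0.1x every 10 epochs."""
    for p in teacher.parameters():
        p.requires_grad = False
    teacher.eval()
    optimizer = torch.optim.SGD(student.parameters(), lr=0.005, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    for _ in range(epochs):
        for hr, lr, targets in loader:
            # training_step is a hypothetical helper combining detection,
            # super-resolution and distillation losses as in equation (1)
            loss = student.training_step(lr, hr, teacher, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```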
3. Evaluation strategy
The experiment was evaluated using mAP50 and mAP75 as accuracy indicators.
(II) Ablation experiments
Ablation experiments were performed on the HRSC2016 dataset to verify the effectiveness and generality of the target detection model of the present application.
Table 1: detection Performance of different Experimental variants in the framework proposed by the present invention
1. Effect of the super-resolution unit on ship detection
To verify the gain that image super-resolution brings to ship detection, the super-resolution unit in the student network is first removed, the input medium-low resolution image is upsampled by bicubic interpolation, and the interpolated image is sent to the detection unit. The super-resolution unit is then used for upsampling instead, and the super-resolved image is sent to the detection unit. As shown in Table 1, directly inputting the medium-low resolution image for detection gives poor performance; the performance improves markedly once bicubic interpolation is introduced, and improves further once the super-resolved image is introduced. This set of experiments verifies the effectiveness of the super-resolution module in improving detection performance.
2. Feature distillation
To verify the effectiveness of feature distillation, the performance of the student network before and after feature distillation is compared. As shown in Table 1, with feature distillation the student network is optimized according to the high-resolution features provided by the teacher network, and a further performance improvement of 0.03-0.08 is achieved. Under the additional supervision of the teacher network, the student network can extract information-rich features from the super-resolution image, thereby improving ship detection performance.
3. Channel sorting and instance regularization
Table 2: influence of ordering and regularization operations on algorithm accuracy
To verify the effect of channel ordering and instance regularization, we compared the detection accuracy using L2 loss directly with the L2 loss after ordering and normalization operations. As can be seen from table 2, the detection accuracy is low by directly using L2 loss. Channel ordering and instance regularization can reduce differences of features extracted by a teacher network and a student network, so that the student network focuses on spatial differences of features and knowledge learning is promoted.
4. Selection of distillation positions
Table 3: detection precision of characteristic distillation by selecting different FPN output layers
The FPN has four output feature layers in total, the four features having different dimensions. In the experiment we compared the effect of distillation using different feature layers on the detection performance. It can be seen from table 3 that the choice of different feature layers does not have a great influence on the detection performance, indicating that the framework proposed by the present invention is insensitive to the choice of distillation location. Based on the experimental results of table 3, four output feature layers of FPN were selected for feature distillation in this application.
5. Weights of the loss functions
Table 4: influence of the selection of the super-parameters on the detection accuracy
The loss function of the proposed framework contains two hyper-parameters, $\lambda_1$ and $\lambda_2$, which respectively control the weights of the super-resolution loss and the distillation loss in the total loss. Table 4 examines the selection range of these hyper-parameters; it can be seen that the method of the present application achieves the best performance when $\lambda_1 = 1$ and $\lambda_2 = 0.001$.
(III) Comparative experiments
In the comparative experiments, the framework proposed in the present application is applied to Faster-RCNN and compared with mainstream target detection algorithms such as HTC, DetectoRS, RepPoints and GFL. The experimental results and analysis are as follows.
1. Quantitative analysis
Table 5: detection performance of different algorithms on three data sets, HRSC2016, DOTA, and NWPUVHR-10. Wherein, the fast-RCNN is selected as the detection module of the framework provided by the invention, and the reasoning time is the average reasoning time of a single image when the input image on the HRSC2016 data set is 100 x 64 size
The detection accuracy of the proposed algorithm and the comparison algorithms is shown in Table 5. It can be seen that the method of the present application achieves a large performance improvement over the original Faster-RCNN algorithm and outperforms the comparison algorithms on all three datasets. On the NWPUVHR-10 dataset, some comparison algorithms achieve performance similar to ours, but their parameter counts are far higher than that of our method. Meanwhile, even though HTC, DetectoRS and the method in this application can all be regarded as improved versions of Faster-RCNN, HTC and DetectoRS cannot achieve better performance because their input is a bicubic-interpolated image lacking detail information. In contrast, by introducing the super-resolution module and feature distillation, our method recovers the detail information missing from the medium-low resolution image and thereby improves ship target detection performance.
2. Qualitative analysis
Fig. 6 shows the detection results of different algorithms on three scenes, where "HR" denotes the original high-resolution image and "GT" denotes the ground-truth annotation of the target. The detection results of the different algorithms are marked with boxes in the figure, and selected regions are enlarged for easier observation and analysis.
Fig. 6 shows the detection results of different algorithms on images of three scenes. The proposed method predicts the target boxes more accurately and produces fewer false alarms and missed detections. For example, where two ships in scene A are docked closely side by side, GFL, RepPoints and HTC cannot accurately distinguish the two ships, and Faster-RCNN and DetectoRS may generate false alarms. In contrast, the proposed method accurately detects both ships without generating false alarms. Scenes B and C further demonstrate the superior performance of the method in detecting ships in medium-low resolution images.
In the target detection method based on medium-low resolution remote sensing images described above, a feature distillation framework, i.e., the target detection model, is proposed for the task of target detection in medium-low resolution images, in which the high-resolution image is used as the truth image to supervise the training of the super-resolution unit while the high-resolution features supervise the training of the detection unit, further improving network performance. A series of ablation experiments verifies the effectiveness and generality of the proposed target detection model, showing that it can be applied to different target detection algorithms with consistent performance gains. Finally, the target detection model is applied to Faster-RCNN and compared with current mainstream target detection algorithms; the comparison results on three public datasets show that the proposed target detection model further improves the performance of super-resolution-based target detection and achieves better performance than the mainstream algorithms.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are likewise not necessarily performed in sequence, and may be performed in turn or alternately with at least a portion of other steps or of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided an object detection device based on a low-and-medium resolution remote sensing image, including: a training sample set acquisition module 200, a target detection model training module 210, and a target location detection module 220, wherein:
a training sample set obtaining module 200, configured to obtain a training sample set, where the training sample set includes a plurality of original high-resolution images and medium-low resolution images;
the target detection model training module 210 is configured to form a training set from each of the original high-resolution images and the corresponding middle-low resolution images, input the training set into a target detection model, and perform iterative training on the training set to obtain a trained target detection model;
the target detection model comprises a student network and a teacher network, the student network comprises a super-resolution unit and a detection unit, and the teacher network comprises a detection unit with the same structure as that in the student network; during iterative training, the original high-resolution image serves as a truth image to supervise training of the super-resolution unit, and high-resolution image features are provided to supervise training of the detection unit;
the target position detection module 220 is configured to obtain a middle-low resolution remote sensing image to be detected, input the middle-low resolution remote sensing image to a trained target detection model, and detect a target position in the middle-low resolution remote sensing image.
For specific limitation of the target detection device based on the middle-low resolution remote sensing image, reference may be made to the limitation of the target detection method based on the middle-low resolution remote sensing image hereinabove, and the description thereof will not be repeated here. The modules in the target detection device based on the medium-low resolution remote sensing image can be all or partially realized by software, hardware and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to realize a target detection method based on the medium-low resolution remote sensing image. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
acquiring a training sample set, wherein the training sample set comprises a plurality of original high-resolution images and medium-low-resolution images;
the original high-resolution images and the corresponding middle-low resolution images form a training group, and the training group is input into a target detection model for iterative training, so that a trained target detection model is obtained;
the target detection model comprises a student network and a teacher network, the student network comprises a super-resolution unit and a detection unit, and the teacher network comprises a detection unit with the same structure as that in the student network; during iterative training, the original high-resolution image serves as a truth image to supervise training of the super-resolution unit, and high-resolution image features are provided to supervise training of the detection unit;
and acquiring a middle-low resolution remote sensing image to be detected, inputting the middle-low resolution remote sensing image into a trained target detection model, and detecting the target position in the middle-low resolution remote sensing image.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a training sample set, wherein the training sample set comprises a plurality of original high-resolution images and medium-low-resolution images;
the original high-resolution images and the corresponding middle-low resolution images form a training group, and the training group is input into a target detection model for iterative training, so that a trained target detection model is obtained;
the target detection model comprises a student network and a teacher network, the student network comprises a super-resolution unit and a detection unit, and the teacher network comprises a detection unit with the same structure as that in the student network; during iterative training, the original high-resolution image serves as a truth image to supervise training of the super-resolution unit, and high-resolution image features are provided to supervise training of the detection unit;
and acquiring a middle-low resolution remote sensing image to be detected, inputting the middle-low resolution remote sensing image into a trained target detection model, and detecting the target position in the middle-low resolution remote sensing image.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (7)

1. The target detection method based on the medium-low resolution remote sensing image is characterized by comprising the following steps of:
acquiring a training sample set, wherein the training sample set comprises a plurality of original high-resolution images and medium-low-resolution images;
inputting each original high-resolution image and the corresponding medium-low resolution image as a training group into a target detection model and performing iterative training to obtain a trained target detection model, wherein the target detection model comprises a student network and a teacher network, the student network comprises a super-resolution unit and a detection unit, the teacher network comprises a detection unit with the same structure as that in the student network, and during iterative training the original high-resolution image serves as a truth image to supervise training of the super-resolution unit while high-resolution image features are also provided to supervise training of the detection unit, wherein the iterative training of the target detection model comprises: the detection units in the student network and the teacher network each comprise a backbone network and a region proposal network connected in sequence; the medium-low resolution image is input into the student network, a super-resolution image corresponding to the medium-low resolution image is obtained through the super-resolution unit, corresponding medium-low resolution image features are extracted from the super-resolution image through the backbone network, and the medium-low resolution image features are predicted through the region proposal network and a detection head module to obtain a prediction result of a target position in the medium-low resolution image, the prediction result being the output of the target detection model; the original high-resolution image is input into the teacher network, corresponding high-resolution image features are extracted from the original high-resolution image through the backbone network, and the high-resolution image features are predicted through the region proposal network and the detection head module to obtain a prediction result of a target position in the original high-resolution image;
the method further comprises the following steps when the target detection model is subjected to iterative training: calculating a loss function when each iteration training is carried out, adjusting parameters of a target detection model according to the loss function until the loss function converges, wherein the loss function comprises a distillation loss function obtained by calculating according to the middle-low resolution image features and the corresponding high resolution features, and reordering the middle-low resolution image features and the corresponding high resolution features in a channel dimension before calculating the distillation loss function, and carrying out instance regularization on the reordered middle-low resolution image features and the corresponding high resolution features to reduce domain differences between the two features;
and acquiring a middle-low resolution remote sensing image to be detected, inputting the middle-low resolution remote sensing image into a trained target detection model, and detecting the target position in the middle-low resolution remote sensing image.
2. The object detection method according to claim 1, wherein the middle-low resolution image for training the object detection model is obtained by eight-fold downsampling of original high resolution images, and middle-low resolution images generated corresponding to the original high resolution images are combined into the training set.
3. The method for detecting a target according to claim 2, wherein,
the loss function further includes: detecting a loss function and a super-resolution loss function;
the super-resolution loss function is obtained by calculation according to the super-resolution image and the corresponding high-resolution image;
and the distillation loss function is obtained by calculation according to the middle-low resolution image characteristics and the corresponding high resolution characteristics.
4. The object detection method according to claim 3, wherein when the backbone network in the student network and the teacher network performs feature extraction on the middle-low resolution image and the original high resolution image, the middle-low resolution image features and the high resolution image features are output by four output feature layers of a feature map pyramid network of the backbone network.
5. The method according to any one of claims 1 to 4, wherein the detection unit comprises a Faster-RCNN neural network or an SSD neural network;
the super resolution unit adopts an RDN network.
6. An object detection device based on a medium-low resolution remote sensing image, which is characterized by comprising:
the training sample set acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a plurality of original high-resolution images and medium-low resolution images;
the target detection model training module is used for inputting each original high-resolution image and the corresponding medium-low resolution image as a training group into a target detection model and performing iterative training to obtain a trained target detection model, wherein the target detection model comprises a student network and a teacher network, the student network comprises a super-resolution unit and a detection unit, the teacher network comprises a detection unit with the same structure as that in the student network, and during iterative training the original high-resolution image serves as a truth image to supervise training of the super-resolution unit while high-resolution image features are also provided to supervise training of the detection unit, wherein the iterative training of the target detection model comprises: the detection units in the student network and the teacher network each comprise a backbone network and a region proposal network connected in sequence; the medium-low resolution image is input into the student network, a super-resolution image corresponding to the medium-low resolution image is obtained through the super-resolution unit, corresponding medium-low resolution image features are extracted from the super-resolution image through the backbone network, and the medium-low resolution image features are predicted through the region proposal network and a detection head module to obtain a prediction result of a target position in the medium-low resolution image, the prediction result being the output of the target detection model; the original high-resolution image is input into the teacher network, corresponding high-resolution image features are extracted from the original high-resolution image through the backbone network, and the high-resolution image features are predicted through the region proposal network and the detection head module to obtain a prediction result of a target position in the original high-resolution image;
the iterative training of the target detection model further comprises: calculating a loss function at each training iteration and adjusting the parameters of the target detection model according to the loss function until the loss function converges, wherein the loss function comprises a distillation loss function calculated from the medium-low resolution image features and the corresponding high-resolution features; before the distillation loss function is calculated, the medium-low resolution image features and the corresponding high-resolution features are reordered along the channel dimension, and instance normalization is applied to the reordered medium-low resolution image features and the corresponding high-resolution features to reduce the domain difference between the two sets of features;
a target position detection module, configured to acquire a medium-low resolution remote sensing image to be detected, input the medium-low resolution remote sensing image into the trained target detection model, and detect the target position in the medium-low resolution remote sensing image.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of claim 5 when executing the computer program.
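The claims above describe the training data preparation only in prose. The following is a minimal illustrative sketch of the training-pair generation of claim 2, in which each original high-resolution image is downsampled by a factor of eight to produce its medium-low resolution counterpart. The bicubic resampling mode and the directory layout are assumptions made for illustration; the patent does not specify them.

```python
# Hypothetical sketch of the training-set generation of claim 2.
# Assumptions: Pillow, PNG inputs, bicubic resampling.
from pathlib import Path

from PIL import Image


def build_training_pairs(hr_dir: str, lr_dir: str, factor: int = 8) -> None:
    """Create (high-resolution, medium-low-resolution) training pairs."""
    out = Path(lr_dir)
    out.mkdir(parents=True, exist_ok=True)
    for hr_path in sorted(Path(hr_dir).glob("*.png")):
        hr = Image.open(hr_path).convert("RGB")
        lr_size = (hr.width // factor, hr.height // factor)
        # Eight-fold downsampling; bicubic resampling is an assumption.
        lr = hr.resize(lr_size, Image.BICUBIC)
        lr.save(out / hr_path.name)


if __name__ == "__main__":
    build_training_pairs("data/hr_images", "data/lr_images")
```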
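As a hedged sketch of the student-branch feature extraction of claims 4 to 6: the super-resolved image passes through a backbone whose four stage outputs feed a feature pyramid network, yielding the four output feature layers mentioned in claim 4. The ResNet-50 backbone, the torchvision building blocks (recent torchvision assumed), and the identity placeholder standing in for the RDN super-resolution unit are illustrative assumptions, not the patented implementation.

```python
# Illustrative student-branch sketch: SR unit -> backbone -> four-level FPN.
from collections import OrderedDict
from typing import Optional

import torch
import torch.nn as nn
import torchvision
from torchvision.models._utils import IntermediateLayerGetter
from torchvision.ops import FeaturePyramidNetwork


class StudentFeatureExtractor(nn.Module):
    """Super-resolution unit followed by a backbone with a four-level FPN."""

    def __init__(self, sr_unit: Optional[nn.Module] = None):
        super().__init__()
        # Placeholder for the RDN super-resolution unit of claim 5.
        self.sr_unit = sr_unit if sr_unit is not None else nn.Identity()
        resnet = torchvision.models.resnet50(weights=None)
        # Expose the four ResNet stages as inputs to the FPN.
        self.body = IntermediateLayerGetter(
            resnet,
            return_layers={"layer1": "0", "layer2": "1",
                           "layer3": "2", "layer4": "3"},
        )
        self.fpn = FeaturePyramidNetwork(
            in_channels_list=[256, 512, 1024, 2048], out_channels=256)

    def forward(self, lr_image: torch.Tensor) -> "OrderedDict[str, torch.Tensor]":
        sr_image = self.sr_unit(lr_image)   # super-resolve the input first
        stages = self.body(sr_image)        # four backbone stage outputs
        return self.fpn(stages)             # four FPN feature maps


if __name__ == "__main__":
    feats = StudentFeatureExtractor()(torch.randn(1, 3, 256, 256))
    print([(name, tuple(f.shape)) for name, f in feats.items()])
```

The teacher branch would reuse the same backbone and FPN on the original high-resolution image, without the super-resolution unit, so that student and teacher features are comparable level by level.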
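Finally, a minimal sketch of the composite loss of claims 3 and 6 (detection loss + super-resolution loss + distillation loss). The claims state only that the student and teacher features are reordered along the channel dimension and instance-normalized before the distillation term is computed; the concrete reordering rule (sorting channels by mean activation) and the L1 form of the super-resolution and distillation terms are assumptions for illustration.

```python
# Hedged sketch of the composite loss of claims 3 and 6.
import torch
import torch.nn.functional as F


def _reorder_channels(feat: torch.Tensor) -> torch.Tensor:
    """Sort channels of an (N, C, H, W) feature map by mean activation (assumed rule)."""
    order = feat.mean(dim=(2, 3)).argsort(dim=1, descending=True)        # (N, C)
    index = order[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
    return torch.gather(feat, dim=1, index=index)


def distillation_loss(student_feat: torch.Tensor,
                      teacher_feat: torch.Tensor) -> torch.Tensor:
    """Reorder, instance-normalize, then compare student and teacher features."""
    s = F.instance_norm(_reorder_channels(student_feat))
    t = F.instance_norm(_reorder_channels(teacher_feat))
    return F.l1_loss(s, t)


def total_loss(det_loss: torch.Tensor,
               sr_image: torch.Tensor, hr_image: torch.Tensor,
               student_feats: list[torch.Tensor],
               teacher_feats: list[torch.Tensor]) -> torch.Tensor:
    """Detection + super-resolution + distillation terms (L1 terms are assumptions)."""
    sr_loss = F.l1_loss(sr_image, hr_image)           # super-resolution term
    kd_loss = sum(distillation_loss(s, t)             # one term per FPN level
                  for s, t in zip(student_feats, teacher_feats))
    return det_loss + sr_loss + kd_loss
```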
CN202111064462.6A 2021-09-10 2021-09-10 Target detection method, device and equipment based on medium-low resolution remote sensing image Active CN113705532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111064462.6A CN113705532B (en) 2021-09-10 2021-09-10 Target detection method, device and equipment based on medium-low resolution remote sensing image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111064462.6A CN113705532B (en) 2021-09-10 2021-09-10 Target detection method, device and equipment based on medium-low resolution remote sensing image

Publications (2)

Publication Number Publication Date
CN113705532A CN113705532A (en) 2021-11-26
CN113705532B (en) 2023-05-23

Family

ID=78659927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111064462.6A Active CN113705532B (en) 2021-09-10 2021-09-10 Target detection method, device and equipment based on medium-low resolution remote sensing image

Country Status (1)

Country Link
CN (1) CN113705532B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581781B (en) 2022-05-05 2022-08-09 之江实验室 Target detection method and device for high-resolution remote sensing image
CN114596497B (en) * 2022-05-09 2022-08-19 北京世纪好未来教育科技有限公司 Training method of target detection model, target detection method, device and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063565A (en) * 2018-06-29 2018-12-21 Institute of Information Engineering, Chinese Academy of Sciences Low-resolution face recognition method and device
CN112766087A (en) * 2021-01-04 2021-05-07 Wuhan University Optical remote sensing image ship detection method based on knowledge distillation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140177706A1 (en) * 2012-12-21 2014-06-26 Samsung Electronics Co., Ltd Method and system for providing super-resolution of quantized images and video

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063565A (en) * 2018-06-29 2018-12-21 Institute of Information Engineering, Chinese Academy of Sciences Low-resolution face recognition method and device
CN112766087A (en) * 2021-01-04 2021-05-07 Wuhan University Optical remote sensing image ship detection method based on knowledge distillation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ShipSRDet: An End-to-End Remote Sensing Ship Detector Using Super-Resolved Feature Representation; Huanxin Zou et al.; IGARSS 2021, https://doi.org/10.48550/arXiv.2103.09699; full text *

Also Published As

Publication number Publication date
CN113705532A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN113705532B (en) Target detection method, device and equipment based on medium-low resolution remote sensing image
KR20190128724A (en) Target recognition methods, devices, storage media and electronic devices
JP5500163B2 (en) Image processing system, image processing method, and image processing program
CN110211046B (en) Remote sensing image fusion method, system and terminal based on generation countermeasure network
CN107886082B (en) Method and device for detecting mathematical formulas in images, computer equipment and storage medium
CN110427871B (en) Fatigue driving detection method based on computer vision
CN115019370A (en) Depth counterfeit video detection method based on double fine-grained artifacts
CN111275743B (en) Target tracking method, device, computer readable storage medium and computer equipment
CN110766027A (en) Image area positioning method and training method of target area positioning model
CN111709876B (en) Image splicing method, device, equipment and storage medium
CN110751061B (en) SAR image recognition method, device, equipment and storage medium based on SAR network
CN112396594A (en) Change detection model acquisition method and device, change detection method, computer device and readable storage medium
US20110116718A1 (en) System and method for establishing association for a plurality of images and recording medium thereof
CN113435384B (en) Target detection method, device and equipment for medium-low resolution optical remote sensing image
CN116468753A (en) Target tracking method, apparatus, device, storage medium, and program product
CN114359232B (en) Image change detection method and device based on context covariance matrix
CN111582013A (en) Ship retrieval method and device based on gray level co-occurrence matrix characteristics
CN116612272A (en) Intelligent digital detection system for image processing and detection method thereof
CN115760908A (en) Insulator tracking method and device based on capsule network perception characteristics
CN116128919A (en) Multi-temporal image abnormal target detection method and system based on polar constraint
CN115937089A (en) Training detection method based on improved YOLOV5 focus detection model
CN115546610A (en) Infrared small target detection method based on multi-mechanism attention collaborative fusion contrast
CN111275039B (en) Water gauge character positioning method, device, computing equipment and storage medium
CN114463300A (en) Steel surface defect detection method, electronic device, and storage medium
CN112862002A (en) Training method of multi-scale target detection model, target detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant