CN115546660B

CN115546660B - Target detection method, device and equipment based on video satellite data

Info

Publication number: CN115546660B
Application number: CN202211486764.7A
Authority: CN
Inventors: 赵宏杰; 陆川; 谭真
Original assignee: Chengdu Guoxing Aerospace Technology Co ltd
Current assignee: Chengdu Guoxing Aerospace Technology Co ltd
Priority date: 2022-11-25
Filing date: 2022-11-25
Publication date: 2023-04-07
Anticipated expiration: 2042-11-25
Also published as: CN115546660A

Abstract

The application discloses a target detection method, device and equipment based on video satellite data. Because video satellite data are large in size and rich in information, a content discrimination network first judges whether a frame to be detected is consistent with a key frame. If the two are consistent, the target is searched for directly; if they are inconsistent, the target is identified by a detection network. This frame-skipping detection strategy improves the target detection speed. Meanwhile, the target detection model is obtained by improving the conventional SSD detection network, so that it adapts well to the multi-scale problem of multiple targets and detects targets at different scales. The hyper-parameter feature convolution layer in the target detection model carries bottom-layer features up to the higher layers to facilitate target identification, and the deconvolution layer introduces additional context information from the previous scale when target detection is performed, which improves the accuracy of object detection.

Description

Target detection method, device and equipment based on video satellite data

Technical Field

The present application relates to the field of image processing, and in particular, to a method, an apparatus, and a device for detecting a target based on video satellite data.

Background

With the continuous development of space technology in China, great progress has been made in video satellite technology. Ground staring data acquired by satellites support many basic applications in fields such as industry, agriculture and traffic. However, satellite video data suffer from low resolution, image jitter, severe distortion at the image periphery, small targets and similar problems. These problems strongly affect the rapid detection of targets in large scenes, and the overall detection and identification speed is low.

SSD (Single Shot MultiBox Detector) is a target detection algorithm proposed by Wei Liu et al. at ECCV 2016. It is a regression-based deep convolutional neural network that performs target detection and identification directly on the whole image, which greatly reduces network complexity and computation time. The detection process is simple, and the position and category of a target are predicted simultaneously for the input image. Compared with algorithms such as R-CNN, SSD has an obvious speed advantage and is currently one of the mainstream detection frameworks. However, the existing SSD has low accuracy when detecting small targets in video satellite data.

Disclosure of Invention

The application mainly aims to provide a target detection method, a device and equipment based on video satellite data, and aims to solve the technical problem that the target detection accuracy of the existing SSD algorithm on satellite video images is low.

In order to achieve the above object, the present application provides a target detection method based on video satellite data, including:

acquiring video satellite data to be detected; the video satellite data comprises a plurality of frames to be detected;

on the basis of a preset content discrimination network, carrying out consistency detection on the frame to be detected and a preset key frame; the key frame comprises a marked target to be detected;

if the content of the frame to be detected is inconsistent with the content of the key frame, inputting the frame to be detected into a pre-trained target detection model for target identification to obtain a target detection result of the frame to be detected; the target detection model is obtained based on training of an improved SSD detection network, and the improved SSD detection network comprises a hyper-parameter feature convolution layer and a deconvolution layer;

and if the content of the frame to be detected is consistent with that of the key frame, searching a local target by using the key frame to obtain a target detection result of the frame to be detected.

Optionally, before the step of performing consistency detection on the frame to be detected and a preset key frame based on a preset content discrimination network, the method further includes:

acquiring a satellite video image with a label;

constructing the content discrimination network by using the satellite video image with the label; the content discrimination network comprises a parameter-shared convolutional network and a full connection layer.

Optionally, after the step of inputting the frame to be detected into a pre-trained target detection model for target recognition to obtain a target detection result of the frame to be detected if the content of the frame to be detected is inconsistent with the content of the key frame, the method further includes:

and taking the frame to be detected as a new key frame.
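
The frame-skipping strategy described in the steps above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the patented implementation: the function `detect_video` and the callables `is_consistent`, `detect` and `local_search` are hypothetical stand-ins for the content discrimination network, the target detection model and the local target search, and the toy frames are plain scene labels.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def detect_video(
    frames: Sequence[object],
    key_frame: object,
    key_boxes: List[Box],
    is_consistent: Callable[[object, object], bool],
    detect: Callable[[object], List[Box]],
    local_search: Callable[[object, object, List[Box]], List[Box]],
) -> Tuple[List[List[Box]], int]:
    """Run frame-skipping detection; return per-frame boxes and the detector call count."""
    results: List[List[Box]] = []
    detector_calls = 0
    for frame in frames:
        if is_consistent(key_frame, frame):
            # Content unchanged: skip the detector and search near the key-frame targets.
            boxes = local_search(key_frame, frame, key_boxes)
        else:
            # Content changed: run the full detector, then promote this frame to key frame.
            boxes = detect(frame)
            detector_calls += 1
            key_frame, key_boxes = frame, boxes
        results.append(boxes)
    return results, detector_calls


# Toy stand-ins (hypothetical): a "frame" is just a scene label.
frames = ["A", "A", "A", "B", "B", "A"]
toy_boxes = {"A": [(0, 0, 4, 4)], "B": [(10, 10, 14, 14)]}
results, calls = detect_video(
    frames,
    key_frame="A",
    key_boxes=toy_boxes["A"],
    is_consistent=lambda key, frame: key == frame,
    detect=lambda frame: toy_boxes[frame],
    local_search=lambda key, frame, boxes: boxes,
)
print(calls)  # 2: the detector runs only when the scene changes
```

In this toy run the detector is invoked for two of the six frames; the remaining frames reuse the current key frame, which is where the speed gain of the strategy comes from.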

Optionally, the target detection model includes a trained hyper-parameter feature convolution layer, a deconvolution layer, and a convolutional layer; the step of inputting the frame to be detected into a pre-trained target detection model for target recognition to obtain a target detection result of the frame to be detected comprises the following steps:

inputting the frame to be detected;

performing convolution and deconvolution operations on the frame to be detected through the trained hyper-parameter feature convolution layer, the convolution layer and the deconvolution layer to obtain the features of the frame to be detected;

and obtaining a target detection result of the frame to be detected according to the characteristics.

Optionally, before the step of inputting the frame to be detected into a pre-trained target detection model for target recognition to obtain a target detection result of the frame to be detected, the method further includes:

acquiring the standard SSD detection network;

scaling the convolutional layers Conv1, Conv2 and Conv3 of the SSD detection network, and unifying the features of the convolutional layers Conv1, Conv2 and Conv3 into a hyper-parameter feature convolution layer Conv4;

pooling the features of the convolutional layers Conv5 and Conv6 and adding them to obtain convolutional layers Conv7 and Conv8, respectively;

pooling the convolutional layer Conv8 to obtain a convolutional layer Conv9;

performing deconvolution operation on the convolutional layer Conv9 and the convolutional layer Conv8 to obtain a deconvolution layer Deconv8 and a deconvolution layer Deconv9 respectively;

the improved SSD detection network is obtained.

Optionally, after the step of obtaining the improved SSD detection network, the method further includes:

acquiring video satellite data with a label;

carrying out data processing on the video satellite data with the tag to obtain a data set;

and training the improved SSD detection network by using the data set to obtain the target detection model.

Optionally, the step of training the improved SSD detection network by using the data set to obtain the target detection model includes:

optimizing the improved SSD detection network by a loss function (the formula image is not reproduced in this text) defined over the following quantities:

wherein x is a target discrimination variable, c is a confidence, l is a prediction box, g is a ground-truth box, and N is the number of default boxes; the remaining terms of the formula are, in order, the confidence loss function, the localization loss function, the weight of the localization loss function, the quality control loss function of the deconvolution layer Deconv8 and the deconvolution layer Deconv9, and the weight of the quality control loss function.
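
The loss formula itself is missing from this text. As a hedged sketch only, assuming the standard SSD objective extended by a weighted quality term, one possible reading is given below. The symbols L_conf, L_loc, L_q, alpha and beta are assumed notation that does not appear in the source, and whether the quality term is also divided by N cannot be determined from the text.

```latex
% Assumed notation: L_{conf} = confidence loss, L_{loc} = localization loss,
% L_{q} = quality control loss of Deconv8 and Deconv9, \alpha and \beta = weights.
L(x, c, l, g) = \frac{1}{N}\Bigl( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \Bigr) + \beta \, L_{q}
```

With beta set to zero this reduces to the familiar SSD loss, which is why this form is a natural candidate.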

Optionally, the expression of the quality control loss function (the formula image is not reproduced in this text) is defined over the following quantities:

wherein M₁ and M₂ each represent the number of feature channels of a layer, and the remaining terms are, in order, the normalization result of the convolutional layer Conv7, the normalization result of the convolutional layer Conv8, the normalization result of the deconvolution layer Deconv9, and the normalization result of the deconvolution layer Deconv8.
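
A minimal sketch of how such a quality control loss could be computed, assuming it is a reconstruction error between normalized convolution and deconvolution features averaged over the channel count. The pairing of Conv7 with Deconv9 and Conv8 with Deconv8, the L2 normalization and the squared-distance form are all assumptions made for illustration; the function names are hypothetical.

```python
import math
from typing import List

FeatureMap = List[List[float]]  # one flat list of activations per channel


def l2_normalize(channel: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a channel to unit L2 norm (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in channel))
    return [v / (norm + eps) for v in channel]


def reconstruction_error(conv: FeatureMap, deconv: FeatureMap) -> float:
    """Squared distance between normalized features, averaged over channels."""
    if len(conv) != len(deconv):
        raise ValueError("channel counts must match")
    total = 0.0
    for c_chan, d_chan in zip(conv, deconv):
        c_n, d_n = l2_normalize(c_chan), l2_normalize(d_chan)
        total += sum((a - b) ** 2 for a, b in zip(c_n, d_n))
    return total / len(conv)


def quality_loss(
    conv7: FeatureMap, deconv9: FeatureMap, conv8: FeatureMap, deconv8: FeatureMap
) -> float:
    """Sum of the two per-scale reconstruction errors (assumed pairing)."""
    return reconstruction_error(conv7, deconv9) + reconstruction_error(conv8, deconv8)


# Conv7 and Deconv9 differ only in magnitude, so their normalized error is ~0;
# Conv8 and Deconv8 are orthogonal unit vectors, so their error is ~2.
loss = quality_loss([[3.0, 4.0]], [[6.0, 8.0]], [[1.0, 0.0]], [[0.0, 1.0]])
print(round(loss, 6))  # 2.0
```

Normalizing before comparing makes the error insensitive to overall activation scale, so only the pattern of the reconstructed features is penalized.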

In addition, to achieve the above object, the present application further provides a target detection apparatus based on video satellite data, comprising:

the video satellite data acquisition module to be detected is used for acquiring video satellite data to be detected; the video satellite data comprises a plurality of frames to be detected;

the consistency detection module is used for carrying out consistency detection on the frame to be detected and a preset key frame based on a preset content discrimination network; the key frame comprises a marked target to be detected;

the target detection model module is used for inputting the frame to be detected into a pre-trained target detection model for target identification to obtain a target detection result of the frame to be detected if the content of the frame to be detected is inconsistent with that of the key frame; the target detection model is obtained based on training of an improved SSD detection network, and the improved SSD detection network comprises a hyper-parameter feature convolution layer and a deconvolution layer;

and the local target searching module is used for searching a local target by using the key frame if the content of the frame to be detected is consistent with that of the key frame, so as to obtain a target detection result of the frame to be detected.

In addition, to achieve the above object, the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor executes the computer program, so as to implement the above method.

The beneficial effects of the present application are as follows. The embodiments of the application provide a target detection method, device and equipment based on video satellite data. Video satellite data to be detected are acquired, the video satellite data comprising a plurality of frames to be detected. Based on a preset content discrimination network, consistency detection is carried out on the frame to be detected and a preset key frame, the key frame comprising a marked target to be detected. If the content of the frame to be detected is inconsistent with the content of the key frame, the frame to be detected is input into a pre-trained target detection model for target identification to obtain a target detection result of the frame to be detected; the target detection model is obtained by training an improved SSD detection network, and the improved SSD detection network comprises a hyper-parameter feature convolution layer and a deconvolution layer. If the content of the frame to be detected is consistent with that of the key frame, a local target search is performed using the key frame to obtain the target detection result of the frame to be detected.
Because video satellite data are large in volume and size and rich in information, consistency between the key frame and the frame to be detected is first judged by the content discrimination network: if they are consistent, the target is searched for directly, and if they are inconsistent, the target is identified by the detection network. This frame-skipping detection strategy improves the target detection speed. Meanwhile, the target detection model used for detection is obtained by improving and training the existing SSD detection network, so it adapts well to the multi-scale problem of multiple targets and detects targets at different scales. The hyper-parameter feature convolution layer in the improved SSD detection network structure carries bottom-layer features up to the higher layers to facilitate target identification. The deconvolution layer in the target detection model introduces additional context information from the previous scale when target detection is performed, which improves the accuracy of object detection. In other words, the hyper-parameter feature convolution layer and the deconvolution layer make the detection pay more attention to small targets, which improves the overall target detection accuracy.

Drawings

FIG. 1 is a schematic diagram of a computer device in a hardware operating environment according to an embodiment of the present application;

FIG. 2 is a schematic flowchart of a target detection method based on video satellite data according to an embodiment of the present application;

FIG. 3 is a schematic functional module diagram of a target detection apparatus based on video satellite data according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a training sample key frame of a content discrimination network based on video satellite data according to an embodiment of the present application;

FIG. 5 is a schematic diagram of frame a to be detected in a training sample of a content discrimination network based on video satellite data according to an embodiment of the present application;

FIG. 6 is a schematic diagram of frame b to be detected in a training sample of a content discrimination network based on video satellite data according to an embodiment of the present application;

FIG. 7 is a schematic diagram of frame c to be detected in a training sample of a content discrimination network based on video satellite data according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a content discrimination network based on video satellite data according to an embodiment of the present application;

FIG. 9 is a schematic diagram of an improved SSD detection network based on video satellite data according to an embodiment of the present application;

FIG. 10 is a schematic diagram of the construction flow of a deconvolution module of an improved SSD detection network based on video satellite data according to an embodiment of the present application;

FIG. 11 is a schematic diagram of the detection effect of a target detection model based on video satellite data according to an embodiment of the present application;

FIG. 12 is a schematic diagram of the aircraft detection accuracy of a target detection model based on video satellite data according to an embodiment of the present application;

FIG. 13 is a schematic diagram of the aircraft detection recall rate of a target detection model based on video satellite data according to an embodiment of the present application;

FIG. 14 is a schematic diagram of the detection effect of an SSD detection network based on video satellite data according to an embodiment of the present application;

FIG. 15 is a schematic diagram of the aircraft detection accuracy of an SSD detection network based on video satellite data according to an embodiment of the present application;

FIG. 16 is a schematic diagram of the aircraft detection recall rate of an SSD detection network based on video satellite data according to an embodiment of the present application.

The realization of the objects, the functional features and the advantages of the present application will be further explained with reference to the embodiments and the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The main solution of the embodiments of the application is as follows. A target detection method, device and equipment based on video satellite data are provided. Video satellite data to be detected are acquired, the video satellite data comprising a plurality of frames to be detected. Based on a preset content discrimination network, consistency detection is carried out on the frame to be detected and a preset key frame, the key frame comprising a marked target to be detected. If the content of the frame to be detected is inconsistent with the content of the key frame, the frame to be detected is input into a pre-trained target detection model for target identification to obtain a target detection result of the frame to be detected; the target detection model is obtained by training an improved SSD detection network, and the improved SSD detection network comprises a hyper-parameter feature convolution layer and a deconvolution layer. If the content of the frame to be detected is consistent with that of the key frame, a local target search is performed using the key frame to obtain the target detection result of the frame to be detected.

In the prior art, with the continuous development of space technology in China, great progress has been made in video satellite technology. Ground staring data acquired by satellites support many basic applications in fields such as industry, agriculture and traffic. However, satellite video data suffer from low resolution, image jitter, severe distortion at the image periphery, small targets and similar problems. These problems strongly affect the rapid detection of targets in large scenes, and the overall detection and identification speed is low.

SSD (Single Shot MultiBox Detector) is a target detection algorithm proposed by Wei Liu et al. at ECCV 2016. It is a regression-based deep convolutional neural network that performs target detection and identification directly on the whole image, which greatly reduces network complexity and computation time. The detection process is simple, and the position and category of a target are predicted simultaneously for the input image. Compared with algorithms such as R-CNN, SSD has an obvious speed advantage and is currently one of the mainstream detection frameworks. However, the existing SSD has low accuracy when detecting small targets in video satellite data; at the same time, video satellite data are large in size and rich in information, so detection is relatively slow.

Because video satellite data are large in volume and size and rich in information, consistency between the key frame and the frame to be detected is first judged by a content discrimination network: if they are consistent, the target is searched for directly, and if they are inconsistent, the target is identified by the detection network. This frame-skipping detection strategy improves the target detection speed. Meanwhile, the target detection model used for detection is obtained by improving and training the existing SSD detection network, so it adapts well to the multi-scale problem of multiple targets and detects targets at different scales. The hyper-parameter feature convolution layer in the improved SSD detection network structure carries bottom-layer features up to the higher layers to facilitate target identification. The deconvolution layer in the target detection model introduces additional context information from the previous scale when target detection is performed, which improves the accuracy of object detection. In other words, the hyper-parameter feature convolution layer and the deconvolution layer make the detection pay more attention to small targets, which improves the overall target detection accuracy. Furthermore, the reconstruction error of the deconvolution is used as part of the loss function, so that the quality of the deconvolution is monitored while targets are detected, which safeguards the target detection accuracy.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a computer device in a hardware operating environment according to an embodiment of the present application.

As shown in FIG. 1, the computer device may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 enables communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a random access memory (RAM) or a non-volatile memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the configuration shown in FIG. 1 is not intended to be limiting of computer devices and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a data storage module, a network communication module, a user interface module, and an electronic program.

In the computer device shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 may be provided in the computer device, and the computer device calls, through the processor 1001, the target detection apparatus based on video satellite data stored in the memory 1005 and performs the target detection method based on video satellite data provided in the embodiments of the present application.

Referring to fig. 2, based on the hardware device of the foregoing embodiment, an embodiment of the present application provides a target detection method based on video satellite data, including:

S10: acquiring video satellite data to be detected; the video satellite data comprises a plurality of frames to be detected;

In a specific implementation, a video satellite is a new type of earth observation satellite. Compared with a traditional earth observation satellite, its most distinctive feature is that it can perform staring observation of a given area and, by recording video, obtain more dynamic information than a traditional satellite. It is therefore particularly suitable for observing dynamic targets and analyzing their transient characteristics. Commonly used video satellites include the Jilin series of satellites. When target detection is performed, every frame of data needs to be detected. However, ground targets in video satellite image data are usually relatively small, so the accuracy of target identification and detection is usually low.

S20: carrying out consistency detection on the frame to be detected and a preset key frame based on a preset content discrimination network; the key frame comprises a marked target to be detected;

In a specific implementation, the key frame is a video frame image containing targets, and the initial key frame can be specified manually or at random. The targets to be detected are marked in the key frame. If each target in the current frame to be detected overlaps in position with the corresponding target in the key frame, the target contents of the two frames are judged to be consistent; otherwise they are judged to be inconsistent. FIGS. 4-7 are schematic diagrams of training samples of the content discrimination network. As shown by the key frame image content in FIG. 4, this embodiment has 4 targets to be detected, represented by 4 color blocks. FIGS. 5-7 show the image contents of different frames to be detected: a solid color block represents the position of a target in the current frame to be detected, and a dashed box represents the position of the corresponding target in FIG. 4. Every target in frame a to be detected overlaps in position with its counterpart in the key frame, so the consistency detection result is consistent. In frame b and frame c to be detected, the 4 targets do not overlap in position with those in the key frame, so the consistency detection result is inconsistent.
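
The overlap rule used to label training pairs can be sketched as follows. This is a minimal illustration, assuming that targets are axis-aligned boxes, that corresponding targets share the same list index, and that any positive overlap area counts as overlapping; the function names `boxes_overlap` and `label_consistent` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def label_consistent(key_boxes: List[Box], frame_boxes: List[Box]) -> bool:
    """Consistent only if the target counts match and every corresponding pair overlaps."""
    if len(key_boxes) != len(frame_boxes):
        return False
    return all(boxes_overlap(k, f) for k, f in zip(key_boxes, frame_boxes))


key = [(0, 0, 10, 10), (20, 20, 30, 30)]
frame_a = [(2, 2, 12, 12), (21, 19, 31, 29)]      # both targets moved slightly
frame_b = [(15, 15, 19, 19), (21, 19, 31, 29)]    # first target no longer overlaps
print(label_consistent(key, frame_a))  # True
print(label_consistent(key, frame_b))  # False
```

Treating a changed target count as inconsistent is a design choice of this sketch; it matches the cases of targets entering or leaving the field of view.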

As an optional implementation manner, before the step of carrying out consistency detection on the frame to be detected and the preset key frame based on the preset content discrimination network, the method further includes: acquiring a satellite video image with a label; constructing the content discrimination network by using the satellite video image with the label; the content discrimination network comprises a parameter-shared convolutional network and a fully connected layer.

In a specific implementation, before consistency detection is performed, a plurality of labeled satellite video images are acquired and used to construct the content discrimination network shown in FIG. 8. The convolutional networks in the content discrimination network share parameters. Two frames of images are input into the network to obtain features of the same dimensionality output by the fully connected layer, and a similarity measure between the features is learned from the label information in order to classify the pair as consistent or inconsistent.
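
The parameter-sharing idea can be illustrated with a toy two-branch comparison. This is a sketch under strong assumptions: a single linear map with ReLU stands in for the shared convolutional network and fully connected layer, cosine similarity with a fixed threshold stands in for the learned similarity measure, and the weights are hand-picked rather than trained. All names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def embed(frame: Vector, shared_weights: List[Vector]) -> Vector:
    """Stand-in for the shared branch: one linear map plus ReLU.
    The same weights are applied to every input, which is the parameter sharing."""
    return [max(0.0, sum(w * x for w, x in zip(row, frame))) for row in shared_weights]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def is_consistent(
    key_frame: Vector, frame: Vector, shared_weights: List[Vector], threshold: float = 0.9
) -> bool:
    """Embed both frames with the same weights and compare the two features."""
    f_key = embed(key_frame, shared_weights)
    f_cur = embed(frame, shared_weights)
    return cosine_similarity(f_key, f_cur) >= threshold


# Hand-picked weights (untrained): each output sums one half of a 4-pixel "frame".
weights = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
key = [1.0, 1.0, 0.0, 0.0]
same = [1.0, 0.9, 0.0, 0.1]   # nearly the same content
moved = [0.0, 0.0, 1.0, 1.0]  # content shifted to the other half
print(is_consistent(key, same, weights))   # True
print(is_consistent(key, moved, weights))  # False
```

Because both frames pass through identical weights, their features lie in the same space and can be compared directly; in a trained network the weights and the decision rule would be learned from the labeled pairs.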

S30: if the content of the frame to be detected is inconsistent with the content of the key frame, inputting the frame to be detected into a pre-trained target detection model for target identification to obtain a target detection result of the frame to be detected; the target detection model is obtained based on training of an improved SSD detection network, and the improved SSD detection network comprises a hyper-parameter feature convolution layer and a deconvolution layer;

In a specific implementation, the targets are ground objects such as airplanes and automobiles. The data obtained by a video satellite are large in size and rich in information, and ground-object targets such as airplanes and automobiles are small relative to the video satellite data; the hyper-parameter feature convolution layer and the deconvolution layer in the improved SSD detection network improve the accuracy of small target detection and identification. When the content of the frame to be detected is inconsistent with the content of the key frame, the features of the targets have changed drastically in the frame to be detected. This may be because a new target has appeared in the current frame (for example, a new airplane flies into the field of view), because the number of targets in the current frame has decreased (for example, an airplane flies out of the field of view), or because the shape of a target has changed considerably (for example, the flight angle of a target airplane deflects in three dimensions and its flight attitude changes). After the inconsistency is determined, the frame to be detected is input into the pre-trained target detection model for target identification.

As an optional implementation manner, before the step of inputting the frame to be detected into a pre-trained target detection model for target recognition to obtain a target detection result of the frame to be detected, the method further includes: acquiring the standard SSD detection network; scaling the convolutional layers Conv1, Conv2 and Conv3 of the SSD detection network, and unifying the features of the convolutional layers Conv1, Conv2 and Conv3 into a hyper-parameter feature convolution layer Conv4; pooling the features of the convolutional layers Conv5 and Conv6 and adding them to obtain convolutional layers Conv7 and Conv8, respectively; pooling the convolutional layer Conv8 to obtain a convolutional layer Conv9; performing deconvolution operation on the convolutional layer Conv9 and the convolutional layer Conv8 to obtain a deconvolution layer Deconv8 and a deconvolution layer Deconv9 respectively; the improved SSD detection network is obtained.

In a specific implementation, the existing basic SSD network extracts high-dimensional features by repeatedly applying convolution operations to the original image in order to identify targets. Each convolution yields higher-order features but also further compresses the image resolution. For video satellite data, where the targets in the image are small, the feature information of the targets of interest is severely lost after several convolutions and the targets can no longer be identified, so the target detection accuracy is low. Therefore, the basic framework of the SSD detection network is improved to obtain the improved SSD detection network, which raises the detection accuracy of the network on small targets. The improved SSD detection network can be named the QDSSD (Quality deviation Single Shot MultiBox Detector) detection network.

FIG. 9 is a schematic diagram of the improved SSD detection network structure provided in an embodiment of the present application, where C denotes a convolutional layer, DC denotes a deconvolution layer, and Pool denotes a pooling layer. In the figure, C4 (convolutional layer Conv4) is the hyper-parameter feature convolution layer: the features of C1 (convolutional layer Conv1), C2 (convolutional layer Conv2) and C3 (convolutional layer Conv3) are unified by scale conversion and then stacked. The features of this layer are thus drawn from the convolution features of the preceding layers, and subsequent feature learning continues on this basis.
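
A minimal sketch of building a stacked feature layer from several earlier layers. The text does not specify the scale conversion method, so this example assumes max pooling down to a common spatial size followed by concatenation along the channel axis; `max_pool` and `build_hyper_feature` are hypothetical names and the feature maps are toy values.

```python
from typing import List

Channel = List[List[float]]   # H x W
FeatureMap = List[Channel]    # C x H x W


def max_pool(channel: Channel, factor: int) -> Channel:
    """Downsample one channel by taking the max over factor x factor blocks."""
    h, w = len(channel) // factor, len(channel[0]) // factor
    return [
        [
            max(
                channel[i * factor + di][j * factor + dj]
                for di in range(factor)
                for dj in range(factor)
            )
            for j in range(w)
        ]
        for i in range(h)
    ]


def build_hyper_feature(maps: List[FeatureMap], target_size: int) -> FeatureMap:
    """Bring every map to target_size x target_size, then stack along channels."""
    stacked: FeatureMap = []
    for fmap in maps:
        factor = len(fmap[0]) // target_size
        for channel in fmap:
            stacked.append(max_pool(channel, factor) if factor > 1 else channel)
    return stacked


c1 = [[[float(4 * i + j) for j in range(4)] for i in range(4)]]  # 1 channel, 4x4
c2 = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]        # 2 channels, 2x2
c3 = [[[9.0, 9.0], [9.0, 9.0]]]                                  # 1 channel, 2x2
c4 = build_hyper_feature([c1, c2, c3], target_size=2)
print(len(c4))  # 4 channels: 1 + 2 + 1
print(c4[0])    # [[5.0, 7.0], [13.0, 15.0]]
```

Stacking rather than summing keeps the features of each source layer separately addressable, at the cost of a wider channel dimension for the layers that follow.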

The C5 (convolutional layer Conv5) features are pooled and added to C7 (convolutional layer Conv7), and the C6 (convolutional layer Conv6) features are pooled and added to C8 (convolutional layer Conv8). This enriches the bottom-layer information of C7 and C8 and increases their information content. The step introduces no extra parameters, so it enriches the target convolution features while reducing the parameter space of the whole network.

In actual target detection, each convolution takes in the features learned by the previous convolution and, through nonlinear transformation, keeps passing them on to the subsequent detection layers. The features of small targets are therefore preserved even after many convolutions, and the flow of useful information improves. Bottom-layer features are enriched into high-layer features, and information is complemented across different high layers, which facilitates target identification.
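
The pooled addition can be sketched as a pooling step followed by an element-wise sum, which involves no trainable weights. This is an illustration only: it assumes 2x2 max pooling, equal channel counts, and a higher-level map exactly half the size of the lower-level one; `pool2x2` and `add_pooled` are hypothetical names.

```python
from typing import List

Channel = List[List[float]]  # H x W


def pool2x2(channel: Channel) -> Channel:
    """2x2 max pooling with stride 2 (assumed pooling type)."""
    return [
        [
            max(channel[i][j], channel[i][j + 1], channel[i + 1][j], channel[i + 1][j + 1])
            for j in range(0, len(channel[0]) - 1, 2)
        ]
        for i in range(0, len(channel) - 1, 2)
    ]


def add_pooled(lower: List[Channel], higher: List[Channel]) -> List[Channel]:
    """Pool the lower-level map and add it element-wise to the higher-level map."""
    pooled = [pool2x2(ch) for ch in lower]
    return [
        [[a + b for a, b in zip(row_p, row_h)] for row_p, row_h in zip(ch_p, ch_h)]
        for ch_p, ch_h in zip(pooled, higher)
    ]


c5 = [[[1.0, 2.0, 3.0, 4.0],
       [5.0, 6.0, 7.0, 8.0],
       [9.0, 10.0, 11.0, 12.0],
       [13.0, 14.0, 15.0, 16.0]]]      # 1 channel, 4x4
c7_conv = [[[0.5, 0.5], [0.5, 0.5]]]   # 1 channel, 2x2
c7 = add_pooled(c5, c7_conv)
print(c7)  # [[[6.5, 8.5], [14.5, 16.5]]]
```

Neither function holds any weights, which reflects the point in the text that the enrichment adds information without adding parameters.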

Fig. 10 is a schematic diagram of the construction flow of the deconvolution module of the improved SSD detection network according to an embodiment of the present application. C8 (convolutional layer Conv8) is pooled to obtain the C9 (convolutional layer Conv9) convolution features, and deconvolution operations on the features of C9 and C8 yield DC8 (deconvolution layer Deconv8) and DC9 (deconvolution layer Deconv9), respectively. Taking the convolutional layer Conv8 as an example, when detection is performed at Conv8, the features of Conv8 are combined, after a nonlinear transformation, with the features of the deconvolution layer Deconv8 at the same scale, and category regression and position regression are performed jointly. Meanwhile, to control the quality of the features, C8 and DC8 are each cropped: a Crop Layer transforms the high-channel convolution features to a dimension matching the number of input samples, and global optimization is then performed with a quality control loss function (Quality Loss). Deconvolving the high-level semantic features transforms them into the specified scale space, so that they also take part in object detection.

The improved SSD detection network is constructed with the VGG convolutional network of the classic SSD network as the backbone for feature extraction, and convolution features from different layers are used as inputs to the detection layers, so the network is well suited to the multi-scale problem of multiple targets and detects targets at different scales. From the perspective of feature extraction, the bottom-layer features of a convolutional network tend to express basic image information such as texture and shape: the closer a layer is to the input, the more basic the extracted features are, capturing the basic detail of the target. High-layer features have an increasingly large receptive field and therefore tend to express abstract properties of the object: the closer a layer is to the output, the more abstract the extracted features are. Constructing the hyper-parameter feature convolution layer adds bottom-layer features to the high-layer feature space through nonlinear transformation, enriching the high-layer features to facilitate target identification. Constructing the deconvolution layers introduces additional context information from the previous scale when target detection is performed, which improves the accuracy of object detection.

As an optional implementation, after the step of obtaining the improved SSD detection network, the method further includes: acquiring labeled video satellite data; performing data processing on the labeled video satellite data to obtain a data set; and training the improved SSD detection network with the data set to obtain the target detection model.

In a specific implementation, the improved SSD detection network needs to be trained after it is obtained. The labeled video satellite data is processed to form a data set comprising a training set and a test set, and the improved SSD detection network is trained with the data set to obtain the target detection model.

As an optional implementation, the step of training the improved SSD detection network by using the data set to obtain the target detection model includes: optimizing the improved SSD detection network by a loss function as follows:

wherein $x$ is the target discrimination variable, $c$ is the confidence, $l$ is the prediction box, $g$ is the ground-truth box, $N$ is the number of default boxes, $L_{conf}$ is the confidence loss function, $L_{loc}$ is the localization loss function, $\alpha$ is the weight of the localization loss function $L_{loc}$, and $\beta$ is the weight of $L_{quality}$.

In a specific implementation, the improved SSD detection network is trained with the loss function $L$, where $x$ is used to judge whether a designed feature capture box has a corresponding target: $x_{ij}^{p} \in \{0, 1\}$ indicates whether the i-th box matches the bounding box of the j-th target of the p-th object class, taking the value 1 for a match and 0 otherwise, and $\sum_{i} x_{ij}^{p} \geq 1$ indicates that at least one box matches the j-th target bounding box. The loss function of the standard SSD detection network consists of the confidence loss function $L_{conf}$ (also called the class loss function) and the localization loss function $L_{loc}$. To train the parameters of the deconvolution layers of the improved SSD detection network better, the reconstruction error of a stacked autoencoder is introduced as the deconvolution quality control function, i.e. $L_{quality}$.
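One plausible reading of the training loss, assembled from the symbol definitions in this section and from the well-known objective of the standard SSD detection network, is sketched below. The placement of the term $\beta L_{quality}$ outside the $1/N$ normalization is an assumption; the text confirms only that the loss combines $L_{conf}$, $L_{loc}$ weighted by $\alpha$, and $L_{quality}$ weighted by $\beta$.

```latex
L(x, c, l, g) = \frac{1}{N}\Bigl(L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g)\Bigr) + \beta\, L_{quality}
```

Under this reading, setting $\beta = 0$ recovers the standard SSD objective, so the quality term acts purely as an additional regularizer on the deconvolution layers.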

As an alternative embodiment, the expression of the quality control loss function is:

wherein $M_1$ and $M_2$ each denote the number of feature channels of a layer, and the remaining terms are, in order, the normalized result of the convolutional layer Conv7, the normalized result of the convolutional layer Conv8, the normalized result of the deconvolution layer Deconv9, and the normalized result of the deconvolution layer Deconv8.

In a specific implementation, a feature reconstruction error function is introduced as the quality control loss function $L_{quality}$ for the feature layers obtained by deconvolution, namely the quality control cost function of the deconvolution layer Deconv8 and the deconvolution layer Deconv9. Because the VGG convolutional network applies no normalization to its convolution features, each dimension of the deconvolution features can take any real value, so the cost function easily diverges when it is solved and training becomes difficult. Therefore, before the reconstruction error is computed, a Batch Normalization (BN) operation is applied to each of the corresponding convolutional layers, so that the inputs are normalized to the range [0, 1] before every error calculation.
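A minimal sketch of a reconstruction-error style quality loss is given below, purely for illustration. Several points are assumptions rather than statements of the text: the pairing of Conv7 with Deconv9 and of Conv8 with Deconv8 (inferred from the order in which the normalized results are listed), the squared-error form averaged over channels, and the use of min-max scaling as a simple stand-in for the Batch Normalization step. The helper names are hypothetical.

```python
from typing import List

Channels = List[List[float]]   # list of channels, each a flattened feature map


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def reconstruction_error(conv: Channels, deconv: Channels) -> float:
    """Squared error between normalized features, averaged over channels."""
    if len(conv) != len(deconv):
        raise ValueError("channel counts must match (the text crops to align them)")
    total = 0.0
    for conv_ch, deconv_ch in zip(conv, deconv):
        a = min_max_normalize(conv_ch)
        b = min_max_normalize(deconv_ch)
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / len(conv)


def quality_loss(conv7: Channels, deconv9: Channels,
                 conv8: Channels, deconv8: Channels) -> float:
    """Assumed form: sum of the two per-pair reconstruction errors."""
    return reconstruction_error(conv7, deconv9) + reconstruction_error(conv8, deconv8)


conv7 = [[0.0, 2.0, 4.0]]
deconv9 = [[1.0, 3.0, 5.0]]   # same pattern after normalization, so zero error
conv8 = [[0.0, 1.0]]
deconv8 = [[1.0, 0.0]]        # reversed pattern, so an error of 2.0

loss = quality_loss(conv7, deconv9, conv8, deconv8)
```

Normalizing both inputs before taking the difference bounds each squared term by 1, which reflects the stated motivation of keeping the cost function from diverging.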

As an optional implementation manner, after the step of inputting the frame to be detected into a pre-trained target detection model for target recognition to obtain a target detection result of the frame to be detected if the content of the frame to be detected is inconsistent with the content of the key frame, the method further includes: and taking the frame to be detected as a new key frame.

In the specific implementation process, when the content of the frame to be detected is determined to be inconsistent with the content of the key frame, the frame to be detected needs to be used as a new key frame, and the target needs to be locked again for target identification of the subsequent frame to be detected.

As an optional implementation manner, the target detection model includes a trained hyper-parametric feature convolutional layer, a deconvolution layer and a convolutional layer; the step of inputting the frame to be detected into a pre-trained target detection model for target recognition to obtain a target detection result of the frame to be detected comprises the following steps: inputting the frame to be detected; performing convolution and deconvolution operations on the frame to be detected through the trained hyper-parameter feature convolution layer, the convolution layer and the deconvolution layer to obtain the features of the frame to be detected; and obtaining a target detection result of the frame to be detected according to the characteristics.

In a specific implementation, the convolutional layers, the hyper-parameter feature convolution layer and the deconvolution layers of the improved target detection model perform convolution and deconvolution on the input frame to be detected, extract rich and accurate feature information, and classify it to obtain a more accurate target detection result.

S40: If the content of the frame to be detected is consistent with that of the key frame, performing a local target search using the key frame to obtain a target detection result of the frame to be detected.

In a specific implementation, when the contents of the frame to be detected and the key frame are consistent, the characteristics of the target to be detected have not changed much (for example, the general shape of a target aircraft is unchanged), so the target detection result of the frame to be detected can be obtained simply by locating the position to which the target has moved, i.e., by performing a local target search. This embodiment introduces a content discrimination network suited to the characteristics of satellite video and uses a frame-skipping detection strategy to increase the target detection speed.
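The frame-skipping strategy can be sketched as the control loop below. The function `detect_video` and its three callables (`is_consistent` for the content discrimination network, `full_detect` for the target detection model, `local_search` for the local target search) are hypothetical placeholders, and treating the first frame as the initial key frame is an assumption; the text speaks of a preset key frame containing a marked target.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max)


def detect_video(
    frames: Sequence[object],
    is_consistent: Callable[[object, object], bool],
    full_detect: Callable[[object], List[Box]],
    local_search: Callable[[object, object, List[Box]], List[Box]],
) -> List[List[Box]]:
    """Run frame-skipping detection and return one box list per frame."""
    if not frames:
        return []
    key_frame = frames[0]
    key_boxes = full_detect(key_frame)
    results = [key_boxes]
    for frame in frames[1:]:
        if is_consistent(key_frame, frame):
            # Content unchanged: only relocate the known targets.
            boxes = local_search(key_frame, frame, key_boxes)
        else:
            # Content changed: run the full detector and promote this frame
            # to be the new key frame.
            boxes = full_detect(frame)
            key_frame, key_boxes = frame, boxes
        results.append(boxes)
    return results


# Toy usage: frames are integers and the "content" changes with the tens digit.
calls = {"full": 0, "local": 0}


def toy_consistent(key: int, frame: int) -> bool:
    return key // 10 == frame // 10


def toy_full_detect(frame: int) -> List[Box]:
    calls["full"] += 1
    return [(frame, frame, frame + 1, frame + 1)]


def toy_local_search(key: int, frame: int, key_boxes: List[Box]) -> List[Box]:
    calls["local"] += 1
    return key_boxes


results = detect_video([1, 2, 3, 11, 12], toy_consistent, toy_full_detect, toy_local_search)
```

In the toy run the full detector is invoked only for the first frame and for the frame whose content changes, which is where the speed gain of frame skipping comes from.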

In order to show more clearly that the improved SSD detection network in this embodiment can improve the detection accuracy of the small target, the present application also performs experimental comparison between the target detection model and the standard SSD model, which specifically includes the following steps:

In the experimental comparison, the model was trained and tested on a Linux (Ubuntu 16.04) system. The hardware environment was an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, an NVIDIA GeForce GTX TITAN X GPU with 12GB of video memory, and 128GB of memory. Under Ubuntu 16.04, the GPU version of the Caffe deep learning framework (written by Yangqing Jia et al. of UC Berkeley) was installed as the basic experiment platform, and tools such as Python 3.6.0 and OpenCV 3.0 were configured for auxiliary development. The video satellite data used is 30 seconds long at 25 frames per second; all shooting locations are airports at specified longitudes and latitudes, and the scenes contain both moving and stationary aircraft, which are the detection targets in the experiment. Target detection data for the target detection model and the standard SSD model were obtained under the same conditions, yielding the data graphs of Figs. 11-16. Here, IOU is the overlap rate between a detection result and the ground truth, and PROB is the probability value that a detection result belongs to a given target class. The recall rate (Recall Rate) measures coverage and is used to evaluate the ability of the trained classifier to recognize samples; it is the ratio of correctly detected targets to ground-truth targets. The precision rate (Precision Rate) is the ratio of the number of correctly detected target samples to the number of samples reported as detection targets.
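For reference, the three evaluation quantities defined above can be computed as in the following sketch. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the example counts are assumptions chosen for illustration; they are not taken from the experiment.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Two 2x2 boxes offset by one unit: intersection area 1, union area 7.
overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))

# Hypothetical counts: 8 correct detections, 2 false alarms, 2 missed targets.
precision, recall = precision_recall(tp=8, fp=2, fn=2)
```

A detection is typically counted as correct only when its IOU with a ground-truth box and its PROB both exceed chosen thresholds, which is why the figures report precision and recall as surfaces over IOU and PROB.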

Fig. 11 is a schematic diagram of the detection effect of the target detection model provided in the embodiment of the present application, whose mean average precision (mAP) is 0.90227. Figs. 12 and 13 are schematic diagrams of the precision and the recall rate of the target detection model, respectively: when the IOU is about 0.5 and the PROB is 0.5-0.6, both the precision and the recall rate for aircraft detection and identification are greater than or equal to 0.8. Fig. 14 is a schematic diagram of the detection effect of the standard SSD detection network, whose mAP is 0.876962. Figs. 15 and 16 are schematic diagrams of the precision and the recall rate of the standard SSD detection network, respectively: when the IOU is about 0.4 and the PROB is 0.5, both the precision and the recall rate for aircraft detection and identification are above 0.8. As can be seen from Fig. 14, some targets located close to other targets are not detected, and the undetected targets are smaller in scale than their neighbors.

Summarizing the above experimental results and comparing the precision and the recall rate of aircraft target detection, the precision surface of the target detection model of the present application is flat over the interval of IOU 0.1-0.5 and PROB 0.5-0.8, and its recall surface is flat over the interval of IOU 0.0-0.5 and PROB 0.0-0.8, so the performance of the model on the data set is stable. Over the same IOU and PROB intervals, the precision and recall surfaces of the standard SSD model drop noticeably, and the performance of that model on the data set is mediocre. Compared with the basic SSD detection model, the target detection model has higher precision and is better suited to small targets.

It should be understood that the above is only an example, and the technical solution of the present application is not limited in any way, and those skilled in the art can make the setting based on the actual application, and the setting is not limited herein.

In this embodiment, in view of the large size and large information content of video satellite data, the consistency of the key frame and the frame to be detected is judged by the content discrimination network. If they are consistent, the target is searched for directly; if they are inconsistent, the target is identified by the detection network. This frame-skipping detection strategy increases the target detection speed. Meanwhile, the target detection model is obtained by improving the conventional SSD detection network, so that it is well suited to the multi-scale problem of multiple targets and detects targets at different scales. The hyper-parameter feature convolution layer in the target detection model enriches the high-layer features with bottom-layer features to facilitate target identification, and the deconvolution layer introduces additional context information from the previous scale when target detection is performed, which improves the accuracy of object detection. In other words, the hyper-parameter feature convolution layer and the deconvolution layer focus the model on the detection of small targets and thereby improve the overall detection accuracy. Furthermore, the reconstruction error of the deconvolution is used as part of the loss function, so that the quality of the deconvolution is monitored while the target is detected and the target detection precision is ensured.

Referring to fig. 3, based on the same inventive concept, an embodiment of the present application further provides an apparatus for detecting a target based on video satellite data, including:

the target detection model module is used for inputting the frame to be detected into a pre-trained target detection model for target identification if the content of the frame to be detected is inconsistent with that of the key frame, so as to obtain a target detection result of the frame to be detected; the target detection model is obtained based on training of an improved SSD detection network, and the improved SSD detection network comprises a hyper-parameter feature convolution layer and a deconvolution layer;

It should be noted that, in the present embodiment, each module in the target detection apparatus based on video satellite data corresponds to each step in the target detection method based on video satellite data in the foregoing embodiment one by one, and therefore, the specific implementation of the present embodiment may refer to the implementation of the target detection method based on video satellite data, and is not described herein again.

Furthermore, an embodiment of the present application further provides a computer device, which includes a processor, a memory, and a computer program stored in the memory; when the computer program is executed by the processor, the steps of the method in the foregoing embodiments are implemented.

In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories. The computer may be a variety of computing devices including intelligent terminals and servers.

In some embodiments, the executable instructions may be in the form of a program, software module, script, or code written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).

As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system comprising the element.

The above-mentioned serial numbers of the embodiments of the present application are merely for description, and do not represent the advantages and disadvantages of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can certainly also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, a magnetic disk, an optical disk) and includes instructions for enabling a multimedia terminal (e.g., a mobile phone, a computer, a television receiver, or a network device) to execute the method according to the embodiments of the present application.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims

1. A target detection method based on video satellite data is characterized by comprising the following steps:

performing consistency detection on a frame to be detected and a preset key frame based on a preset content discrimination network; the key frame comprises a marked target to be detected;

if the content of the frame to be detected is inconsistent with the content of the key frame, inputting the frame to be detected into a pre-trained target detection model for target identification to obtain a target detection result of the frame to be detected; the target detection model is obtained by training an improved SSD detection network, and the improved SSD detection network comprises a hyper-parameter feature convolution layer and a deconvolution layer; before this step, the method further comprises: acquiring a standard SSD detection network; scaling the convolutional layers Conv1, Conv2 and Conv3 of the SSD detection network, and unifying the features of the convolutional layers Conv1, Conv2 and Conv3 into a hyper-parameter feature convolution layer Conv4; performing feature pooling addition on the convolutional layers Conv5 and Conv6 to obtain convolutional layers Conv7 and Conv8, respectively; pooling the convolutional layer Conv8 to obtain a convolutional layer Conv9; performing deconvolution operations on the convolutional layer Conv9 and the convolutional layer Conv8 to obtain a deconvolution layer Deconv8 and a deconvolution layer Deconv9, respectively; obtaining the improved SSD detection network; and training the improved SSD detection network with the following loss function to obtain the target detection model:

wherein $x$ is the target discrimination variable, $c$ is the confidence, $l$ is the prediction box, $g$ is the ground-truth box, $N$ is the number of default boxes, $L_{conf}$ is the confidence loss function, $L_{loc}$ is the localization loss function, $\alpha$ is the weight of the localization loss function $L_{loc}$, $L_{quality}$ is the quality control loss function of the deconvolution layer Deconv8 and the deconvolution layer Deconv9, and $\beta$ is the weight of $L_{quality}$;

2. The method as claimed in claim 1, wherein the step of performing the consistency check on the frame to be detected and the preset key frame based on the preset content discrimination network further comprises:

acquiring a satellite video image with a label;

constructing the content discrimination network by using the satellite video image with the label; the content discrimination network comprises a parameter sharing convolution network and a full connection layer.

3. The method for detecting targets based on video satellite data according to claim 1, wherein after the step of inputting the frame to be detected into a pre-trained target detection model for target recognition to obtain the target detection result of the frame to be detected, if the content of the frame to be detected is inconsistent with the content of the key frame, the method further comprises:

and taking the frame to be detected as a new key frame.

4. The video satellite data-based target detection method of claim 1, wherein the target detection model comprises trained hyper-parametric feature convolutional layers, deconvolution layers, and convolutional layers; the step of inputting the frame to be detected into a pre-trained target detection model for target recognition to obtain a target detection result of the frame to be detected comprises the following steps:

inputting the frame to be detected;

5. The video satellite data-based object detection method of claim 1, wherein said step of obtaining said improved SSD detection network is followed by the step of:

acquiring video satellite data with a label;

6. The video satellite data-based object detection method of claim 1, wherein the expression of the quality control loss function is:

wherein $M_1$ and $M_2$ each denote the number of feature channels of a layer, and the remaining terms are, in order, the normalized result of the convolutional layer Conv7, the normalized result of the convolutional layer Conv8, the normalized result of the deconvolution layer Deconv9, and the normalized result of the deconvolution layer Deconv8.

7. An object detection device based on video satellite data, comprising:

the target detection model module is used for inputting the frame to be detected into a pre-trained target detection model for target identification if the content of the frame to be detected is inconsistent with that of the key frame, so as to obtain a target detection result of the frame to be detected; the target detection model is obtained by training an improved SSD detection network, and the improved SSD detection network comprises a hyper-parameter feature convolution layer and a deconvolution layer; before this step, the following is further included: acquiring a standard SSD detection network; scaling the convolutional layers Conv1, Conv2 and Conv3 of the SSD detection network to unify the features of the convolutional layers Conv1, Conv2 and Conv3 into a hyper-parameter feature convolution layer Conv4; performing feature pooling addition on the convolutional layers Conv5 and Conv6 to obtain convolutional layers Conv7 and Conv8, respectively; pooling the convolutional layer Conv8 to obtain a convolutional layer Conv9; performing deconvolution operations on the convolutional layer Conv9 and the convolutional layer Conv8 to obtain a deconvolution layer Deconv8 and a deconvolution layer Deconv9, respectively; obtaining the improved SSD detection network; and training the improved SSD detection network with the following loss function to obtain the target detection model:

8. A computer device, characterized in that it comprises a memory in which a computer program is stored and a processor which executes said computer program implementing a video satellite data based object detection method as claimed in any one of claims 1-6.