CN113160050A - Small target identification method and system based on space-time neural network - Google Patents
- Publication number
- CN113160050A (application number CN202110319609.5A)
- Authority
- CN
- China
- Prior art keywords
- gate
- time sequence
- neural network
- lstm
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T3/4053 — Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/25 — Fusion techniques
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08 — Learning methods
- G06T5/90 — Dynamic range modification of images or parts thereof
- G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a small target identification method and system based on a space-time neural network. The method comprises the following steps: preprocessing an original blurred image with a super-resolution algorithm to obtain a high-quality image sequence; performing a logical subtraction operation between adjacent frames of the high-quality image sequence with a spatio-temporal attention mechanism to capture and highlight a suspicious region; extracting depth features in the suspicious region to obtain a feature map time series; feeding the feature map time series into a confidence-output mapper implemented by an LSTM state-transition subnet to obtain a transition state; and classifying the transition state with a classifier to obtain a final recognition result, namely a target category and a confidence rate. As the frame sequence is read in continuously, the model self-corrects: early misclassifications are gradually revised to the correct category while the confidence rate keeps rising.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a small target identification method and system based on a space-time neural network.
Background
With the development of computer vision, target recognition has become a research hotspot and is widely applied in fields such as intelligent security, autonomous driving, and computer-aided medical diagnosis. In practical applications it is often unrealistic to require that targets be clear and easily identifiable, which has drawn increasing attention to small-target recognition in recent years. Several difficulties are common in real-world scenarios, such as small target size, long target distance, and low image-source resolution; these pose a serious challenge to traditional algorithms that use a single frame as the basis for identification.
Current general-purpose target recognition algorithms based on deep networks mostly adopt a mainstream deep network model as the backbone and automatic feature extractor, with a classifier producing the final recognition result. Because they are trained on datasets containing large numbers of images, these universal algorithms usually achieve good results on clearly identifiable targets. However, since convolution and similar operations are applied at various depths of the backbone, the feature resolution on the convolution channels inevitably decreases, so the performance of such algorithms degrades severely on small-target problems.
In recent years, related work on the small-target recognition problem has followed two lines. One idea is to improve the model itself: operations such as multi-scale feature fusion, receptive-field expansion, and the introduction of image context are used to strengthen the model's ability to recognize small targets. The other idea is to restore the image source: data enhancement, super-resolution processing, and similar techniques are used to recover a small target into a clearly recognizable signal as far as possible. Although both approaches have some effect, they operate only on single-frame images and therefore cannot meet the real-time and accuracy requirements of real scenes.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
To this end, an object of the present invention is to provide a small target identification method based on a spatio-temporal neural network.
Another objective of the present invention is to provide a small target recognition system based on spatiotemporal neural network.
In order to achieve the above object, an embodiment of the invention provides a small target identification method based on a spatiotemporal neural network, which includes the following steps: step S1, acquiring an original blurred image at the current moment; step S2, preprocessing the original blurred image with a super-resolution algorithm to obtain a high-quality image sequence; step S3, performing a logical subtraction operation between adjacent frames of the high-quality image sequence with a spatio-temporal attention mechanism to capture and highlight the suspicious region; step S4, extracting the depth features in the suspicious region to obtain a feature map time series; step S5, feeding the feature map time series into the confidence-output mapper implemented by an LSTM state-transition subnet to obtain a corrected feature map time series; and step S6, classifying the corrected feature map time series with a classifier to obtain a final recognition result, namely a target category and a confidence rate.
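Steps S1–S6 above can be sketched as a generic pipeline. The component functions below (`super_resolve`, `attention_diff`, `extract_features`, `lstm_step`, `classify`) are hypothetical stand-ins for the patent's modules, not implementations of them; the sketch only shows how the stages chain and how per-frame (category, confidence) results accumulate so that self-correction over time is visible.

```python
from typing import Callable


def recognize_small_target(
    frames,                        # raw (possibly blurred) frame sequence
    super_resolve: Callable,       # step S2: super-resolution preprocessing
    attention_diff: Callable,      # step S3: inter-frame logical subtraction
    extract_features: Callable,    # step S4: backbone feature extraction
    lstm_step: Callable,           # step S5: LSTM state-transition subnet
    classify: Callable,            # step S6: classifier head
):
    """Run the S1-S6 pipeline over a frame sequence, returning a
    (category, confidence) result per frame."""
    state = None      # recurrent state carried across time steps
    prev = None       # previous high-quality frame for differencing
    results = []
    for frame in frames:                                   # S1: read frame
        hq = super_resolve(frame)                          # S2
        roi = attention_diff(prev, hq) if prev is not None else hq  # S3
        feats = extract_features(roi)                      # S4
        state = lstm_step(feats, state)                    # S5
        results.append(classify(state))                    # S6
        prev = hq
    return results
```

A usage sketch with trivial numeric stubs shows the per-frame results accumulating as later frames refine the recurrent state.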
The small target identification method based on the space-time neural network solves the performance degradation caused by single-frame image target recognition. After the region containing the target is approximately locked, the visual capturer and computing resources are continuously concentrated on the suspicious target in that region, and the recognition confidence rate is gradually improved through continuous time-series image capture over a certain period. Meanwhile, as the model runs, some erroneous early conclusions are corrected, giving the model a degree of error-correction capability.
In addition, the small target identification method based on the spatiotemporal neural network according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, the LSTM state transition subnet section employs a significant variant LSTM of the RNN recurrent neural network as a main component, wherein a complete significant variant LSTM cell structure includes an input gate, an output gate, a gate, and a forgetting gate.
Further, in one embodiment of the present invention, the structure of a complete LSTM cell is:

i = σ(W_i · [h_(t−1), x_t] + b_i)
f = σ(W_f · [h_(t−1), x_t] + b_f)
o = σ(W_o · [h_(t−1), x_t] + b_o)
g = φ(W_g · [h_(t−1), x_t] + b_g)

where i is the input gate, f is the forget gate, o is the output gate, g is the cell (candidate) gate, the sigmoid function is σ(x) = 1/(1 + e^(−x)), φ(x) = (e^x − e^(−x))/(e^x + e^(−x)), W is a weight matrix, b is a bias vector, and x_t and h_(t−1) constitute the current input.
Further, in an embodiment of the present invention, the transition state is:

c_t = f ⊙ c_(t−1) + i ⊙ g
h_t = o ⊙ φ(c_t)

where h_t is the output state at the current time step, t is the time index, o is the output gate, c_t is the hidden state at the current time step, f is the forget gate, c_(t−1) is the hidden state at the previous time step, i is the input gate, and g is the cell (candidate) gate.
Further, in an embodiment of the present invention, any mainstream deep convolution model may be adopted as the backbone network in step S4 and step S6.
In order to achieve the above object, another embodiment of the present invention provides a small target recognition system based on a spatiotemporal neural network, including: an acquisition module, a super-resolution module, a spatio-temporal attention module, a feature extraction module, an LSTM state-transition subnet, and a classification module. The acquisition module acquires an original blurred image at the current moment; the super-resolution module preprocesses the original blurred image to obtain a high-quality image sequence; the spatio-temporal attention module performs a logical subtraction operation between adjacent frames of the high-quality image sequence to capture and highlight a suspicious region; the feature extraction module extracts the depth features in the suspicious region to obtain a feature map time series; the LSTM state-transition subnet feeds the feature map time series into the confidence-output mapper to obtain a corrected feature map time series; and the classification module classifies the corrected feature map time series to obtain a final recognition result, namely a category and a confidence rate.
The small target recognition system based on the spatiotemporal neural network solves the performance degradation caused by single-frame image target recognition. After the region containing the target is approximately locked, the visual capturer and computing resources are continuously concentrated on the suspicious target in that region, and the recognition confidence rate is gradually improved through continuous time-series image capture over a certain period. Meanwhile, as the model runs, some erroneous early conclusions are corrected, giving the system a degree of error-correction capability.
In addition, the small target recognition system based on the spatiotemporal neural network according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, the LSTM state transition subnet section employs a significant variant LSTM of the RNN recurrent neural network as a main component, wherein a complete significant variant LSTM cell structure includes an input gate, an output gate, a gate, and a forgetting gate.
Further, in one embodiment of the present invention, the structure of a complete LSTM cell is:

i = σ(W_i · [h_(t−1), x_t] + b_i)
f = σ(W_f · [h_(t−1), x_t] + b_f)
o = σ(W_o · [h_(t−1), x_t] + b_o)
g = φ(W_g · [h_(t−1), x_t] + b_g)

where i is the input gate, f is the forget gate, o is the output gate, g is the cell (candidate) gate, the sigmoid function is σ(x) = 1/(1 + e^(−x)), φ(x) = (e^x − e^(−x))/(e^x + e^(−x)), W is a weight matrix, b is a bias vector, and x_t and h_(t−1) constitute the current input.
Further, in an embodiment of the present invention, the transition state is:

c_t = f ⊙ c_(t−1) + i ⊙ g
h_t = o ⊙ φ(c_t)

where h_t is the output state at the current time step, t is the time index, o is the output gate, c_t is the hidden state at the current time step, f is the forget gate, c_(t−1) is the hidden state at the previous time step, i is the input gate, and g is the cell (candidate) gate.
Further, in an embodiment of the present invention, any mainstream deep convolution model may be used as the backbone network in the feature extraction module and the classification module.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a method for identifying small targets based on a spatiotemporal neural network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating the relationship between attention distribution and recognition accuracy for a specific target category according to an embodiment of the present invention, wherein (a) attention is dispersed and recognition is incorrect; (b) attention is concentrated and recognition is correct;
FIG. 3 is a schematic diagram of the construction of an LSTM cell unit according to one embodiment of the present invention;
FIG. 4 is a sample pictorial illustration of ATSETC4 of an embodiment of the present invention;
FIG. 5 is a schematic diagram of model self-correction capability according to one embodiment of the invention;
FIG. 6 is a diagram illustrating the processing of different size images by SRGAN in accordance with an embodiment of the present invention;
FIG. 7 is a structural diagram of a small target recognition system based on a spatiotemporal neural network according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The method and system for identifying small targets based on the spatio-temporal neural network proposed by the embodiments of the present invention will be described below with reference to the accompanying drawings, and first, the method for identifying small targets based on the spatio-temporal neural network proposed by the embodiments of the present invention will be described with reference to the accompanying drawings.
FIG. 1 is a flow chart of a small target identification method based on a spatiotemporal neural network according to an embodiment of the present invention.
As shown in fig. 1, the small target identification method based on the spatiotemporal neural network comprises the following steps:
in step S1, the original blurred image at the current time is acquired.
In step S2, the original blurred image is preprocessed by using a super-resolution algorithm to obtain a high-quality image sequence.
Specifically, a fully trained super-resolution algorithm is used to perform an initial enhancement of the original blurred image, yielding a data source with better image quality; any effective super-resolution method may be used here.
In step S3, a logical subtraction is performed between adjacent frames of the high-quality image sequence using a spatiotemporal attention mechanism to capture and highlight the suspicious region, so that the subsequent computing resources are more accurately allocated to the actual target.
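As an illustrative sketch (not the patent's exact operator), the inter-frame "logical subtraction" can be approximated by a thresholded absolute difference between adjacent frames; the `threshold` value is an assumed parameter not given in the text.

```python
import numpy as np


def suspicious_region_mask(prev_frame: np.ndarray, curr_frame: np.ndarray,
                           threshold: float = 0.1) -> np.ndarray:
    """Flag pixels that changed between adjacent frames.

    The absolute inter-frame difference stands in for the patent's
    'logical subtraction'; pixels whose change exceeds `threshold`
    are marked 1 as candidate suspicious-region locations.
    """
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    return (diff > threshold).astype(np.uint8)  # 1 where motion is suspected
```

In practice the binary mask would be dilated or box-fitted to delimit the suspicious region handed to the feature extractor; that post-processing is omitted here.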
Formally, the attention score Y of the model for a given target class is the relu of the inner product of the weights w and the feature maps A:

Y = relu( Σ_k w_k · A^k )    (1)

where A^k is the k-th feature map, w is the neural network model weight, Y is the attention distribution score of the model's inference process, and relu is the linear rectification function. The weight w_k is obtained from the model's gradients, specifically:

w_k = (1/Z) Σ_i Σ_j ∂Y/∂A^k_(i,j)    (2)

In equation (2), w_k is the gradient-weighted sum over each feature element, with Z the number of spatial positions. Combining equations (1) and (2) gives equation (3), the final form of the model's attention score for a fixed category:

Y = relu( Σ_k [ (1/Z) Σ_i Σ_j ∂Y/∂A^k_(i,j) ] · A^k )    (3)
in fact, the distribution of attention is closely related to the recognition accuracy, and as shown in fig. 2, when a false recognition occurs, the attention of the model becomes abnormally dispersed, whereas when the recognition is correct, the attention almost completely fits the target contour.
In step S4, the depth features in the suspicious region are extracted to obtain a feature map time series sequence.
That is, the suspicious region output by the spatio-temporal attention mechanism is accepted and its depth features are extracted as inputs to the LSTM state transition sub-network.
In step S5, the LSTM state-transition subnet feeds the feature map time series into the confidence-output mapper to obtain a corrected feature map time series.
Further, the LSTM state-transition subnet adopts LSTM, an important variant of the RNN recurrent neural network, as its main component. A traditional RNN unit is limited in how long it can store content, owing to the vanishing-gradient problem, and is not easy to train. As shown in fig. 3, LSTM is a variant of the recurrent neural network designed specifically to solve such problems: a complete LSTM cell structure includes an input gate, a forget gate, an output gate, and a cell (candidate) gate, which transmit the current hidden state to the next time step to participate in the fusion calculation while avoiding the memory-duration limit that the vanishing-gradient problem imposes on ordinary recurrent networks. The specific formulas are as follows:
i = σ(W_i · [h_(t−1), x_t] + b_i)
f = σ(W_f · [h_(t−1), x_t] + b_f)
o = σ(W_o · [h_(t−1), x_t] + b_o)
g = φ(W_g · [h_(t−1), x_t] + b_g)

where i is the input gate, f is the forget gate, o is the output gate, g is the cell (candidate) gate, the sigmoid function is σ(x) = 1/(1 + e^(−x)), φ(x) = (e^x − e^(−x))/(e^x + e^(−x)), W is a weight matrix, b is a bias vector, and x_t and h_(t−1) constitute the current input.
The output state and hidden state are computed as follows:

c_t = f ⊙ c_(t−1) + i ⊙ g
h_t = o ⊙ φ(c_t)

where h_t is the output state at the current time step, t is the time index, o is the output gate, c_t is the hidden state at the current time step, f is the forget gate, c_(t−1) is the hidden state at the previous time step, i is the input gate, and g is the cell (candidate) gate;
and then, correcting the characteristic diagram time sequence according to the output state and the hidden state.
In step S6, the classifier is used to classify the corrected feature map time series to obtain a final recognition result, namely a target category and a confidence rate.
Specifically, the time-series output of the LSTM cells (the corrected feature map time series) is fed into the classifier to obtain the final classification result.
It should be noted that, in step S4 and step S6, any one of the mainstream deep convolution models may be used as the backbone network, and similarly, any effective feature extractor and classifier may be substituted.
In addition, in order to carry out quantitative experiments, the invention constructs an aerial small-target sequence-image challenge dataset, ATSETC4, as a training basis for the deep network, addressing the prior lack of a dedicated dataset for training serialized neural networks. ATSETC4 contains 2400 video clips from real captures and network resources, covering numerous scenarios including wilderness, urban, virtual, and complex weather environments. As shown in fig. 4, four flying target categories are set in ATSETC4: birds, hot-air balloons, fixed-wing drones, and rotor drones. Six standard image-size subsets are also provided: 224 × 224, 112 × 112, 56 × 56, 28 × 28, 14 × 14, and 7 × 7, for multi-scale comparison testing (this arrangement accommodates the parameter requirements of networks with fully connected layers). In particular, each small-scale subset is sampled down from the large-scale subset, and original targets of different sizes are blended into the large-scale subset at initialization to increase the dataset's difficulty. The final ATSETC4 contains 2400 sequences, each 25 frames long, for a total of 60000 images. In the usual sense, a target of size 28 × 28 or smaller may be considered a small target.
Therefore, the specific working process of the small target identification method proposed by the present invention can be shown in table 1 below.
Furthermore, in practical tests the method shows a significant effect in identifying continuous target frames and a strong self-correction capability. As shown in fig. 5, transient misidentifications occur in the model's early recognition process owing to signal blur, small target size, complex backgrounds, and similar causes; as the frame sequence is read in continuously, the model self-corrects, gradually settling on the correct category while the confidence rate keeps improving.
The small target identification method based on the spatiotemporal neural network proposed by the present invention is further explained by a specific embodiment.
Firstly, the SRGAN super-resolution model with the trained model is adopted to directly process the image in the embodiment of the invention.
Next, a cross-entropy loss function is used as the optimization objective, and the minimum batch size is set to 16; specifically, one 25-frame sequence is taken as the minimum batch unit. The initial learning rate is 10^(−4), and the learning rate is decayed by a factor of 100 once the validation-set accuracy stops rising significantly. In particular, the fully connected layers of the model are regularized with dropout during training, with a dropout factor of 0.5 (i.e., part of the fully connected layer parameters are randomly frozen to prevent overfitting). In addition, the method mainly uses a VGG11 deep convolutional network pre-trained on ImageNet as the feature extractor, and most parameters of the convolutional backbone are kept frozen during training. ATSETC4 is partitioned into a training set and a test set at an 8:2 ratio. Finally, 55–65 training epochs are needed on average for subsets of different sizes, and training the model on one size subset takes 90 minutes on average. The experimental device is a single NVIDIA GTX 1080Ti GPU, and the machine-learning framework is PyTorch. The other models in the comparative experiments use their default parameters, and the test phase is still based on the ATSETC4 dataset provided by the invention.
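The learning-rate rule described above (start at 10^(−4), divide by 100 once validation accuracy stops rising significantly) can be sketched as a plain plateau check; the `patience` window used to decide that accuracy has stopped rising is an assumed detail the text does not specify.

```python
def next_learning_rate(lr: float, val_acc_history, patience: int = 3,
                       factor: float = 100.0) -> float:
    """Return the learning rate for the next epoch.

    If the best validation accuracy in the last `patience` epochs does
    not exceed the best accuracy seen before that window, the accuracy
    is considered to have plateaued and lr is divided by `factor`.
    """
    if len(val_acc_history) <= patience:
        return lr                          # not enough history to judge
    recent = val_acc_history[-patience:]
    best_before = max(val_acc_history[:-patience])
    if max(recent) <= best_before:         # no significant rise -> decay
        return lr / factor
    return lr
```

A real training loop would call this once per epoch after evaluating on the validation split, stopping training when lr underflows a minimum.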
In the experiments, the embodiment of the invention denotes the simplified spatio-temporal network (without the super-resolution module) as Simple_STNet, and the full model as STNet. Their performance is compared with several leading-edge target recognition networks in table 2 below.
Table 2: performance comparison of Simple_STNet, STNet, and various advanced recognition algorithms
From table 2, both the simplified Simple_STNet and the full STNet achieve the best performance on almost all size subsets of ATSETC4. The degradation of the full STNet at the 7-pixel scale occurs because, under the 32-fold down-sampling condition, super-resolution has exceeded its theoretical limit: the image restoration process introduces errors and performance drops. As shown in fig. 6, original images of different sizes and the SRGAN results are presented in three panels, from left to right: the 224-size high-definition image, the 7-size low-definition image, and the SRGAN result for the 7-size image.
Therefore, the small target identification method based on the spatiotemporal neural network provided by the embodiment of the invention solves the performance degradation caused by single-frame image target recognition. After the region containing the target is approximately locked, the visual capturer and computing resources are continuously concentrated on the suspicious target in that region, and the recognition confidence rate is gradually improved through continuous time-series image capture over a certain period. Meanwhile, as the model runs, some erroneous early conclusions are corrected, giving the model a degree of error-correction capability.
Next, a small object recognition system based on a spatiotemporal neural network proposed according to an embodiment of the present invention will be described with reference to the accompanying drawings.
FIG. 7 is a schematic structural diagram of a small target recognition system based on a spatiotemporal neural network according to an embodiment of the present invention.
As shown in fig. 7, the system 10 includes: the system comprises an acquisition module 100, a super-resolution module 200, a spatiotemporal attention module 300, a feature extraction module 400, an LSTM state transition subnet 500 and a classification module 600.
The obtaining module 100 is configured to obtain an original blurred image at the current time. The super-resolution module 200 is configured to pre-process the original blurred image to obtain a high-quality image sequence. The spatiotemporal attention module 300 is used to perform a logical subtraction operation between adjacent frames of the high-quality image sequence to capture and highlight suspicious regions. The feature extraction module 400 is configured to extract depth features in the suspicious region to obtain a feature-map time sequence. The LSTM state transition subnet 500 is used to feed the feature-map time sequence into a mapper with confidence output, yielding the transition state. The classification module 600 is configured to classify the transition state to obtain a final recognition result, where the final recognition result is a category and a confidence rate.
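Read end to end, the six modules form a straight pipeline. The following sketch illustrates only the data flow between modules 100-600; each module body is a deliberately simplified placeholder (the real SRGAN super-resolution network, convolutional backbone and LSTM subnet are not reproduced here, and all function names are illustrative assumptions):

```python
import numpy as np

def super_resolve(frame):
    # stand-in for the SRGAN-style super-resolution preprocessing
    return frame

def spatiotemporal_attention(prev_frame, frame):
    # "logical subtraction" between adjacent frames, highlighting
    # the suspicious (changing) region
    return np.abs(frame - prev_frame)

def extract_features(region):
    # stand-in for the deep-convolution backbone of module 400
    return region.mean(axis=(0, 1))

def recognize_sequence(frames):
    """Sketch of modules 100-600: acquire -> super-resolve -> attend ->
    extract features -> accumulate over time -> classify."""
    feature_seq = []
    prev = super_resolve(frames[0])                    # modules 100/200
    for raw in frames[1:]:
        hq = super_resolve(raw)                        # module 200
        region = spatiotemporal_attention(prev, hq)    # module 300
        feature_seq.append(extract_features(region))   # module 400
        prev = hq
    # module 500: the LSTM state-transfer subnet; a running mean over the
    # feature-map time sequence stands in for the recurrent state here
    state = np.mean(feature_seq, axis=0)
    # module 600: classification into (category, confidence rate)
    category = int(np.argmax(state))
    confidence = float(state.max() / (np.abs(state).sum() + 1e-9))
    return category, confidence
```

The point of the sketch is the ordering: attention operates on adjacent super-resolved frames, and only the attended regions are passed to feature extraction and the recurrent stage.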
Further, in one embodiment of the invention, the LSTM state transition subnet employs LSTM, an important variant of the RNN (recurrent neural network), as its main component, wherein a complete LSTM cell structure includes an input gate, an output gate, a cell gate and a forget gate.
Further, in one embodiment of the present invention, the structure of a complete LSTM cell is:

i = σ(W_i · [x_t, h_{t-1}] + b_i)
f = σ(W_f · [x_t, h_{t-1}] + b_f)
o = σ(W_o · [x_t, h_{t-1}] + b_o)
g = φ(W_g · [x_t, h_{t-1}] + b_g)

where i is the input gate, f is the forget gate, o is the output gate, g is the cell gate, the sigmoid function σ(x) = 1/(1 + e^{-x}), φ(x) = (e^x - e^{-x})/(e^x + e^{-x}), W is the weight matrix, b is the bias vector, and x_t and h_{t-1} together form the current input.
Further, in one embodiment of the present invention, the transition state is:
c_t = f ⊙ c_{t-1} + i ⊙ g
h_t = o ⊙ φ(c_t)

where h_t is the output state of the current time step, t is the time-step index, o is the output gate, c_t is the hidden state of the current time step, f is the forget gate, c_{t-1} is the hidden state of the previous time step, i is the input gate, and g is the cell gate.
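The four gate computations and the state transition can be sketched numerically. Below is a minimal NumPy illustration; the concatenated-input weight layout and the tensor shapes are demonstration assumptions, not details taken from the patent:

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def phi(x):
    # phi(x) = (e^x - e^{-x}) / (e^x + e^{-x}), i.e. tanh
    return np.tanh(x)

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step: gates i, f, o, g, then the state transition
    c_t = f * c_{t-1} + i * g and h_t = o * phi(c_t)."""
    z = np.concatenate([x_t, h_prev])   # current input [x_t, h_{t-1}]
    i = sigmoid(W["i"] @ z + b["i"])    # input gate
    f = sigmoid(W["f"] @ z + b["f"])    # forget gate
    o = sigmoid(W["o"] @ z + b["o"])    # output gate
    g = phi(W["g"] @ z + b["g"])        # cell gate
    c_t = f * c_prev + i * g            # hidden state of current step
    h_t = o * phi(c_t)                  # output state of current step
    return h_t, c_t
```

Iterating lstm_cell over the feature-map time sequence produces the transition state that the classification module consumes.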
Optionally, in an embodiment of the present invention, any deep convolution model may be adopted as the backbone network in the feature extraction module and the classification module.
It should be noted that the foregoing explanation of the embodiment of the small target identification method based on the spatio-temporal neural network is also applicable to the system, and is not repeated here.
According to the small target recognition system based on the space-time neural network provided by the embodiment of the invention, the recognition-performance degradation inherent in existing single-frame image target recognition is alleviated. After the region where the target is located is roughly locked, the visual capture device and the computing resources are concentrated continuously on the suspicious target in that region, and the recognition confidence rate is gradually improved through continuous time-sequence image capture over a period of time. Meanwhile, as the model keeps running, some erroneous conclusions reached in the early stage are corrected, so that the system has a certain error-correction capability.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (10)
1. A small target identification method based on a space-time neural network is characterized by comprising the following steps:
step S1, acquiring an original blurred image at the current moment;
step S2, preprocessing the original blurred image by using a super-resolution algorithm to obtain a high-quality image sequence;
step S3, performing a logical subtraction operation between adjacent frames of the high-quality image sequence by using a space-time attention mechanism, so as to capture and highlight the suspicious region;
step S4, extracting the depth features in the suspicious region to obtain a feature map time sequence;
step S5, inputting the feature-map time sequence into a mapper with confidence output by adopting an LSTM state transfer subnet, to obtain a corrected feature-map time sequence;
and step S6, classifying the corrected feature-map time sequence by using a classifier to obtain a final recognition result, wherein the final recognition result is a target category and a confidence rate.
2. The spatiotemporal neural network-based small target identification method according to claim 1, wherein the LSTM state transition subnet employs LSTM, an important variant of the RNN (recurrent neural network), as its main component, wherein a complete LSTM cell structure comprises an input gate, an output gate, a cell gate and a forget gate.
3. The spatiotemporal neural network-based small target identification method of claim 2, wherein the structure of the complete LSTM cell is:

i = σ(W_i · [x_t, h_{t-1}] + b_i)
f = σ(W_f · [x_t, h_{t-1}] + b_f)
o = σ(W_o · [x_t, h_{t-1}] + b_o)
g = φ(W_g · [x_t, h_{t-1}] + b_g)

wherein i is the input gate, f is the forget gate, o is the output gate, g is the cell gate, the sigmoid function σ(x) = 1/(1 + e^{-x}), φ(x) = (e^x - e^{-x})/(e^x + e^{-x}), W is the weight matrix, b is the bias vector, and x_t and h_{t-1} together form the current input.
4. The spatiotemporal neural network-based small target identification method according to claim 1, wherein the transition state is:
c_t = f ⊙ c_{t-1} + i ⊙ g
h_t = o ⊙ φ(c_t)

wherein h_t is the output state of the current time step, t is the time-step index, o is the output gate, c_t is the hidden state of the current time step, f is the forget gate, c_{t-1} is the hidden state of the previous time step, i is the input gate, and g is the cell gate.
5. The spatio-temporal neural network-based small object recognition method according to claim 1, wherein any deep convolution model is adopted as the backbone network in the steps S4 and S6.
6. A small target recognition system based on a spatiotemporal neural network, comprising: an acquisition module, a super-resolution module, a time-space attention module, a feature extraction module, an LSTM state transfer subnet and a classification module, wherein,
the acquisition module is used for acquiring an original blurred image at the current moment;
the super-resolution module is used for preprocessing the original blurred image to obtain a high-quality image sequence;
the space-time attention module is used for carrying out logic subtraction operation between adjacent frames of the high-quality image sequence, capturing and highlighting a suspicious region;
the feature extraction module is used for extracting the depth features in the suspicious region to obtain a feature map time sequence;
the LSTM state transfer subnet is used for inputting the feature-map time sequence into a mapper with confidence output, to obtain a corrected feature-map time sequence;
and the classification module is used for classifying the corrected feature-map time sequence to obtain a final recognition result, wherein the final recognition result is a category and a confidence rate.
7. The spatiotemporal neural network-based small target recognition system of claim 6, wherein the LSTM state transition subnet employs LSTM, an important variant of the RNN (recurrent neural network), as its main component, wherein a complete LSTM cell structure comprises an input gate, an output gate, a cell gate and a forget gate.
8. The spatiotemporal neural network-based small object recognition system of claim 7, wherein the structure of the complete LSTM cell is:

i = σ(W_i · [x_t, h_{t-1}] + b_i)
f = σ(W_f · [x_t, h_{t-1}] + b_f)
o = σ(W_o · [x_t, h_{t-1}] + b_o)
g = φ(W_g · [x_t, h_{t-1}] + b_g)

wherein i is the input gate, f is the forget gate, o is the output gate, g is the cell gate, the sigmoid function σ(x) = 1/(1 + e^{-x}), φ(x) = (e^x - e^{-x})/(e^x + e^{-x}), W is the weight matrix, b is the bias vector, and x_t and h_{t-1} together form the current input.
9. The spatiotemporal neural network-based small object recognition system of claim 6, wherein the transition state is:
c_t = f ⊙ c_{t-1} + i ⊙ g
h_t = o ⊙ φ(c_t)

wherein h_t is the output state of the current time step, t is the time-step index, o is the output gate, c_t is the hidden state of the current time step, f is the forget gate, c_{t-1} is the hidden state of the previous time step, i is the input gate, and g is the cell gate.
10. The spatiotemporal neural network-based small target recognition system of claim 6, wherein any deep convolution model is adopted as the backbone network in the feature extraction module and the classification module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110319609.5A CN113160050B (en) | 2021-03-25 | 2021-03-25 | Small target identification method and system based on space-time neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113160050A true CN113160050A (en) | 2021-07-23 |
CN113160050B CN113160050B (en) | 2023-08-25 |
Family
ID=76884634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110319609.5A Active CN113160050B (en) | 2021-03-25 | 2021-03-25 | Small target identification method and system based on space-time neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113160050B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116091428A (en) * | 2022-12-29 | 2023-05-09 | 国网电力空间技术有限公司 | High-precision intelligent power transmission line inspection image tower dividing method and system |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170262995A1 (en) * | 2016-03-11 | 2017-09-14 | Qualcomm Incorporated | Video analysis with convolutional attention recurrent neural networks |
CN108765394A (en) * | 2018-05-21 | 2018-11-06 | 上海交通大学 | Target identification method based on quality evaluation |
WO2019039157A1 (en) * | 2017-08-24 | 2019-02-28 | 日立オートモティブシステムズ株式会社 | Device and method for identifying region including small object around vehicle |
CN111402131A (en) * | 2020-03-10 | 2020-07-10 | 北京师范大学 | Method for acquiring super-resolution land cover classification map based on deep learning |
CN111524135A (en) * | 2020-05-11 | 2020-08-11 | 安徽继远软件有限公司 | Image enhancement-based method and system for detecting defects of small hardware fittings of power transmission line |
US20200279157A1 (en) * | 2017-10-16 | 2020-09-03 | Illumina, Inc. | Deep Learning-Based Techniques for Training Deep Convolutional Neural Networks |
CN111832509A (en) * | 2020-07-21 | 2020-10-27 | 中国人民解放军国防科技大学 | Unmanned aerial vehicle weak and small target detection method based on space-time attention mechanism |
CN112215119A (en) * | 2020-10-08 | 2021-01-12 | 华中科技大学 | Small target identification method, device and medium based on super-resolution reconstruction |
CN112288778A (en) * | 2020-10-29 | 2021-01-29 | 电子科技大学 | Infrared small target detection method based on multi-frame regression depth network |
Non-Patent Citations (7)
Title |
---|
ADITYA CHATTOPADHAY ET AL.: "Grad-CAM++: Generalized Gradient-based Visual Explanations for Deep Convolutional Networks", 2018 IEEE Winter Conference on Applications of Computer Vision, pages 841-843 *
DEZHONG XU ET AL.: "Visual Tracking by Combining the Structure-Aware Network and Spatial-Temporal Regression", 2018 24th International Conference on Pattern Recognition (ICPR) *
YUXIN PENG ET AL.: "Two-Stream Collaborative Learning With Spatial-Temporal Attention for Video Classification", IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 3, XP011714235, DOI: 10.1109/TCSVT.2018.2808685 *
Anonymous: "Long-term memory model based on deep learning", Shanghai Energy Conservation, 31 December 2020, page 304 *
JIANG Feng et al.: "A water surface moving target detection based on information fusion using deep learning", Journal of Physics: Conference Series, 31 August 2020 *
DU Shengdong et al.: "A traffic flow prediction model based on sequence-to-sequence spatiotemporal attention learning", Journal of Computer Research and Development *
YANG Jie: "Fundamentals of Artificial Intelligence", 30 April 2020, pages 132-135 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116091428A (en) * | 2022-12-29 | 2023-05-09 | 国网电力空间技术有限公司 | High-precision intelligent power transmission line inspection image tower dividing method and system |
CN116091428B (en) * | 2022-12-29 | 2023-09-01 | 国网电力空间技术有限公司 | High-precision intelligent power transmission line inspection image tower dividing method and system |
Also Published As
Publication number | Publication date |
---|---|
CN113160050B (en) | 2023-08-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |