CN115359240A - Small target detection method, device and equipment based on multi-frame image motion characteristics - Google Patents

Small target detection method, device and equipment based on multi-frame image motion characteristics Download PDF

Info

Publication number
CN115359240A
Authority
CN
China
Prior art keywords
candidate
image frame
search window
optical flow
small target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210831015.7A
Other languages
Chinese (zh)
Other versions
CN115359240B (en)
Inventor
陈凡浩
汪孝文
赖林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongkesichuang Cloud Intelligent Technology Co ltd
Original Assignee
Beijing Zhongkesichuang Cloud Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongkesichuang Cloud Intelligent Technology Co ltd filed Critical Beijing Zhongkesichuang Cloud Intelligent Technology Co ltd
Priority to CN202210831015.7A priority Critical patent/CN115359240B/en
Publication of CN115359240A publication Critical patent/CN115359240A/en
Application granted granted Critical
Publication of CN115359240B publication Critical patent/CN115359240B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a small target detection method, device, and equipment based on multi-frame image motion characteristics. The method comprises the following steps: acquiring a current image frame from video stream data, and cutting out each first candidate area containing a small target from the current image frame by adopting an updated search window, where the search window is updated according to search window coordinates determined from the previous image frame temporally adjacent to the current image frame; and screening out the area where the small target is located from the first candidate areas. In this method, the search window is updated using the search window coordinates determined from the previous, temporally adjacent image frame, and small target detection is then performed on the current image frame based on the updated search window, so that the small target can be detected accurately by exploiting the motion characteristics between adjacent image frames.

Description

Small target detection method, device and equipment based on multi-frame image motion characteristics
Technical Field
The present disclosure relates to the field of target detection technologies, and in particular, to a method, an apparatus, and a device for detecting a small target based on motion characteristics of multiple frames of images.
Background
The rapid rise of deep learning has set off a new wave of artificial intelligence, and deep-learning-based target detection has likewise made dramatic progress and is widely applied in traffic monitoring, intelligent security cameras, and biometric authentication scenarios.
Behind this prosperity, a careful study of existing target detection algorithms shows that their application scenarios are designed around fixed settings such as indoor or urban environments. In these static scenarios, the images used for target detection are clear, large numbers of images can easily be obtained repeatedly, and prior knowledge such as the shape, texture, and color of the object to be detected can be preset. In addition, the targets in such images are highly salient, occupy large areas, and have obvious 2D geometric characteristics such as color, contour, and lines, so they can be detected accurately by existing target detection algorithms.
Since the advent of unmanned aerial vehicles (UAVs), small object detection has attracted considerable attention in the field of computer vision. UAVs mainly operate in low-altitude airspace; with the development and exploitation of this airspace and continuing advances in technology, small UAVs of all kinds are increasing daily and their fields of application keep expanding.
When a detector such as an unmanned aerial vehicle films small targets moving on the ground or in the air (for example birds, other drones, or ground pedestrians), the targets are usually far from the camera, so they occupy only a few pixels in the image. For these small targets, not only are visual characteristics such as shape, texture, and color limited, but their motion may further deform their appearance in the image or leave only an afterimage, so even if the unmanned aerial vehicle carries a high-resolution camera, existing target detection algorithms have difficulty accurately detecting small targets in the video and image streams it returns.
Disclosure of Invention
In view of this, the present disclosure provides a small target detection method, apparatus and device based on multi-frame image motion characteristics, which can accurately detect a small target for a video stream and an image stream returned by an unmanned aerial vehicle.
According to a first aspect of the present disclosure, a small target detection method based on multi-frame image motion features is provided, including:
acquiring a current image frame from video stream data;
cutting out each first candidate area containing a small target from the current image frame by adopting an updated search window; the search window is updated according to search window coordinates determined from the previous image frame temporally adjacent to the current image frame;
and screening out the area where the small target is located from each first candidate area.
In one possible implementation manner, when updating the search window according to the search window coordinate determined by the previous image frame chronologically adjacent to the current image frame, the method includes:
acquiring second candidate areas containing the small targets cut out from the previous image frame;
extracting a first feature map corresponding to each second candidate region;
calculating optical flow time sequence characteristics corresponding to the second candidate areas according to the first characteristic graphs corresponding to the second candidate areas;
and determining the coordinates of the search window according to the optical flow time sequence characteristics corresponding to the second candidate areas, and updating the search window for cutting the previous image frame according to the coordinates of the search window.
In a possible implementation manner, when the previous image frame is an initial image frame, each second candidate region containing the small target is cut out from the previous image frame based on a preset initial search window.
In one possible implementation, when the search window is updated according to the search window coordinates determined by the initial image frame, the method includes:
acquiring second candidate areas containing the small targets cut out from the initial image frame;
extracting a second feature map corresponding to each second candidate region;
and determining the coordinates of the search window according to the second feature map corresponding to each second candidate region, and updating the search window for cutting the initial image frame according to the coordinates of the search window.
In a possible implementation manner, when the optical flow time-series features corresponding to the second candidate regions are calculated according to the first feature maps corresponding to the second candidate regions, the features are constructed based on a ConvLSTM neural network.
In one possible implementation manner, when determining the search window coordinate according to the optical flow time-series feature corresponding to each of the second candidate areas, the method includes:
calculating the confidence degree and the candidate search window coordinate corresponding to each second candidate area according to the optical flow time sequence feature corresponding to each second candidate area;
and determining the coordinates of the search window according to the confidence coefficient corresponding to each second candidate area and the coordinates of the candidate search window.
In a possible implementation manner, when the region where the small target is located is screened out from each of the first candidate regions, the method includes:
extracting a first feature map corresponding to each first candidate region;
calculating optical flow time sequence characteristics corresponding to the first candidate areas according to the first characteristic graphs corresponding to the first candidate areas;
performing point multiplication on the optical flow time sequence characteristics corresponding to the first candidate areas and the optical flow time sequence characteristics corresponding to the candidate areas cut out from the previous image frame to obtain candidate search window coordinates corresponding to the first candidate areas;
calculating the confidence degree corresponding to each first candidate area based on the optical flow time sequence characteristics corresponding to each first candidate area;
and screening out the region where the small target is located based on the candidate search window coordinate corresponding to each first candidate region and the confidence coefficient.
In one possible implementation manner, when performing point multiplication on the optical flow time-series feature corresponding to each first candidate area and the optical flow time-series feature corresponding to the candidate area cut out from the previous image frame to obtain a candidate search window coordinate corresponding to each first candidate area, the method includes:
performing point multiplication on the optical flow time sequence characteristics corresponding to the first candidate areas and the optical flow time sequence characteristics corresponding to the candidate areas cut out from the previous image frame to obtain a cross-correlation heat map corresponding to each first candidate area;
and calculating candidate search window coordinates corresponding to each first candidate region according to the cross-correlation heat map corresponding to each first candidate region.
According to a second aspect of the present disclosure, there is provided a small object detection apparatus based on motion characteristics of multiple frames of images, including:
the image frame acquisition module is used for acquiring a current image frame from video stream data;
the candidate area cutting module is used for cutting out each first candidate area containing the small target from the current image frame by adopting the updated search window; the search window is updated according to search window coordinates determined from the previous image frame temporally adjacent to the current image frame;
and the target detection module is used for screening out the area where the small target is located from each first candidate area.
According to a third aspect of the present disclosure, there is provided a small object detection apparatus based on a motion feature of a plurality of frames of images, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the method of the first aspect of the present disclosure.
In the method, the search window is updated by using the search window coordinates determined by the previous image frame adjacent to the current image frame in time sequence, and then the small target detection is carried out on the current image frame based on the updated search window, so that the small target can be accurately detected by combining the motion characteristics between the adjacent image frames.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a schematic flow chart of a small target detection method based on multi-frame image motion characteristics according to an embodiment of the present disclosure;
FIG. 2 shows a schematic network structure diagram of a second convolution module according to an embodiment of the present disclosure;
FIG. 3 illustrates a workflow diagram of a cross-correlation layer according to an embodiment of the present disclosure;
fig. 4 shows a schematic flow chart of an example of a small target detection method based on multi-frame image motion features according to an embodiment of the present disclosure;
FIG. 5 is a schematic block diagram of a small target detection device based on the motion characteristics of multiple frames of images according to an embodiment of the present disclosure;
fig. 6 shows a schematic block diagram of a small target detection device based on multi-frame image motion characteristics according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
< method example >
Fig. 1 shows a schematic flow chart of a small target detection method based on multi-frame image motion characteristics according to an embodiment of the present disclosure. As shown in fig. 1, the method includes steps S1100-S1300.
And S1100, acquiring the current image frame from the video stream data.
The video stream is composed of image frames acquired by the unmanned aerial vehicle according to a time sequence. After receiving a video stream returned by the unmanned aerial vehicle, reading image frames in the video stream in sequence and processing the image frames, wherein the current image frame is an image frame to be processed currently.
S1200, cutting out each first candidate area containing the small target from the current image frame by adopting the updated search window.
A small target is a detection target whose pixel-size distribution width in the image frame is smaller than a set width threshold. The detection target may be a small flying object (such as a bird or a drone) or a pedestrian on the ground, and is not particularly limited here. The set width threshold is typically 25 pixels. It should be noted that the small target detection method of the present disclosure is also applicable to detecting targets of normal size.
In the present disclosure, the small target may be searched and detected by using a Selective Search algorithm (Selective Search), or may be searched and detected by using an RCNN algorithm, which is not particularly limited herein. When the selective search algorithm is used for searching and detecting the small target, the search window is the search window used in the selective search algorithm. Similarly, when the RCNN algorithm is used for searching and detecting the small target, the search window is the search window used in the RCNN algorithm.
In the present disclosure, the search window for clipping the current image frame is obtained by updating the search window coordinates determined according to the previous image frame adjacent to the current image frame in time sequence. For example, in an implementation manner of searching and detecting a small target by using a selective search algorithm, the selective search algorithm is firstly adopted to detect the small target of a previous image frame adjacent to the current image frame in time sequence, and a new search window coordinate is determined according to the previous image frame; updating the search window in the selective search algorithm according to the new search window coordinate; and finally, cutting out each first candidate area containing the small target from the current image frame by adopting the updated search window.
For example, let the current image frame be t_n and its temporally adjacent previous frame be t_{n-1}. First, small target detection is performed on the previous image frame t_{n-1} using a search window with coordinates (x_{n-1}, y_{n-1}, w_{n-1}, h_{n-1}), and new search window coordinates (x_n, y_n, w_n, h_n) are determined from frame t_{n-1}. The search window with coordinates (x_n, y_n, w_n, h_n) is then used to cut out each first candidate region containing a small target from the current image frame t_n.
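As an illustration only (not the patent's implementation), this cropping step can be sketched in Python as follows; the frame layout, the `proposer` callable (for example a selective-search routine), and the function name are assumptions.

```python
from typing import Callable, List, Tuple

import numpy as np


def crop_candidate_regions(frame: np.ndarray,
                           window: Tuple[int, int, int, int],
                           proposer: Callable) -> List[np.ndarray]:
    """Restrict region proposal to the updated search window and return candidate crops."""
    x, y, w, h = window
    roi = frame[y:y + h, x:x + w]        # limit the search to the updated window
    boxes = proposer(roi)                # e.g. a selective-search routine returning (bx, by, bw, bh)
    # shift the proposals back to full-frame coordinates and cut out the candidate regions
    return [frame[y + by:y + by + bh, x + bx:x + bx + bw] for (bx, by, bw, bh) in boxes]
```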
In one possible implementation manner, when the search window is updated according to the search window coordinates determined by the previous image frame adjacent to the current image frame in time sequence, the specific implementation steps include S1211-S1214.
S1211, obtain the second candidate regions containing small targets cut out from the previous image frame. Continuing the example, the search window with coordinates (x_{n-1}, y_{n-1}, w_{n-1}, h_{n-1}) is used to cut out each second candidate region containing a small target from the previous image frame t_{n-1}.
And S1212, extracting the first feature map corresponding to each second candidate region.
In a possible implementation manner, when the first feature map corresponding to each second candidate region is extracted, the extraction may be implemented based on a first convolution layer that is constructed in advance. Specifically, the second candidate regions are input into the first convolution layer, respectively, so as to obtain the first characteristic diagram corresponding to the second candidate regions.
In a possible implementation manner, the first convolutional layer may be constructed by selecting a corresponding convolution algorithm according to the characteristics and distribution differences of the small target data. For example, when the small target is a drone, the yolov3_darknet detection algorithm may be selected to construct the first convolutional layer. As another example, when the small target is a bird, the yolov3_tiny detection algorithm may be selected to construct the first convolutional layer.
It should be noted that the yolov3_darknet detection algorithm contains 53 convolutional layers while the yolov3_tiny detection algorithm contains 24, so a first convolutional layer constructed from yolov3_darknet has stronger feature extraction capability than one constructed from yolov3_tiny, and is therefore better suited to extracting features of small targets such as UAVs, whose flight trajectories are irregular and whose flight speeds are high.
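A minimal sketch of this feature-extraction step, assuming the candidate crops have been resized to a common size and batched; the small CNN below is only a stand-in for the yolov3_darknet / yolov3_tiny feature extractor named above, not the patent's network, and the channel counts are illustrative.

```python
import torch
import torch.nn as nn


class BackboneStub(nn.Module):
    """Stand-in for the yolov3_darknet / yolov3_tiny feature extractor."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, out_channels, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
        )

    def forward(self, crops: torch.Tensor) -> torch.Tensor:
        # crops: (N, 3, H, W) batch of candidate regions resized to a common size
        return self.features(crops)      # first feature maps, one per candidate region
```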
S1213, calculate the optical flow time-series feature corresponding to each second candidate region based on the first feature map corresponding to each second candidate region.
In a possible implementation manner, when the optical flow time-series features corresponding to each second candidate region are calculated according to the first feature map corresponding to each second candidate region, the calculation may be implemented based on a second convolution layer constructed in advance. Specifically, the optical flow time-series features corresponding to the second candidate regions can be obtained by inputting the first feature maps corresponding to the second candidate regions into the second convolution layers, respectively.
In one possible implementation, the second convolutional layer may be constructed based on an LSTM (Long Short-Term Memory) network structure. Specifically, replacing the inner products in the LSTM network structure with convolutions yields the second convolutional layer of this disclosure; this replacement improves the feature extraction capability of the second convolutional layer.
In this implementation, the network structure diagram of the second convolutional layer may be as shown in fig. 2. Specifically, the calculation formula of the second convolution layer is as follows:
$i_t = \sigma(\omega_{xi} * x_t + \omega_{hi} * h_{t-1} + b_i)$,
$f_t = \sigma(\omega_{xf} * x_t + \omega_{hf} * h_{t-1} + b_f)$,
$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(\omega_{xc} * x_t + \omega_{hc} * h_{t-1} + b_c)$,
$o_t = \sigma(\omega_{xo} * x_t + \omega_{ho} * h_{t-1} + b_o)$,
$h_t = o_t \circ \tanh(c_t)$,
where $*$ denotes convolution and $\circ$ the element-wise (Hadamard) product; h_t (i.e., H_t in Fig. 2) is the optical flow time-series feature output by each layer of the network at time step t; h_{t-1} (i.e., H_{t-1} in Fig. 2) is the optical flow time-series feature output at time step t-1; x_t (i.e., X_t in Fig. 2) is the first feature map input to each layer at time step t; o_t (i.e., O_t in Fig. 2) is the state of the output gate at time step t; c_t (i.e., C_t in Fig. 2) is the state of the memory cell at time step t; c_{t-1} (i.e., C_{t-1} in Fig. 2) is the state of the memory cell at time step t-1; f_t is the state of the forget gate at time step t; i_t is the state of the input gate at time step t; and σ is the classification function, where σ may be 1/(1 + e^{-x}).
In another possible implementation, the second convolutional layer may also be constructed based on a ConvLSTM convolutional network. In this implementation, the calculation formula of the second convolutional layer is as follows:
$i_t = \sigma(\omega_{xi} * X_t + \omega_{hi} * H_{t-1} + \omega_{ci} \circ c_{t-1} + b_i)$,
$f_t = \sigma(\omega_{xf} * X_t + \omega_{hf} * H_{t-1} + \omega_{cf} \circ c_{t-1} + b_f)$,
$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(\omega_{xc} * X_t + \omega_{hc} * H_{t-1} + b_c)$,
$o_t = \sigma(\omega_{xo} * X_t + \omega_{ho} * H_{t-1} + \omega_{co} \circ c_t + b_o)$,
$H_t = o_t \circ \tanh(c_t)$,
where H_t is the optical flow time-series feature output by each layer of the network at time step t, H_{t-1} is the optical flow time-series feature output at time step t-1, X_t is the first feature map input to each layer at time step t, o_t is the state of the output gate at time step t, c_t is the state of the memory cell at time step t, c_{t-1} is the state of the memory cell at time step t-1, f_t is the state of the forget gate at time step t, i_t is the state of the input gate at time step t, and σ is the classification function, where σ may be 1/(1 + e^{-x}).
The Gated Recurrent Unit (GRU) can achieve an effect similar to that of the LSTM in network structure, but it has fewer gate structures and a simpler computation process, and empirically a GRU-based network converges more easily to the global optimum on some data sets. Thus, in yet another possible implementation, the second convolutional layer may also be constructed from GRUs.
In this implementation, the second convolutional layer is calculated as follows:
$z_t = \sigma(\omega_{xz} * x_t + \omega_{hz} * h_{t-1} + b_z)$,
$r_t = \sigma(\omega_{xr} * x_t + \omega_{hr} * h_{t-1} + b_r)$,
$\tilde{h}_t = \tanh(\omega_{xh} * x_t + \omega_{hh} * (r_t \circ h_{t-1}) + b_h)$,
$h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t$,
where h_t is the optical flow time-series feature output by each layer of the network at time step t, h_{t-1} is the optical flow time-series feature output at time step t-1, x_t is the first feature map input to each layer at time step t, z_t is the state of the update gate at time step t, r_t is the state of the reset gate at time step t, and σ is the classification function, where σ may be 1/(1 + e^{-x}).
And S1214, determining the coordinates of the search window according to the optical flow time sequence characteristics corresponding to the second candidate areas, and updating the search window for cutting the previous image frame according to the coordinates of the search window.
In one possible implementation, the steps S1214-1 to S1214-3 are included when determining the coordinates of the search window according to the optical flow time-series characteristics corresponding to the second candidate areas.
S1214-1, performing dot multiplication on the optical flow time-series feature corresponding to each second candidate region and the optical flow time-series feature corresponding to the candidate region cut out in the previous image frame to obtain candidate search window coordinates corresponding to each second candidate region.
It should be noted that the "previous image frame" in this step refers to the frame preceding the previous image frame t_{n-1}, that is, image frame t_{n-2}. Following steps S1211-S1213 above, a number of candidate regions containing small targets can be cut out from image frame t_{n-2} and their optical flow time-series features calculated and stored, so that when the optical flow time-series features corresponding to the second candidate regions are obtained, the pre-stored optical flow time-series features of the candidate regions cut out from t_{n-2} are read from memory and the dot-multiplication calculation of step S1214-1 is performed.
In one possible implementation, step S1214-1 may be implemented based on a cross-correlation layer. Referring to Fig. 3, after the optical flow time-series features corresponding to the second candidate regions are respectively input into the cross-correlation layer, the cross-correlation layer reads from memory the pre-stored optical flow time-series features of the candidate regions cut out from image frame t_{n-2} and performs point multiplication with the optical flow time-series features of each second candidate region, obtaining a cross-correlation heat map for each second candidate region. The cross-correlation heat map represents the correlation weights between the optical flow time-series features of a second candidate region and those of the candidate regions cut out from image frame t_{n-2}. The candidate search window coordinates corresponding to each second candidate region are then calculated according to its cross-correlation heat map.
In this implementation, the cross-correlation heat map is calculated as follows:
$C(p) = \sum_{q} \langle f(p+q),\, h(q) \rangle$,
where p is the coordinate of each pixel in the f domain, i.e., the optical flow time-series feature corresponding to the second candidate region; q is the coordinate of each pixel in the h domain, i.e., the optical flow time-series feature corresponding to a candidate region cut out from image frame t_{n-2}; and C(p) is the cross-correlation heat map. The inner product is used here as the similarity measure.
In this implementation, the candidate search window coordinates are calculated as follows:
$P_{target} = \arg\max_{p} C(p)$,
where C(p) is the cross-correlation heat map and P_{target} is the candidate search window coordinate.
For example, suppose the second candidate regions cut out from the previous image frame t_{n-1} are s_1, s_2, ..., s_s, with corresponding optical flow time-series features p_1, p_2, ..., p_s, and the candidate regions cut out from image frame t_{n-2} are s_1', s_2', ..., s_s', with corresponding optical flow time-series features q_1, q_2, ..., q_s. Taking the second candidate region s_1 as an example, step S1214-1 is explained below.
Specifically, p_1 and q_1 are substituted into the calculation formula of the cross-correlation heat map to obtain the cross-correlation heat map C_1(p) between the optical flow time-series feature p_1 and the optical flow time-series feature q_1; C_1(p) is then substituted into the candidate search window coordinate formula to obtain the candidate search window coordinate P_{1target} corresponding to the second candidate region s_1.
The candidate search window coordinates corresponding to each second candidate region can be obtained by referring to the above method, which is not described herein again.
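A hedged sketch of this dot-multiplication step: one stored feature map from frame t_{n-2} is treated as a correlation template slid over a second candidate region's feature map, and the argmax of the resulting heat map gives the peak position. Tensor shapes and the function name are assumptions, not the patent's API.

```python
import torch
import torch.nn.functional as F


def cross_correlation_peak(f: torch.Tensor, h: torch.Tensor):
    """f: (C, Hf, Wf) features of a second candidate region;
    h: (C, Hh, Wh) stored features of a candidate region from frame t_{n-2}."""
    # C(p) = sum_q <f(p+q), h(q)>: a sliding inner product, i.e. a 2D cross-correlation
    heatmap = F.conv2d(f.unsqueeze(0), h.unsqueeze(0)).squeeze(0).squeeze(0)
    flat_idx = torch.argmax(heatmap).item()
    py, px = divmod(flat_idx, heatmap.shape[-1])     # P_target = argmax_p C(p)
    return heatmap, (px, py)
```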
S1214-2, calculating the confidence corresponding to each second candidate area based on the optical flow time-series characteristics corresponding to each second candidate area.
In one possible implementation, step S1214-2 may be implemented based on a full connectivity layer. Specifically, the optical flow time-series features corresponding to the second candidate regions are respectively input to the full connection layer, so that the confidence degrees corresponding to the second candidate regions can be calculated.
S1214-3, determining the coordinate of the search window based on the coordinate of the candidate search window corresponding to each second candidate area and the confidence coefficient. Specifically, the candidate search window coordinate corresponding to the second candidate region with the highest confidence is used as the final determined search window coordinate.
After the new coordinate of the search window is determined, the search window for cutting the previous image frame can be updated to obtain the search window for cutting the current image frame.
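The confidence scoring and window selection of steps S1214-2 and S1214-3 might look like the following sketch, where `fc` stands for the fully connected scoring layer and `cand_windows` holds the coordinates produced by the cross-correlation step; both names and the flattening of the feature maps are illustrative assumptions.

```python
from typing import List, Tuple

import torch


def update_search_window(features: List[torch.Tensor],
                         cand_windows: List[Tuple[int, int, int, int]],
                         fc: torch.nn.Module) -> Tuple[int, int, int, int]:
    # confidence of each second candidate region from its (flattened) time-series feature
    scores = [torch.sigmoid(fc(f.flatten())).item() for f in features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return cand_windows[best]            # coordinates used to update the search window
```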
In one possible implementation, when the previous image frame is the initial image frame (i.e., the first image frame t_0 in the video stream), each second candidate region containing the small target is cut out from the initial image frame based on a preset initial search window. The size of the initial search window may be set according to the specific application scenario; for example, it may be set to [[255,144],[54,78],[84,112],[87,887]].
In this implementation, steps S1221-S1223 are included in updating the search window based on the search window coordinates determined from the initial image frame.
And S1221, cutting out each second candidate area containing the small target from the initial image frame. Specifically, the initial search window is used to cut out each second candidate area containing the small target from the initial image frame t_0.
S1222, extract a second feature map corresponding to each second candidate region.
In a possible implementation manner, the second feature map corresponding to each second candidate region may be extracted based on a pre-constructed third convolutional layer. The third convolutional layer has the same network structure as the first convolutional layer and shares its weights, which is not described again here.
And S1223, determining the coordinates of the search window according to the second feature maps corresponding to the second candidate areas, and updating the initial search window according to the coordinates of the search window. Here, the step may refer to step S1214, which is not described herein again.
S1300, screening out the area where the small target is located from each first candidate area.
In one possible implementation, steps S1310-S1350 are included when the region where the small target is located is screened out from the first candidate regions.
S1310, a first feature map corresponding to each first candidate region is extracted. For the specific steps, refer to step S1212, which is not described herein again.
S1320, based on the first feature map corresponding to each first candidate region, the optical flow time-series feature corresponding to each first candidate region is calculated. For the specific steps, refer to step S1213, which is not described herein again.
S1330, performing dot multiplication on the optical flow time-series feature corresponding to each first candidate region and the optical flow time-series feature corresponding to the candidate region cut out from the previous image frame to obtain candidate search window coordinates corresponding to each first candidate region. For the specific steps, see step S1214-1, which is not described herein again.
S1340, calculating the confidence degree corresponding to each first candidate area based on the optical flow time-series characteristics corresponding to each first candidate area. For the specific steps, refer to step S1214-2, which is not described herein again.
And S1350, screening out the area where the small target is located based on the candidate search window coordinates and the confidence degrees corresponding to the first candidate areas. Specifically, a region within the candidate search window coordinates corresponding to the first candidate region with the highest confidence is taken as the region where the small target is located.
In the disclosure, a search window is updated by using search window coordinates determined by a previous image frame adjacent to a current image frame in time sequence, and then small target detection is performed on the current image frame based on the updated search window, so that accurate detection of the small target can be realized by combining motion characteristics between adjacent image frames.
< method example >
Fig. 4 shows a schematic flow chart of an example of a small target detection method based on the motion characteristics of multiple frames of images according to an embodiment of the present disclosure. As shown in fig. 4, the method steps include:
First, an image frame t_{n-1} is obtained from the video stream data, where n-1 is not less than 2.
Second, the search window with coordinates (x_{n-1}, y_{n-1}, w_{n-1}, h_{n-1}) is used to cut out each second candidate region containing a small target from the image frame t_{n-1}.
Third, the second candidate regions are input into the first convolutional layer (i.e., the upper single-frame convolutional layer (A)) to extract the first feature map corresponding to each second candidate region.
Fourth, the first feature maps corresponding to the second candidate regions are input into the second convolutional layer (i.e., the ConvLSTM layer (B) for multi-frame representation) to calculate the optical flow time-series features corresponding to the second candidate regions.
Fifth, the optical flow time-series features corresponding to the second candidate regions are input into the cross-correlation layer (i.e., the cross-correlation layer (C) for localization) and point-multiplied with the optical flow time-series features corresponding to the candidate regions cut out from image frame t_{n-2}, yielding the candidate search window coordinates corresponding to each second candidate region. Meanwhile, the optical flow time-series features corresponding to the second candidate regions are input into the fully connected layer (i.e., the fully connected layer (D) for object scoring) to calculate the confidence corresponding to each second candidate region.
Sixth, the second candidate region with the highest confidence is selected, the new search window coordinates (x_n, y_n, w_n, h_n) are determined from the candidate search window coordinates corresponding to that region, and the search window is updated based on the new coordinates.
Seventh, the search window with coordinates (x_n, y_n, w_n, h_n) is used to cut out each first candidate region containing a small target from the current image frame t_n.
And eighthly, screening out the area where the small target is located from each first candidate area.
It should be noted that the features of this example have already been explained in detail in the foregoing embodiment, so their description is not repeated here.
It should be noted that the two single-frame convolutional layers (A), the ConvLSTM layer (B) for multi-frame representation, the cross-correlation layer (C) for localization, and the fully connected layer (D) for object scoring in Fig. 4 together constitute a Motion Feature Extraction network (MFE-net) for executing the small target detection method of this example.
Before the small target detection method of this example is executed, training data needs to be constructed in advance, and the motion feature extraction network is trained on that data.
Specifically, a pointing camera is installed at a wind power plant and videos are recorded during the daytime (8…) for 14 days. From the recorded videos, the 3 days of footage captured by the unmanned aerial vehicle in which small targets such as birds and pedestrians appear relatively frequently (including footage shot under challenging weather and illumination changes in a complex environment) are selected, the small targets in the video image frames are annotated, and the annotated frames are used as training data for the motion feature extraction network. These videos are stored in MP4 format with a single file size of 128 GB. Despite the high resolution, compression noise is visible on fast-moving objects in the images. In addition to the recorded videos, images of small targets such as birds, drones, and ground pedestrians with different size and speed distributions can be crawled from the web using big data techniques, annotated, and used as training data for the motion feature extraction network.
Training the motion feature extraction network on the training data comprises the following steps:
First, pre-trained weights obtained by training the above-mentioned single-frame convolutional layer (A) on the ILSVRC2012-CLS dataset are obtained, and the parameters of the two single-frame convolutional layers (A) in the motion feature extraction network are initialized with these weights. Second, the parameters of the two initialized single-frame convolutional layers (A) are fine-tuned on the constructed training data. Third, the motion feature extraction network shown in Fig. 4 is built from the two fine-tuned single-frame convolutional layers (A) and trained on the constructed training data, which further fine-tunes the parameters of the two single-frame convolutional layers (A) and completes the training of the motion feature extraction network.
To improve efficiency, the cropped search windows are stored on disk during training, so that the regions of interest do not have to be re-cropped from the image frames during training, thereby reducing disk access.
In one possible implementation, the total number of training iterations is 40,000, the batch size is 5, the initial learning rate is set to 0.01, and the learning rate is reduced by a factor of 0.1 every 10,000 iterations. The loss function is the sigmoid cross-entropy commonly used for detection, and backpropagation uses Caffe's stochastic gradient descent (SGD) solver.
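For illustration, the stated schedule could be reproduced roughly as follows. The patent trains with Caffe's SGD solver, so the PyTorch code below is only an analogy; the momentum value and the data-loader interface are assumptions.

```python
import torch
import torch.nn as nn


def train(mfe_net: nn.Module, data_loader, iters: int = 40000):
    opt = torch.optim.SGD(mfe_net.parameters(), lr=0.01, momentum=0.9)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10000, gamma=0.1)
    loss_fn = nn.BCEWithLogitsLoss()         # sigmoid cross-entropy used for detection scoring
    it = 0
    while it < iters:
        for crops, labels in data_loader:    # pre-cropped search windows read from disk
            logits = mfe_net(crops)          # objectness score per candidate region
            loss = loss_fn(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()                     # 0.1x decay every 10,000 iterations
            it += 1
            if it >= iters:
                break
```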
< apparatus embodiment >
Fig. 5 shows a schematic block diagram of a small target detection device based on the motion characteristics of multiple frames of images according to an embodiment of the present disclosure. As shown in fig. 5, the small object detection apparatus 100 includes:
an image frame obtaining module 110, configured to obtain a current image frame from the video stream data.
A candidate region cutting module 120, configured to cut out, from the current image frame, each first candidate region including a small target by using the updated search window; the search window is updated according to search window coordinates determined from the previous image frame temporally adjacent to the current image frame.
And the target detection module 130 is configured to screen out a region where the small target is located from each first candidate region.
In one possible implementation, the object detection module 130 includes a first convolution layer, a second convolution layer, and a search window update layer, and when the search window is updated according to search window coordinates determined from the previous image frame temporally adjacent to the current image frame:
a candidate region clipping module 120, configured to obtain second candidate regions including small objects clipped from a previous image frame;
the first convolution layer is used for extracting a first feature map corresponding to each second candidate area;
a second convolution layer for calculating optical flow time-series characteristics corresponding to each second candidate area by using the first characteristic diagram corresponding to each second candidate area as input data;
and the search window updating layer is used for determining the coordinates of the search window according to the optical flow time sequence characteristics corresponding to the second candidate areas and updating the search window used for cutting the previous image frame according to the coordinates of the search window.
In one possible implementation manner, when the previous image frame is an initial image frame, each second candidate region containing a small target is cut out from the previous image frame based on a preset initial search window.
In one possible implementation, the target detection module 130 further includes a third convolution layer that, when the search window is updated according to the search window coordinates determined from the initial image frame:
a candidate region clipping module 120, configured to obtain second candidate regions including small objects clipped from the initial image frame;
a third convolution layer for extracting a second feature map corresponding to each second candidate region;
and the search window updating layer is used for determining the coordinates of the search window according to the second feature map corresponding to each second candidate area and updating the search window used for cutting the initial image frame according to the coordinates of the search window.
In one possible implementation, the second convolutional layer is constructed based on a ConvLSTM neural network.
In a possible implementation manner, the search window updating layer, when determining the coordinates of the search window according to the optical flow time-series feature corresponding to each second candidate area, is specifically configured to:
calculating the confidence coefficient and the candidate search window coordinate corresponding to each second candidate area according to the optical flow time sequence characteristic corresponding to each second candidate area;
and determining the coordinates of the search window according to the confidence degrees corresponding to the second candidate areas and the coordinates of the candidate search window.
In one possible implementation, the target detection module 130 further includes a cross-correlation layer, a fully connected layer, and a screening layer, and when the region where the small target is located is screened out from each first candidate region:
the first convolutional layer is used for extracting a first feature map corresponding to each first candidate region;
the second convolution layer is used for calculating the optical flow time sequence characteristics corresponding to each first candidate area based on the first characteristic graph corresponding to each first candidate area;
the cross-correlation layer is used for performing point multiplication on the optical flow time sequence characteristics corresponding to the first candidate areas and the optical flow time sequence characteristics corresponding to the candidate areas cut out from the previous image frame to obtain candidate search window coordinates corresponding to the first candidate areas;
the full connection layer is used for calculating the confidence degree corresponding to each first candidate area based on the optical flow time sequence characteristics corresponding to each first candidate area;
and the screening layer is used for screening out the area where the small target is located based on the candidate search window coordinate and the confidence coefficient corresponding to each first candidate area.
In a possible implementation manner, the cross-correlation layer, when performing point multiplication on the optical flow time-series feature corresponding to each first candidate region and the optical flow time-series feature corresponding to the candidate region cut out from the previous image frame to obtain a candidate search window coordinate corresponding to each first candidate region, is specifically configured to: perform point multiplication on the optical flow time-series features corresponding to each first candidate region and the optical flow time-series features corresponding to the candidate regions cut out from the previous image frame to obtain a cross-correlation heat map corresponding to each first candidate region; and calculate candidate search window coordinates corresponding to each first candidate region according to the cross-correlation heat map corresponding to each first candidate region.
< apparatus embodiment >
Fig. 6 shows a schematic block diagram of a small target detection device based on the motion characteristics of multiple frames of images according to an embodiment of the present disclosure. As shown in fig. 6, the small object detection device 200 includes a processor 210 and a memory 220 for storing instructions executable by the processor 210. Wherein the processor 210 is configured to implement any of the foregoing small object detection methods when executing the executable instructions.
Here, it should be noted that the number of the processors 210 may be one or more. Meanwhile, in the small object detecting apparatus 200 of the embodiment of the present disclosure, an input device 230 and an output device 240 may be further included. The processor 210, the memory 220, the input device 230, and the output device 240 may be connected via a bus, or may be connected via other methods, which is not limited in detail herein.
The memory 220, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and various modules, such as the programs or modules corresponding to the small target detection method of the embodiments of the present disclosure. The processor 210 executes the various functional applications and data processing of the small object detection apparatus 200 by running the software programs or modules stored in the memory 220.
The input device 230 may be used to receive an input number or signal. Wherein the signal may be a key signal generated in connection with user settings and function control of the device/terminal/server. The output device 240 may include a display device such as a display screen.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A small target detection method based on multi-frame image motion characteristics is characterized by comprising the following steps:
acquiring a current image frame from video stream data;
cutting out each first candidate area containing a small target from the current image frame by adopting an updated search window; the search window is updated according to search window coordinates determined from the previous image frame temporally adjacent to the current image frame;
and screening out the area where the small target is located from each first candidate area.
2. The method of claim 1, wherein updating the search window based on search window coordinates determined from a previous image frame chronologically adjacent to the current image frame comprises:
acquiring second candidate areas containing the small targets cut out from the previous image frame;
extracting a first feature map corresponding to each second candidate region;
calculating optical flow time sequence characteristics corresponding to the second candidate areas according to the first characteristic graphs corresponding to the second candidate areas;
and determining the coordinates of the search window according to the optical flow time sequence characteristics corresponding to the second candidate areas, and updating the search window for cutting the previous image frame according to the coordinates of the search window.
3. The method according to claim 1, wherein when the previous image frame is an initial image frame, each second candidate region containing the small target is cropped from the previous image frame based on a preset initial search window.
4. The method of claim 3, wherein updating the search window based on search window coordinates determined from the initial image frame comprises:
acquiring second candidate areas containing the small targets cut out from the initial image frame;
extracting a second feature map corresponding to each second candidate region;
and determining the coordinates of the search window according to a second feature map corresponding to each second candidate region, and updating the search window for cutting the initial image frame according to the coordinates of the search window.
5. The method according to claim 2, wherein the optical flow temporal features corresponding to each of the second candidate regions are constructed based on a ConvLSTM neural network when the optical flow temporal features corresponding to each of the second candidate regions are calculated according to the first feature map corresponding to each of the second candidate regions.
6. The method according to claim 2, wherein determining the search window coordinates based on the optical-flow time-series feature corresponding to each of the second candidate regions includes:
calculating the confidence coefficient and the candidate search window coordinate corresponding to each second candidate area according to the optical flow time sequence characteristic corresponding to each second candidate area;
and determining the coordinates of the search window according to the confidence coefficient corresponding to each second candidate area and the coordinates of the candidate search window.
7. The method according to any one of claims 2 to 6, wherein screening out, from the first candidate regions, the region where the small target is located comprises:
extracting a first feature map corresponding to each of the first candidate regions;
calculating an optical flow temporal feature corresponding to each of the first candidate regions according to the first feature map corresponding to that first candidate region;
performing a dot product between the optical flow temporal feature corresponding to each of the first candidate regions and the optical flow temporal feature corresponding to the candidate region cropped from the previous image frame, to obtain candidate search window coordinates corresponding to each of the first candidate regions;
calculating a confidence corresponding to each of the first candidate regions based on the optical flow temporal feature corresponding to that first candidate region; and
screening out the region where the small target is located based on the candidate search window coordinates and the confidence corresponding to each of the first candidate regions.
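The sketch below strings the steps of claim 7 together for the current frame: each first candidate region's optical flow temporal feature is cross-correlated (the claimed dot product) with the previous frame's feature to localize candidate window coordinates, and a confidence head picks the region where the small target lies. `cross_correlate` and `heatmap_to_coords` are the hypothetical helpers sketched after claim 8, and the 32 by 32 default window size is likewise an assumption.

```python
import torch

def screen_small_target(curr_feats, prev_feat, cross_correlate, conf_head,
                        heatmap_to_coords, window_size=(32, 32)):
    """Claim-7 sketch: screen the small-target region among the first candidate regions."""
    # curr_feats: (N, C, h, w) optical flow temporal features of the first candidate regions;
    # prev_feat:  (C, kh, kw) temporal feature of the matching region from the previous frame.
    coords, confidences = [], []
    for feat in curr_feats:                                      # one candidate region at a time
        heatmap = cross_correlate(feat, prev_feat)               # cross-correlation heat map
        coords.append(heatmap_to_coords(heatmap, window_size))   # candidate window coordinates
        confidences.append(torch.sigmoid(conf_head(feat.flatten().unsqueeze(0))).item())
    best = int(torch.tensor(confidences).argmax())               # most confident candidate region
    return coords[best], confidences[best]
```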
8. The method according to claim 7, wherein performing the dot product between the optical flow temporal feature corresponding to each of the first candidate regions and the optical flow temporal feature corresponding to the candidate region cropped from the previous image frame, to obtain the candidate search window coordinates corresponding to each of the first candidate regions, comprises:
performing the dot product between the optical flow temporal feature corresponding to each of the first candidate regions and the optical flow temporal feature corresponding to the candidate region cropped from the previous image frame, to obtain a cross-correlation heat map corresponding to each of the first candidate regions; and
calculating the candidate search window coordinates corresponding to each of the first candidate regions according to the cross-correlation heat map corresponding to that first candidate region.
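Claim 8 refines the dot product into a cross-correlation heat map from which the candidate window coordinates are read off. A common way to compute such a map, familiar from Siamese trackers, is a depthwise (grouped) 2-D convolution with the previous-frame feature acting as the kernel; the sketch below assumes that interpretation and is not a definitive implementation of the claim.

```python
import torch
import torch.nn.functional as F

def cross_correlate(curr_feat, prev_feat):
    """Depthwise cross-correlation of one current-frame feature map with the matching
    previous-frame feature map; returns a single-channel heat map."""
    # curr_feat: (C, h, w) search feature; prev_feat: (C, kh, kw) template feature.
    response = F.conv2d(curr_feat.unsqueeze(0),        # (1, C, h, w)
                        prev_feat.unsqueeze(1),        # (C, 1, kh, kw), one kernel per channel
                        groups=curr_feat.size(0))      # depthwise correlation -> (1, C, h', w')
    return response.sum(dim=1, keepdim=True)           # (1, 1, h', w') cross-correlation heat map

def heatmap_to_coords(heatmap, window_size):
    """Take the heat-map peak as the centre of the candidate search window."""
    _, _, hh, hw = heatmap.shape
    idx = int(torch.argmax(heatmap.view(-1)))
    cy, cx = divmod(idx, hw)                            # row/column of the peak response
    w, h = window_size
    return (cx - w // 2, cy - h // 2, w, h)             # (x, y, w, h) candidate window coordinates
```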
9. A small target detection device based on multi-frame image motion characteristics, characterized by comprising:
an image frame acquisition module, configured to acquire a current image frame from video stream data;
a candidate region cropping module, configured to crop, from the current image frame, first candidate regions containing a small target by using an updated search window, wherein the search window is updated according to search window coordinates determined from a previous image frame temporally adjacent to the current image frame; and
a target detection module, configured to screen out, from the first candidate regions, the region where the small target is located.
10. Small target detection equipment based on multi-frame image motion characteristics, characterized by comprising:
a processor; and
a memory configured to store instructions executable by the processor;
wherein the processor is configured to implement the method according to any one of claims 1 to 8 when executing the executable instructions.
CN202210831015.7A 2022-07-15 2022-07-15 Small target detection method, device and equipment based on multi-frame image motion characteristics Active CN115359240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210831015.7A CN115359240B (en) 2022-07-15 2022-07-15 Small target detection method, device and equipment based on multi-frame image motion characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210831015.7A CN115359240B (en) 2022-07-15 2022-07-15 Small target detection method, device and equipment based on multi-frame image motion characteristics

Publications (2)

Publication Number Publication Date
CN115359240A 2022-11-18
CN115359240B 2024-03-15

Family

ID=84032084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210831015.7A Active CN115359240B (en) 2022-07-15 2022-07-15 Small target detection method, device and equipment based on multi-frame image motion characteristics

Country Status (1)

Country Link
CN (1) CN115359240B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296743A (en) * 2016-08-23 2017-01-04 常州轻工职业技术学院 Adaptive moving target tracking method and unmanned aerial vehicle tracking system
US20190304105A1 (en) * 2018-04-03 2019-10-03 Altumview Systems Inc. High-performance visual object tracking for embedded vision systems
CN110502962A (en) * 2018-05-18 2019-11-26 翔升(上海)电子技术有限公司 Target detection method, device, equipment and medium in a video stream
CN108898624A (en) * 2018-06-12 2018-11-27 浙江大华技术股份有限公司 Moving object tracking method and apparatus, electronic device and storage medium
CN109102523A (en) * 2018-07-13 2018-12-28 南京理工大学 Moving object detection and tracking method
CN109949340A (en) * 2019-03-04 2019-06-28 湖北三江航天万峰科技发展有限公司 Target scale adaptive tracking method based on OpenCV
CN111161311A (en) * 2019-12-09 2020-05-15 中车工业研究院有限公司 Visual multi-target tracking method and device based on deep learning
CN112509004A (en) * 2020-12-07 2021-03-16 北京集光通达科技股份有限公司 Target searching method and device in infrared thermal imaging image
CN114022759A (en) * 2021-09-30 2022-02-08 北京临近空间飞行器系统工程研究所 Airspace finite pixel target detection system and method fusing neural network space-time characteristics
CN113936036A (en) * 2021-10-08 2022-01-14 中国人民解放军国防科技大学 Target tracking method and device based on unmanned aerial vehicle video and computer equipment
CN114743124A (en) * 2022-01-27 2022-07-12 西北工业大学 Real-time target tracking method for missile-borne platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WAN M, ET AL.: "In-frame and inter-frame information based infrared moving small target detection under complex cloud backgrounds", INFRARED PHYSICS & TECHNOLOGY, 31 May 2016 (2016-05-31), pages 455 - 467 *
ZHENG MAOKAI: "Research on Moving Target Recognition and Tracking Methods", China Master's Theses Full-text Database - Information Science and Technology, 15 August 2019 (2019-08-15), pages 17 - 56 *

Also Published As

Publication number Publication date
CN115359240B (en) 2024-03-15

Similar Documents

Publication Publication Date Title
CN108346159B (en) Tracking-learning-detection-based visual target tracking method
Liu et al. Convolutional neural network-based transfer learning for optical aerial images change detection
US9454819B1 (en) System and method for static and moving object detection
Marcu et al. SafeUAV: Learning to estimate depth and safe landing areas for UAVs from synthetic data
CN109815843B (en) Image processing method and related product
US11042991B2 (en) Determining multiple camera positions from multiple videos
CN102156995A (en) Video motion foreground segmentation method under a moving camera
CN103093198B (en) Crowd density monitoring method and device
WO2009152509A1 (en) Method and system for crowd segmentation
CN109063549B (en) High-resolution aerial video moving target detection method based on deep neural network
US20230281913A1 (en) Radiance Fields for Three-Dimensional Reconstruction and Novel View Synthesis in Large-Scale Environments
CN104766065A (en) Robust foreground detection method based on multi-view learning
CN107194948B (en) Video saliency detection method based on integrated prediction and spatio-temporal domain propagation
WO2023142912A1 (en) Method and apparatus for detecting left behind object, and storage medium
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
Subash et al. Object detection using Ryze Tello drone with help of mask-RCNN
Leinonen et al. Unsupervised classification of snowflake images using a generative adversarial network and K-medoids classification
Xu et al. Robust moving objects detection in long-distance imaging through turbulent medium
CN111260687A (en) Aerial video target tracking method based on semantic perception network and correlation filtering
Babu et al. ABF de-hazing algorithm based on deep learning CNN for single I-Haze detection
CN117336526A (en) Video generation method and device, storage medium and electronic equipment
Yang et al. Sar images target detection based on yolov5
CN115359240B (en) Small target detection method, device and equipment based on multi-frame image motion characteristics
CN115527050A (en) Image feature matching method, computer device and readable storage medium
Kaur Background subtraction in video surveillance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant