CN109190581B - Image sequence target detection and identification method - Google Patents

Image sequence target detection and identification method

Info

Publication number
CN109190581B
CN109190581B (application number CN201811080439.4A)
Authority
CN
China
Prior art keywords
image
convolution
feature map
convolution feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811080439.4A
Other languages
Chinese (zh)
Other versions
CN109190581A (en)
Inventor
龚如宾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinling Institute of Technology
Original Assignee
Jinling Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinling Institute of Technology filed Critical Jinling Institute of Technology
Priority to CN201811080439.4A priority Critical patent/CN109190581B/en
Publication of CN109190581A publication Critical patent/CN109190581A/en
Application granted granted Critical
Publication of CN109190581B publication Critical patent/CN109190581B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image detection and recognition and can be used for autonomous driving, visual target detection and recognition for unmanned aerial vehicles, cartoon video detection, and related image sequence processing. The input image is first transformed by motion-compensation-based warped-image computation (Image Warping); the warped image is fed into a deep convolutional neural network to extract a deep convolution feature map; on this basis the deep convolution feature maps of the preceding and following images in the sequence are fused; and the fused deep convolution feature map is finally used by target detection modules such as target localization, target recognition, or region proposal. The main technical features of the invention are a front-end module that computes motion-compensation-based warped images and a back-end module that fuses the deep convolution feature maps of the preceding and following images. The invention addresses the low detection accuracy, or even complete detection failure, caused in single images by partial occlusion, motion blur, and large deformation of the detection target.

Description

Image sequence target detection and identification method
Technical Field
The invention belongs to the technical field of image detection and recognition, and in particular relates to a method for detecting and recognizing targets in image sequences.
Background
In the field of image understanding, deep learning is widely used for image classification, image target detection, face recognition, and related tasks, and network models represented by Faster R-CNN achieve good results in image target detection. In video applications, deep learning is widely used in driverless cars, video surveillance, target detection and recognition for unmanned aerial vehicles, and similar fields. In these scenes, partial occlusion of the target, blur caused by fast motion, large target deformation, and similar factors often lead to low detection accuracy or outright detection failure. In the detection and recognition of cartoon videos, the exaggerated drawing of cartoon characters prevents stable extraction of low-level features, and cartoon images are drawn in many different styles, so understanding and recognizing cartoon images also remains a major challenge.
As shown in fig. 1, network models represented by Faster R-CNN are widely applied to single-image target detection: a convolution feature map is extracted from the input image by a convolutional network, and region proposal, classification, and position regression are then performed on that feature map, see U.S. patent reference [1]. When such a method is deployed in a real application environment, however, motion blur, reduced image quality, partial occlusion of the detection target, and large deformation of the detection target cause part of the input image's information to be lost and corrupted by noise, so the traditional Faster R-CNN algorithm fails to detect the target, which leads to various losses in practice.
Reference [2] uses a convolution feature map fusion technique but does not apply a motion-compensation-based image transform to the input image. Although reference [2] applies a motion-compensation-based transform to the convolution feature maps, its motion information is derived from the original input image sequence: the spatial dimensions of a convolution feature map are usually reduced by a factor of 16 or more relative to the original image, the pixel-level optical flow field cannot be converted exactly into an optical flow field on the feature map, and an approximation is used to turn pixel-level motion vectors into motion vectors between the spatial units of the feature map, so a large amount of detailed motion information is lost. Motion-compensating the convolution feature maps with imprecise motion vectors affects detection and recognition accuracy, which reference [2] does not analyze.
The invention extends existing single-image target detection algorithms based on deep convolutional neural networks to video image sequences. By computing the optical flow field between the images of the sequence, it obtains an accurate motion vector for every pixel of each image and can therefore use fine-grained motion detail to transform and correct the images. Convolution feature maps are computed for the transformed, corrected images, all feature maps are then fused, and target detection and recognition are finally carried out on the fused convolution feature map. The method alleviates the lack of information in a single image caused by occlusion, motion blur, large target deformation, and the like, and improves detection accuracy, thereby reducing economic and social losses.
References
1. Jian Sun, Beijing (CN); Ross Girshick, Seattle, WA (US); Shaoqing Ren, Beijing (CN); Kaiming He, Beijing (CN). OBJECT DETECTION AND CLASSIFICATION IN IMAGES. United States Patent Application Publication, Pub. No.: US 2017/0206431 A1.
2. Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, Yichen Wei. Flow-Guided Feature Aggregation for Video Object Detection. The IEEE International Conference on Computer Vision (ICCV), 2017.
3. A. Dosovitskiy et al. FlowNet: Learning Optical Flow with Convolutional Networks. IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, pp. 2758-2766, 2016.
4. E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Disclosure of Invention
Current target detection algorithms based on single-image deep convolutional neural networks cannot detect targets with high accuracy under motion blur, reduced image quality, partial occlusion of the detection target, large deformation of the detection target, and similar conditions; the invention is directed at detecting targets with high accuracy in these situations.
The technical solution of the invention comprises the following steps:
S1: input the image sequence to be processed, extract images containing the target object, and calculate the optical flow fields between the current frame and the adjacent frames of the image sequence;
S2: calculate the warped image of each adjacent frame relative to the current frame according to the motion vectors of the optical flow field;
S3: input the warped images of the current frame and the adjacent frames into a convolutional neural network to compute convolution feature maps;
S4: after the convolution feature maps of the current frame and the adjacent frames are obtained, fuse them to obtain the fused convolution feature map;
S5: perform target detection and recognition on the fused convolution feature map and output the detection and recognition results.
Advantageous effects
The invention is characterized by:
(1) Using the motion-compensation-based warped image as the input of the deep convolutional neural network that extracts the deep convolution feature map;
(2) Fusing at the level of the deep convolution feature maps;
The combination of these two techniques produces a more robust fused convolution feature map and improves detection and recognition accuracy.
Drawings
FIG. 1 is a schematic diagram of single-image Faster R-CNN;
FIG. 2 shows the temporal extension of Faster R-CNN;
FIG. 3 shows computing the warped image of an adjacent frame relative to the current frame from the motion vectors of the optical flow field;
FIG. 4 shows using warped images for feature extraction and fusion based on the deep convolution feature maps;
FIG. 5 shows similarity computation after transforming the convolution feature maps;
FIG. 6 compares the detection effect on a test example;
FIG. 7 is a flow chart of the training module;
FIG. 8 defines the detection network;
FIG. 9 is a flow chart of convolution feature map fusion.
Detailed Description
Building on earlier investigation and inspired by multi-obstacle detection for automatic driving, the single-image target detection algorithm represented by Faster R-CNN is extended to detect targets in the image sequences of a video. The model architecture is shown in fig. 2. The lowest layer is the individual frames of the video; each frame is fed into a shared convolution network to compute its convolution feature map (Feature Map), and the feature map of the current frame is then fused with the feature maps of several preceding and following frames by a fusion algorithm. The resulting fused convolution feature map is fed into the region proposal network (Region Proposal Network); the RPN slides a window over the fused feature map to generate region proposals. The RPN output consists of a classification result, which separates objects into foreground and background, and a regression result, which marks the possible positions of objects. In the training stage, the IoU (Intersection over Union) between the regressed object regions and the ground-truth object regions is evaluated; foreground regions whose IoU exceeds a threshold are passed to RoI Pooling (Region of Interest Pooling), which extracts high-quality features from the convolution feature map according to the region proposals, and these features are finally used for object classification and position regression.
Because objects in the images are moving, a moving object appears at different positions in the image at different times; simple weighted-sum fusion would therefore produce ghosting at those different positions. The invention instead applies a motion-compensation-based warping transform (Image Warping) to the input image to obtain a motion-compensated warped image, and then feeds the warped image into the convolutional neural network to obtain the convolution feature map (Feature Map). The specific implementation is as follows:
step 1: inputting an image sequence to be processed, and calculating optical flow fields between adjacent frames and current frames of the image sequence:
when an object in space moves or a camera moves, the optical flow of the motion of the object or the motion of the camera is reflected on two continuous images taken. On a two-dimensional image, optical flow may be represented using 2D motion vectors, representing movement from a first frame to a second inter-frame point. The set of motion vectors for each pixel in the two images is called the optical flow field. The optical flow field has wide application in the fields of video image motion 3D restoration, image compression, image enhancement and the like. The adjacent frames of the image sequence refer to the previous K frames of the current frame and the next K frames of the current frame, 2K frames are taken as the adjacent frames, K is a positive integer, and the value range is [1,20].
To compute the warped image from the motion information of the target object, a deep-convolutional-network technique, FlowNet (see references [3, 4]), is used here to compute the optical flow field (Optical Flow) between the current frame and an adjacent frame. For input images I_j and I_i, the optical flow field between the two images is predicted by equation (1); the specific network architecture can be FlowNet2-s or FlowNet2 from reference [4]. Here θ denotes the parameters to be learned in the deep convolutional network; adjusting θ gives the network the ability to predict the optical flow field.
M_{j→i} = OpticalFlowNet(θ, I_j, I_i)    (1)
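As a concrete illustration of equation (1), the sketch below predicts a dense optical flow field with a pretrained RAFT model from torchvision standing in for the FlowNet2 network named in the text; the function name optical_flow_field and the choice of RAFT are assumptions for illustration only, and PyTorch is used here for brevity even though the prototype described later in this document is Keras-based.

```python
# Minimal sketch of equation (1), assuming PyTorch/torchvision, with RAFT as a
# stand-in for the learned OpticalFlowNet(theta, I_j, I_i) of the text.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
flow_net = raft_large(weights=weights).eval()   # the learned parameters theta
preprocess = weights.transforms()

@torch.no_grad()
def optical_flow_field(frame_i: torch.Tensor, frame_j: torch.Tensor) -> torch.Tensor:
    """Predict M_{j->i}: per-pixel (dx, dy) offsets mapping frame-i positions into frame j.

    frame_i, frame_j: (B, 3, H, W) float tensors in [0, 1]; H and W should be
    multiples of 8 for RAFT.
    """
    img_i, img_j = preprocess(frame_i, frame_j)
    # RAFT returns a list of iteratively refined flow fields; keep the final one.
    return flow_net(img_i, img_j)[-1]            # (B, 2, H, W)
```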
Step 2: calculate the warped image of each adjacent frame relative to the current frame according to the motion vectors of the optical flow field:
The optical flow field between adjacent images, M_{j→i} = OpticalFlowNet(I_j, I_i), is computed to obtain the motion offsets of the pixels of image I_j relative to image I_i. The warped image is then obtained by bilinear-interpolation image warping (Image Warping) according to these per-pixel motion offsets, as shown in equation (2), where BiLinearWarp denotes the bilinear interpolation used to generate the warped image I^warp_{j→i}:
I^warp_{j→i} = BiLinearWarp(I_j, M_{j→i})    (2)
The bilinear interpolation is illustrated in fig. 3. For a pixel (x, y) of the predicted image I_pred, the optical flow field M_{j→i} gives the corresponding point (x+Δx, y+Δy) in the original image I. This point generally does not fall on an integer pixel position but inside the rectangle enclosed by four real pixels Pa, Pb, Pc, Pd, as shown in fig. 3; its color value is therefore computed from the four neighboring pixel values by bilinear interpolation:
I_pred(x, y) = w_a·Pa + w_b·Pb + w_c·Pc + w_d·Pd,  where w_a, w_b, w_c, w_d are the standard bilinear weights determined by the fractional position of (x+Δx, y+Δy) inside the rectangle and sum to 1.
Although bilinear interpolation is used here to compute the warped image, other spatial interpolation transforms such as nearest-neighbor interpolation, Lanczos interpolation, and bicubic interpolation also fall within the scope of the invention.
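A minimal sketch of the BiLinearWarp operation of equation (2), implemented with torch.nn.functional.grid_sample; the function name bilinear_warp and the tensor layout are assumptions, and the other interpolation modes mentioned above can be obtained by changing the mode argument.

```python
import torch
import torch.nn.functional as F

def bilinear_warp(frame_j: torch.Tensor, flow_i_to_j: torch.Tensor) -> torch.Tensor:
    """Backward-warp frame j onto the current frame i's pixel grid (equation (2)).

    frame_j:      (B, C, H, W) image (or feature map) of an adjacent frame.
    flow_i_to_j:  (B, 2, H, W) per-pixel offsets (dx, dy); pixel (x, y) of the
                  warped result is sampled from frame_j at (x + dx, y + dy).
    """
    B, _, H, W = frame_j.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=frame_j.dtype, device=frame_j.device),
        torch.arange(W, dtype=frame_j.dtype, device=frame_j.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W), (x, y) order
    coords = base + flow_i_to_j                        # sampling positions in frame j
    # grid_sample expects coordinates normalised to [-1, 1], shape (B, H, W, 2)
    grid_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(frame_j, grid, mode="bilinear", align_corners=True)
```

Because the sampling is bilinear, the warp is differentiable with respect to both the image values and the flow field, which is exactly the property relied on later for end-to-end training.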
Step 3: input each warped image into a convolutional neural network to compute a convolution feature map:
A convolution feature map is extracted from each warped image: the motion-compensation-based warped image of an adjacent frame relative to the current frame is computed and then fed into a convolutional neural network to obtain a convolution feature map (Feature Map), as shown in equation (3). Because the warped image uses the motion information provided by the optical flow field to motion-compensate (Motion Compensation) the pixels of moving objects, extracting features from warped images rather than from the original frames means that a given moving object generally appears at the same position in the warped versions of the preceding and following frames; its convolution features therefore also appear at the same position, which lays the foundation for the subsequent feature fusion.
F_j = ConvNet(I^warp_{j→i})    (3)
Equation (3) extracts the convolution feature map F_j from the warped image I^warp_{j→i} with the convolutional neural network ConvNet. On this basis, as shown in fig. 4, the convolution feature maps {F_j} are fused to compute the fused convolution feature map.
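As a sketch of equation (3), the shared ConvNet can be any backbone that reduces the spatial resolution by a factor of 16; the truncated torchvision ResNet-50 below (through layer3) is an assumption that matches the 1/16 reduction mentioned later, not a backbone prescribed by the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Shared backbone: ResNet-50 truncated after layer3 yields stride-16 feature maps
# with 1024 channels (an assumption; any shared ConvNet fits equation (3)).
backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-3]).eval()

@torch.no_grad()
def conv_feature_map(warped_frame: torch.Tensor) -> torch.Tensor:
    """Equation (3): F_j = ConvNet(warped image), shape (B, 1024, H/16, W/16)."""
    return backbone(warped_frame)
```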
Step 4: after the convolution feature maps of the current frame and the adjacent frames are obtained, fuse the convolution feature maps to obtain the fused convolution feature map:
different methods can be used for the fusion algorithm of the convolution feature map, and the fusion algorithm of two convolution features is listed below.
(1) Convolution feature map fusion algorithm 1
The fused convolution feature map is computed by equation (4), where the weights must satisfy Σ_{j=i-K}^{i+K} w_j = 1. Regarding the weight w_j: in general, the closer a frame is to the current frame, the stronger the correlation between its targets and those of the current frame, and the larger the corresponding weight w_j; the farther a frame is from the current frame, the weaker that correlation and the smaller w_j. The weights w_j can be set by normalizing a Gaussian(0, σ) function; because the input warped images have already been motion-compensated, the detection result is insensitive to the Gaussian standard deviation σ.
F_fused = Σ_{j=i-K}^{i+K} w_j · F_j    (4)
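A minimal sketch of fusion algorithm 1 / equation (4): the 2K+1 feature maps are weighted with a normalized Gaussian(0, σ) over the frame offset and summed. The helper name gaussian_fuse and the default σ are assumptions for illustration.

```python
import torch

def gaussian_fuse(feature_maps: list, sigma: float = 1.0) -> torch.Tensor:
    """Fusion algorithm 1 (equation (4)).

    feature_maps: 2K+1 convolution feature maps ordered [F_{i-K}, ..., F_i, ..., F_{i+K}],
    each of shape (B, C, h, w), all computed from motion-compensated warped images.
    """
    K = (len(feature_maps) - 1) // 2
    offsets = torch.arange(-K, K + 1, dtype=torch.float32)
    w = torch.exp(-0.5 * (offsets / sigma) ** 2)   # Gaussian(0, sigma) over the frame offset
    w = w / w.sum()                                # normalise so the weights sum to 1
    return sum(w_j * F_j for w_j, F_j in zip(w, feature_maps))
```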
(2) Convolution feature map fusion algorithm 2
In fusion algorithm 1 every frame is weighted with a single per-frame weight, which is not appropriate for the whole frame when part of it is occluded or strongly deformed. For example, a static background is insensitive to the weight coefficient, but for a partially occluded object in a frame the occluded part should receive a lower weight, so that the un-occluded object information from the preceding and following frames is fully used during fusion. The second fusion algorithm therefore assigns weights at the level of convolution feature map units. Assume the convolution feature map F_j is a tensor of shape [w, h, depth], where w and h are the width and height of the spatial domain of the feature map (for a ResNet-50 residual network they may be 1/16 of the original image width and height) and depth is the number of channels; a convolution feature map unit is the 1x1xdepth vector at a given spatial position, for example the vector at position p in fig. 5. For position p of the j-th frame's feature map, the weight is set by the coefficient weightMap[j][p]; different positions p receive different weights weightMap[j][p] when computing the fused feature map, and equation (5) must hold:
Σ_{j=i-K}^{i+K} weightMap[j][p] = 1,  for every position p    (5)
In this embodiment the weights are computed as follows. The convolution feature map F_j, a tensor of shape [w, h, depth], is transformed by an embedding convolutional neural network with the following structure:
(1) a 1x1xnc convolution for dimensionality reduction;
(2) a 3x3xnc convolution;
(3) a 1x1xdepth convolution that restores the original channel dimension.
Here nc is the reduced channel count of the embedding network, chosen as a power of two (e.g. 512), so that the embedding network's weights can be learned with the usual deep-network learning algorithms. The embedding network transforms the original convolution feature map F_j into a transformed feature map that is better suited for measuring the similarity between convolution feature map units of adjacent frames; this similarity is denoted SimMap[j; i], as shown in fig. 5.
The similarity is defined according to the following principles: 1) if two units come from the same part of the same target, they present similar colors to the camera and have similar convolution feature values, and are given a high similarity value; if they come from different targets or from different parts of the same target, the colors generally differ considerably, the convolution feature values differ as well, and a low similarity value is given; 2) the similarity value is symmetric. The invention does not restrict the similarity computation method; any method satisfying these principles may be used.
The similarity value can be measured, for example, by the cosine distance between the two vectors, the Euclidean distance, the Pearson correlation, or the Minkowski distance; the similarity measure includes, but is not limited to, these methods. For example, when j = i the similarity SimMap[j; i][p] should take its maximum value; with the cosine measure, SimMap[i; i][p] = 1. If a measure is used for which a smaller distance means a greater similarity, such as the Euclidean and Minkowski distances, its negative can be taken for the subsequent computation.
After SimMap has been computed, for each position p the similarity values can be normalized to [0, 1] with a SoftMax function to obtain the weightMap, as shown in equation (6):
weightMap[j][p] = SoftMax({SimMap[j; i][p]}),  i-K ≤ j ≤ i+K    (6)
After the weightMap has been calculated, the fused convolution feature map is computed from it, as shown in equation (7):
F_fused[p] = Σ_{j=i-K}^{i+K} weightMap[j][p] · F_j[p]    (7)
In a convolution feature map fused this way, feature map units that come from different object image regions (for example because of partial occlusion) receive lower weight coefficients and contribute less to the fused map, so the fused convolution feature map is less affected by occlusion, large target deformation, and the like.
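The sketch below illustrates fusion algorithm 2 under stated assumptions: the embedding network follows the 1x1 / 3x3 / 1x1 structure above, cosine similarity is used as the SimMap measure, and the SoftMax of equation (6) is taken over the 2K+1 frames at each spatial position. Names such as EmbeddingNet and adaptive_fuse are illustrative, not from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Embedding network for similarity measurement: 1x1xnc -> 3x3xnc -> 1x1xdepth."""
    def __init__(self, depth: int, nc: int = 512):
        super().__init__()
        # The activation choice is an assumption; the patent only gives the convolution shapes.
        self.net = nn.Sequential(
            nn.Conv2d(depth, nc, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(nc, nc, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(nc, depth, kernel_size=1),
        )

    def forward(self, x):
        return self.net(x)

def adaptive_fuse(feature_maps: list, embed: EmbeddingNet, current_index: int) -> torch.Tensor:
    """Fusion algorithm 2: per-unit weights from cosine similarity + SoftMax (eqs (5)-(7)).

    feature_maps: 2K+1 maps [F_{i-K}, ..., F_{i+K}], each (B, depth, h, w).
    current_index: index of the current frame i inside the list.
    """
    embedded = [F.normalize(embed(Fj), dim=1) for Fj in feature_maps]  # unit-length unit vectors
    e_i = embedded[current_index]
    # SimMap[j; i][p]: cosine similarity between embedded units of frame j and frame i
    sims = torch.stack([(e_j * e_i).sum(dim=1) for e_j in embedded], dim=0)  # (2K+1, B, h, w)
    weight_map = torch.softmax(sims, dim=0)          # equation (6); sums to 1 over frames (eq (5))
    stacked = torch.stack(feature_maps, dim=0)       # (2K+1, B, depth, h, w)
    return (weight_map.unsqueeze(2) * stacked).sum(dim=0)   # equation (7)
```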
Step 5: perform target detection and recognition on the fused convolution feature map and output the detection and recognition results. Target detection can reuse the conventional single-image pipeline that consumes a convolution feature map, for example the region proposal network, bounding-box regression network, and category classification network of Faster R-CNN in reference [1]. Detection can also be performed with a target-detection implementation similar to SSD (single shot multibox detector).
Detection algorithm summary based on fusion convolution feature map
The detection algorithm based on the fused convolution feature map is summarized below as Algorithm 1. In Algorithm 1, i denotes the current frame index, K the range of preceding and following adjacent frames, and j an adjacent frame. Algorithm 1 first computes the warped image of each adjacent frame relative to the current frame. Each warped image is then fed into the convolutional neural network to compute its convolution feature map (Feature Map). To cope with motion blur, large deformation, occlusion, and the like, a more robust convolution feature map is obtained by fusing the feature maps of the preceding and following frames; specifically, the per-frame feature maps are weighted and accumulated into a fused convolution feature map (Fusion Feature Map), and the fusion may be carried out per convolution feature map unit. The fused feature map is fed into the region proposal network (Region Proposal Network) to obtain ROI (Region of Interest) regions, which are finally passed to the classification network and the position regression network to produce the detection result. The pseudo code of the detection and recognition algorithm is shown below.
(Algorithm 1: pseudo code of detection based on the fused convolution feature map; rendered as an image in the original publication.)
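Because the pseudo code appears only as an image in the original publication, the sketch below re-expresses the steps described above: warp each adjacent frame to the current frame, extract a feature map per warped frame, fuse the maps, and hand the fused map to an RPN-plus-head detector. The helper names (optical_flow_field, bilinear_warp, conv_feature_map, gaussian_fuse, detection_head) refer to the illustrative sketches in this document, not to anything defined by the patent.

```python
import torch

def detect_frame(frames: list, i: int, K: int, detection_head):
    """Algorithm 1 sketch: detection on frame i using its 2K neighbouring frames.

    frames: list of (1, 3, H, W) tensors; detection_head: any callable mapping a fused
    convolution feature map to (boxes, classes), e.g. an RPN followed by an RoI head.
    Assumes frame i has K valid neighbours on each side, as in the body of the sequence.
    """
    feature_maps = []
    for j in range(i - K, i + K + 1):
        flow_i_to_j = optical_flow_field(frames[i], frames[j])   # equation (1)
        warped_j = bilinear_warp(frames[j], flow_i_to_j)         # equation (2)
        feature_maps.append(conv_feature_map(warped_j))          # equation (3)
    fused = gaussian_fuse(feature_maps)                          # equation (4), or adaptive_fuse
    return detection_head(fused)                                 # RPN -> RoI pooling -> cls/reg
```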
Training algorithm summary based on fusion convolution feature map
The training algorithm based on feature map fusion is differentiable, so the architecture can be trained end to end. Equation (2) is differentiable with respect to both the image pixels and the optical flow field; equation (3) is differentiable with respect to the pixels of each frame's warped image; equation (4) is differentiable with respect to the convolution feature maps. In the training phase, the number of adjacent frames is set to a small value (e.g. Ktrain = 2) because of GPU memory and similar constraints; the Ktrain frames can be randomly sampled from adjacent frames in a larger range before and after the current frame (e.g. K = 5), in the spirit of the dropout method used in neural network training.
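As a small illustration of that sampling scheme (a sketch under the assumption that Ktrain frames are drawn uniformly without replacement from the ±K window, excluding the current frame):

```python
import random

def sample_training_neighbours(i: int, K: int = 5, Ktrain: int = 2) -> list:
    """Randomly pick Ktrain neighbour indices from the 2K-frame window around frame i,
    in the spirit of dropout: each training step sees a different subset of neighbours."""
    window = [j for j in range(i - K, i + K + 1) if j != i]
    return sorted(random.sample(window, Ktrain))
```

The training method then comprises the following steps: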
1. preparing a dataset
The training data are selected and collected manually. Building the manually collected data set involves selecting a video, cropping images that contain the target, annotating the target position, and saving the image, target position, and target name information into a text file.
2. Training module design
The overall processing flow of the training module is shown in fig. 7; training is performed in the following six steps.
(1) Configure the network parameters. Because the network parameters are numerous and change little, a configuration class can be defined and stored in a file, which makes migration and reuse easy and avoids repeatedly defining the same parameters.
(2) Preprocess the data set. Object detection needs a relatively large amount of information: not only the category of each object but also its position. To make this information directly consumable by the network during training, a python script further processes the prepared images and labels: the image path, the annotated target position (the target's bounding box), and the target name are extracted into a text file, so that the training module obtains all data set information by reading this single file (see the sketch after this list). In addition, because validation is performed alongside training, the data set must be split into a training set and a validation set.
(3) Define the network. The network is the basis of the model; once the warped-image computation module and the feature fusion module are implemented, all networks are defined. As shown in fig. 8, the optical flow network and the warped-image computation module are defined first, then the shared convolution network shared_layers, the fusion algorithm of the deep convolution feature maps, the region proposal network RPN, and finally the fully connected networks for classification and bounding-box regression.
(4) Build the model. Once the networks are defined, Keras can be used to assemble the network architecture into a model and compile it in preparation for training.
(5) Train. The model is trained for a total of epoch rounds, each iterating over a certain amount of training data. The training strategy can be the four-step alternating training used for Faster R-CNN, or end-to-end training.
(6) Save the training results. Training produces a large amount of experimental data, of which the weight model file is critical, so the training process must save the weights to a file. In addition, data such as the validation-set accuracy recorded during training help analyze the training process, evaluate the results, and guide improvements to the network and further algorithm research.
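A minimal sketch of the annotation file produced in step (2) above; the one-line-per-box text format (image path, bounding box, class name, split) and the file name are assumptions, since the patent only states that this information is written to a text file.

```python
import csv
import random

def write_annotation_file(samples, out_path="annotations.txt", val_ratio=0.2):
    """samples: iterable of (image_path, (x1, y1, x2, y2), class_name).
    Writes one comma-separated line per annotated box and tags each line as train/val."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for image_path, (x1, y1, x2, y2), class_name in samples:
            split = "val" if random.random() < val_ratio else "train"
            writer.writerow([image_path, x1, y1, x2, y2, class_name, split])
```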
Model configuration
The model configuration defines some model parameters in advance, such as whether training augments the data set by translation, rotation, and so on, the number of RoIs, where the trained weight file is stored, and the name of the configuration file. Because a large number of attributes must be defined, a dedicated configuration file records these parameters.
Defining a loss function
The loss function is the basis of training optimization; defining a reasonable loss function greatly reduces the computation during training and yields better training results. Four loss functions are designed for the different training stages: the RPN regression loss, the RPN classification loss, the fully connected network regression loss, and the fully connected network classification loss.
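The patent names the four losses but not their exact form; the sketch below assumes the standard Faster R-CNN choices, binary/softmax cross-entropy for the classification branches and smooth L1 for the regression branches, purely for illustration.

```python
import torch
import torch.nn.functional as F

def rpn_cls_loss(objectness_logits, objectness_labels):
    # RPN classification loss: foreground vs. background (binary cross-entropy)
    return F.binary_cross_entropy_with_logits(objectness_logits, objectness_labels.float())

def rpn_reg_loss(pred_deltas, target_deltas, fg_mask):
    # RPN regression loss: smooth L1 on anchors matched to ground truth only
    return F.smooth_l1_loss(pred_deltas[fg_mask], target_deltas[fg_mask])

def head_cls_loss(class_logits, class_labels):
    # Fully connected network classification loss: softmax cross-entropy over classes
    return F.cross_entropy(class_logits, class_labels)

def head_reg_loss(pred_boxes, target_boxes, fg_mask):
    # Fully connected network regression loss: smooth L1 on foreground RoIs only
    return F.smooth_l1_loss(pred_boxes[fg_mask], target_boxes[fg_mask])
```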
Network architecture
The detection network consists of five parts: the warped-image computation part, the shared convolutional neural network, the deep convolution feature fusion part, the region proposal part, and the fully connected regression and classification networks. The deep convolution feature map fusion part fuses the feature maps of all warped images into a single, more robust deep convolution feature map that is less affected by occlusion, large target deformation, motion blur, and so on than the single-image feature map extracted by Faster R-CNN. Compared with the method of reference [2], a warped-image computation part is added, and the resulting convolution feature maps reflect the motion of the target object more accurately, so a higher-quality deep convolution feature map is obtained.
3. Detection module design
This module uses the trained weight file to detect the video or images uploaded by the user and to mark the name and position of each detected target object. The main steps are:
(1) Read in the video or pictures. Video and picture test sets are prepared separately; input videos and pictures containing various target objects can be detected.
(2) Load the model and configuration information. The model is the trained, detection-ready model produced by the training module and only needs to be loaded from its file; likewise, the configuration information is essentially fixed by the training module and only needs to be loaded from its file.
(3) Define the network. All networks must be defined first, following the network definition of fig. 8: the optical flow network and the warped-image computation module, the shared convolution network shared_layers, the fusion algorithm of the deep convolution feature maps, the region proposal network RPN, and finally the fully connected classification and bounding-box regression networks.
(4) Determine whether the input is a picture or a video. A picture is fed directly into the detection network to obtain the detection result; a video is first converted into individual frames with OpenCV, each frame is fed into the network for detection, and the detected frames are then reassembled into a video (see the sketch below).
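A sketch of step (4)'s video branch using OpenCV; detect_and_draw stands for the trained detection network plus box drawing and is an assumed placeholder, as is the mp4 codec choice.

```python
import cv2

def detect_video(in_path: str, out_path: str, detect_and_draw) -> None:
    """Split a video into frames, run detection on each frame, and write the annotated video."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(detect_and_draw(frame))   # frame with detected names and boxes drawn on it
    cap.release()
    writer.release()
```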
Convolution Feature Map (Feature Map) fusion module design
Based on the convolution feature map fusion algorithms above, the fusion module of the convolution feature map is designed as shown in fig. 9. First the optical flow field between each adjacent frame and the current frame is computed to obtain their per-pixel motion vectors; then the warped image of each adjacent frame is computed from the motion vectors of the optical flow field; the warped images are fed into the convolutional neural network to extract convolution feature maps; and finally the feature maps of the preceding and following frames are fused, where the concrete fusion can follow fusion algorithm 1 or fusion algorithm 2 above. The final convolution feature map combines global target information from multiple preceding and following frames, which addresses the reduced detection accuracy, or outright detection failure, caused by occlusion, deformation, and motion blur.
Beneficial effects of the technical solution of the invention
In tests on cartoon images, the temporal-extension method generally improves detection accuracy on cartoon images with large deformation. As shown in fig. 6, the left image shows the Faster R-CNN detection result for mickey_mouse (63%), and the right image shows the result of the temporal Faster R-CNN fusion (84%); in this image the cartoon character is strongly deformed compared with the preceding and following frames, and the temporal-extension-based detection method raises the detection accuracy from 63% to 84%.

Claims (4)

1. A method for detecting and identifying an image sequence object, the method comprising the steps of:
s1, inputting an image sequence to be processed, intercepting an image containing a target object, and calculating optical flow fields of a current frame and an adjacent frame of the image sequence;
s2: calculating a winding image of the adjacent frame relative to the current frame according to the motion vector of the optical flow field:
the optical flow field between adjacent images is calculated,
Figure QLYQS_1
to obtain an image->
Figure QLYQS_2
Relative to the image
Figure QLYQS_3
Motion offsets between pixels of (a); the wrapped image is obtained by a bilinear interpolation image wrapping transform based on the motion offset between pixels, as shown in equation (2), where BiLinearWarp is a bilinear interpolation transform used to generate the transformed wrapped image>
Figure QLYQS_4
Figure QLYQS_5
(2);
Wherein: the bilinear interpolation may be replaced by either nearest neighbor interpolation or lansos interpolation or bicubic interpolation;
s3, respectively inputting the winding images of the current frame and the adjacent frames into a convolutional neural network to calculate a convolutional feature map;
s4, after the convolution feature maps of the current frame and the adjacent frames are obtained, fusion of the convolution feature maps is carried out, and the fused convolution feature maps are obtained:
s5, target detection and identification are carried out according to the fused convolution characteristic map, and detection and identification results are output.
2. The method for detecting and identifying an image sequence object according to claim 1, wherein: the adjacent frames of the image sequence refer to the previous K frames of the current frame and the next K frames of the current frame, 2K frames are taken as the adjacent frames, K is a positive integer, and the value range is [1,20].
3. The method for detecting and identifying an image sequence object according to claim 1, wherein: the fusion algorithm of the convolution feature maps in step S4 includes two kinds, one of which is: the fused convolution feature map is calculated through formula (4), where the weights need to satisfy Σ_{j=i-K}^{i+K} w_j = 1; the weight w_j is set according to a normalized Gaussian(0, σ) function:
F_fused = Σ_{j=i-K}^{i+K} w_j · F_j    (4)
K is a positive integer with value range [1, 20]; F_j is a convolution feature map; σ is the standard deviation.
4. The method for detecting and identifying an image sequence object according to claim 1, wherein: the second fusion algorithm of the convolution feature map in step S4 provides weights at the level of convolution feature map units. Let the convolution feature map F_j be a tensor of shape [w, h, depth], where w and h represent the width and height of the spatial domain of the convolution feature map and depth represents the number of channels; a convolution feature map unit refers to the 1x1xdepth vector at a given spatial position. For position p of the convolution feature map of the j-th frame, the coefficient weightMap[j][p] is used to set the weight; different weights weightMap[j][p] are set for different convolution feature map positions p to calculate the fused convolution feature map, where equation (5) holds:
Σ_{j=i-K}^{i+K} weightMap[j][p] = 1    (5)
K is a positive integer with value range [1, 20];
the convolution feature map F_j, a tensor of shape [w, h, depth], is transformed through an embedding convolutional neural network with the following structure:
(1) a 1x1xnc convolution for dimensionality reduction;
(2) a 3x3xnc convolution;
(3) a 1x1xdepth convolution that restores the original channel dimension;
where nc represents the reduced channel count of the embedding network and takes a value of 2^n, so that the weights of the embedding network can be learned by the learning algorithm of the deep neural network; the embedding network transforms the original convolution feature map F_j into a transformed convolution feature map that is better suited for measuring the similarity between convolution feature map units of adjacent frames; the similarity between the convolution feature maps of the adjacent frame j and the current frame i is denoted SimMap[j; i], and the similarity may be measured using the cosine distance between the two vectors, the Pearson correlation, the negated Euclidean distance, or the negated Minkowski distance;
after computing SimMap, for each position p, the SimMap similarity values can be normalized to [0, 1] using a SoftMax function to obtain the weightMap, as shown in equation (6):
weightMap[j][p] = SoftMax({SimMap[j; i][p]}),  i-K ≤ j ≤ i+K    (6)
after the weightMap is calculated, the fused convolution feature map can be calculated from it, as shown in equation (7):
F_fused[p] = Σ_{j=i-K}^{i+K} weightMap[j][p] · F_j[p]    (7).
CN201811080439.4A 2018-09-17 2018-09-17 Image sequence target detection and identification method Active CN109190581B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811080439.4A CN109190581B (en) 2018-09-17 2018-09-17 Image sequence target detection and identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811080439.4A CN109190581B (en) 2018-09-17 2018-09-17 Image sequence target detection and identification method

Publications (2)

Publication Number Publication Date
CN109190581A CN109190581A (en) 2019-01-11
CN109190581B true CN109190581B (en) 2023-05-30

Family

ID=64911361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811080439.4A Active CN109190581B (en) 2018-09-17 2018-09-17 Image sequence target detection and identification method

Country Status (1)

Country Link
CN (1) CN109190581B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364466B (en) * 2018-02-11 2021-01-26 金陵科技学院 Traffic flow statistical method based on unmanned aerial vehicle traffic video
CN109919051A (en) * 2019-02-22 2019-06-21 同济大学 A kind of convolutional neural networks accelerated method for video image processing
CN109919874B (en) * 2019-03-07 2023-06-02 腾讯科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN111753821A (en) * 2019-03-27 2020-10-09 杭州海康威视数字技术股份有限公司 Text detection method and device
CN109978465B (en) * 2019-03-29 2021-08-03 江苏满运软件科技有限公司 Goods source recommendation method and device, electronic equipment and storage medium
CN110147733B (en) * 2019-04-16 2020-04-14 北京航空航天大学 Cross-domain large-range scene generation method
CN110070511B (en) * 2019-04-30 2022-01-28 北京市商汤科技开发有限公司 Image processing method and device, electronic device and storage medium
CN110728270A (en) * 2019-12-17 2020-01-24 北京影谱科技股份有限公司 Method, device and equipment for removing video character and computer readable storage medium
CN111160229B (en) * 2019-12-26 2024-04-02 北京工业大学 SSD network-based video target detection method and device
CN111179246B (en) * 2019-12-27 2021-01-29 中国科学院上海微系统与信息技术研究所 Pixel displacement confirming method and device, electronic equipment and storage medium
CN111402292B (en) * 2020-03-10 2023-04-07 南昌航空大学 Image sequence optical flow calculation method based on characteristic deformation error occlusion detection
CN113298723A (en) * 2020-07-08 2021-08-24 阿里巴巴集团控股有限公司 Video processing method and device, electronic equipment and computer storage medium
CN112241180B (en) * 2020-10-22 2021-08-17 北京航空航天大学 Visual processing method for landing guidance of unmanned aerial vehicle mobile platform
CN112307978B (en) * 2020-10-30 2022-05-24 腾讯科技(深圳)有限公司 Target detection method and device, electronic equipment and readable storage medium
CN112581301B (en) * 2020-12-17 2023-12-29 塔里木大学 Detection and early warning method and system for residual quantity of farmland residual film based on deep learning
CN113111842B (en) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10181195B2 (en) * 2015-12-28 2019-01-15 Facebook, Inc. Systems and methods for determining optical flow
US9858496B2 (en) * 2016-01-20 2018-01-02 Microsoft Technology Licensing, Llc Object detection and classification in images

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104205169A (en) * 2011-12-21 2014-12-10 Pierre and Marie Curie University (University of Paris VI) Method of estimating optical flow on the basis of an asynchronous light sensor
CN107305635A (en) * 2016-04-15 2017-10-31 株式会社理光 Object identifying method, object recognition equipment and classifier training method
CN107481185A (en) * 2017-08-24 2017-12-15 深圳市唯特视科技有限公司 A kind of style conversion method based on video image optimization
CN107886120A (en) * 2017-11-03 2018-04-06 北京清瑞维航技术发展有限公司 Method and apparatus for target detection tracking
CN108062531A (en) * 2017-12-25 2018-05-22 南京信息工程大学 A kind of video object detection method that convolutional neural networks are returned based on cascade
CN108256496A (en) * 2018-02-01 2018-07-06 江南大学 A kind of stockyard smog detection method based on video

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Philipp Fischer et al. FlowNet: Learning Optical Flow with Convolutional Networks. arXiv, 2015, 1-13. *
Xizhou Zhu et al. Deep Feature Flow for Video Recognition. arXiv, 2016, 1-14. *
Xizhou Zhu et al. Flow-Guided Feature Aggregation for Video Object Detection. arXiv, 2017, 1-10. *
Wang Zhenglai et al. Optical flow detection method for moving targets based on deep convolutional neural networks. Opto-Electronic Engineering, 2018, Vol. 45, No. 8, 180027-1 to 180027-10. *

Also Published As

Publication number Publication date
CN109190581A (en) 2019-01-11

Similar Documents

Publication Publication Date Title
CN109190581B (en) Image sequence target detection and identification method
Wang et al. Joint filtering of intensity images and neuromorphic events for high-resolution noise-robust imaging
Zhang et al. Deep image deblurring: A survey
Nguyen et al. Super-resolution for biometrics: A comprehensive survey
Aleotti et al. Generative adversarial networks for unsupervised monocular depth prediction
Su et al. Deep video deblurring for hand-held cameras
Uittenbogaard et al. Privacy protection in street-view panoramas using depth and multi-view imagery
Song et al. Joint face hallucination and deblurring via structure generation and detail enhancement
Jinno et al. Multiple exposure fusion for high dynamic range image acquisition
Gajjar et al. New learning based super-resolution: use of DWT and IGMRF prior
Zheng et al. Learning Cross-scale Correspondence and Patch-based Synthesis for Reference-based Super-Resolution.
Yang et al. Depth recovery using an adaptive color-guided auto-regressive model
Younus et al. Effective and fast deepfake detection method based on haar wavelet transform
Liu et al. Satellite video super-resolution based on adaptively spatiotemporal neighbors and nonlocal similarity regularization
US11651581B2 (en) System and method for correspondence map determination
Su et al. Super-resolution without dense flow
Wu et al. Single-shot face anti-spoofing for dual pixel camera
Kim et al. High-quality depth map up-sampling robust to edge noise of range sensors
Guarnieri et al. Perspective registration and multi-frame super-resolution of license plates in surveillance videos
Liu et al. Depth-guided sparse structure-from-motion for movies and tv shows
Liu et al. Deep joint estimation network for satellite video super-resolution with multiple degradations
KR101921608B1 (en) Apparatus and method for generating depth information
Liu et al. Unsupervised optical flow estimation for differently exposed images in LDR domain
CN113421186A (en) Apparatus and method for unsupervised video super-resolution using a generation countermeasure network
Thapa et al. Learning to Remove Refractive Distortions from Underwater Images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant