CN109190581B - Image sequence target detection and identification method - Google Patents

Image sequence target detection and identification method

Info

Publication number
CN109190581B
CN109190581B (application number CN201811080439.4A)
Authority
CN
China
Prior art keywords
image
convolution
feature map
convolution feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811080439.4A
Other languages
Chinese (zh)
Other versions
CN109190581A (en)
Inventor
龚如宾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinling Institute of Technology
Original Assignee
Jinling Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinling Institute of Technology filed Critical Jinling Institute of Technology
Priority to CN201811080439.4A priority Critical patent/CN109190581B/en
Publication of CN109190581A publication Critical patent/CN109190581A/en
Application granted granted Critical
Publication of CN109190581B publication Critical patent/CN109190581B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image detection and recognition and can be used for autonomous driving, visual target detection and recognition for unmanned aerial vehicles, cartoon video detection, and related image sequence processing. The input image is first transformed by motion-compensation-based warped-image computation (Image Warping); the warped image is fed into a deep convolutional neural network to extract a deep convolution feature map; on this basis the deep convolution feature maps of the preceding and following images in the sequence are fused; and the fused deep convolution feature map is finally used by target detection modules such as target localization, target recognition, or region proposal. The main technical features of the invention are a front-end module that computes motion-compensation-based warped images and a back-end module that fuses the deep convolution feature maps of the preceding and following images. The invention addresses the low detection accuracy, or even complete detection failure, caused in single images by partial occlusion, motion blur, and large deformation of the detection target.

Description

Image sequence target detection and identification method
Technical Field
The invention belongs to the technical field of image detection and recognition, and in particular relates to a method for detecting and recognizing targets in image sequences.
Background
In the field of image understanding, deep learning is widely used for image classification, image target detection, face recognition, and related tasks, and network models represented by Faster R-CNN achieve good results in image target detection. In video applications, deep learning is widely used in driverless cars, video surveillance, target detection and recognition for unmanned aerial vehicles, and similar fields. In these scenes, partial occlusion of the target, blur caused by fast motion, large target deformation, and similar factors often lead to low detection accuracy or outright detection failure. In the detection and recognition of cartoon videos, the exaggerated drawing of cartoon characters prevents stable extraction of low-level features, and cartoon images are drawn in many different styles, so understanding and recognizing cartoon images also remains a major challenge.
As shown in fig. 1, network models represented by Faster R-CNN are widely applied to single-image target detection: a convolution feature map is extracted from the input image by a convolutional network, and region proposal, classification, and position regression are then performed on that feature map, see U.S. patent reference [1]. When such a method is deployed in a real application environment, however, motion blur, reduced image quality, partial occlusion of the detection target, and large deformation of the detection target cause part of the input image's information to be lost and corrupted by noise, so the traditional Faster R-CNN algorithm fails to detect the target, which leads to various losses in practice.
Reference [2] uses a convolution feature map fusion technique but does not apply a motion-compensation-based image transform to the input image. Although reference [2] applies a motion-compensation-based transform to the convolution feature maps, its motion information is derived from the original input image sequence: the spatial dimensions of a convolution feature map are usually reduced by a factor of 16 or more relative to the original image, the pixel-level optical flow field cannot be converted exactly into an optical flow field on the feature map, and an approximation is used to turn pixel-level motion vectors into motion vectors between the spatial units of the feature map, so a large amount of detailed motion information is lost. Motion-compensating the convolution feature maps with imprecise motion vectors affects detection and recognition accuracy, which reference [2] does not analyze.
The invention extends existing single-image target detection algorithms based on deep convolutional neural networks to video image sequences. By computing the optical flow field between the images of the sequence, it obtains an accurate motion vector for every pixel of each image and can therefore use fine-grained motion detail to transform and correct the images. Convolution feature maps are computed for the transformed, corrected images, all feature maps are then fused, and target detection and recognition are finally carried out on the fused convolution feature map. The method alleviates the lack of information in a single image caused by occlusion, motion blur, large target deformation, and the like, and improves detection accuracy, thereby reducing economic and social losses.
References
1. Jian Sun, Beijing (CN); Ross Girshick, Seattle, WA (US); Shaoqing Ren, Beijing (CN); Kaiming He, Beijing (CN). OBJECT DETECTION AND CLASSIFICATION IN IMAGES. United States Patent Application Publication, Pub. No.: US 2017/0206431 A1.
2. Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, Yichen Wei. Flow-Guided Feature Aggregation for Video Object Detection. The IEEE International Conference on Computer Vision (ICCV), 2017.
3. A. Dosovitskiy et al. FlowNet: Learning Optical Flow with Convolutional Networks. IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, pp. 2758-2766, 2016.
4. E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Disclosure of Invention
Current target detection algorithms based on single-image deep convolutional neural networks cannot detect targets with high accuracy under motion blur, reduced image quality, partial occlusion of the detection target, large deformation of the detection target, and similar conditions; the invention is directed at detecting targets with high accuracy in these situations.
The technical solution of the invention comprises the following steps:
S1: input the image sequence to be processed, extract images containing the target object, and calculate the optical flow fields between the current frame and the adjacent frames of the image sequence;
S2: calculate the warped image of each adjacent frame relative to the current frame according to the motion vectors of the optical flow field;
S3: input the warped images of the current frame and the adjacent frames into a convolutional neural network to compute convolution feature maps;
S4: after the convolution feature maps of the current frame and the adjacent frames are obtained, fuse them to obtain the fused convolution feature map;
S5: perform target detection and recognition on the fused convolution feature map and output the detection and recognition results.
Advantageous effects
The invention is characterized by:
(1) Using the motion-compensation-based warped image as the input of the deep convolutional neural network that extracts the deep convolution feature map;
(2) Fusing at the level of the deep convolution feature maps;
The combination of these two techniques produces a more robust fused convolution feature map and improves detection and recognition accuracy.
Drawings
FIG. 1 is a schematic diagram of single-image Faster R-CNN;
FIG. 2 shows the temporal extension of Faster R-CNN;
FIG. 3 shows computing the warped image of an adjacent frame relative to the current frame from the motion vectors of the optical flow field;
FIG. 4 shows using warped images for feature extraction and fusion based on the deep convolution feature maps;
FIG. 5 shows similarity computation after transforming the convolution feature maps;
FIG. 6 compares the detection effect on a test example;
FIG. 7 is a flow chart of the training module;
FIG. 8 defines the detection network;
FIG. 9 is a flow chart of convolution feature map fusion.
Detailed Description
Building on earlier investigation and inspired by multi-obstacle detection for automatic driving, the single-image target detection algorithm represented by Faster R-CNN is extended to detect targets in the image sequences of a video. The model architecture is shown in fig. 2. The lowest layer is the individual frames of the video; each frame is fed into a shared convolution network to compute its convolution feature map (Feature Map), and the feature map of the current frame is then fused with the feature maps of several preceding and following frames by a fusion algorithm. The resulting fused convolution feature map is fed into the region proposal network (Region Proposal Network); the RPN slides a window over the fused feature map to generate region proposals. The RPN output consists of a classification result, which separates objects into foreground and background, and a regression result, which marks the possible positions of objects. In the training stage, the IoU (Intersection over Union) between the regressed object regions and the ground-truth object regions is evaluated; foreground regions whose IoU exceeds a threshold are passed to RoI Pooling (Region of Interest Pooling), which extracts high-quality features from the convolution feature map according to the region proposals, and these features are finally used for object classification and position regression.
Because objects in the images are moving, a moving object appears at different positions in the image at different times; simple weighted-sum fusion would therefore produce ghosting at those different positions. The invention instead applies a motion-compensation-based warping transform (Image Warping) to the input image to obtain a motion-compensated warped image, and then feeds the warped image into the convolutional neural network to obtain the convolution feature map (Feature Map). The specific implementation is as follows:
step 1: inputting an image sequence to be processed, and calculating optical flow fields between adjacent frames and current frames of the image sequence:
when an object in space moves or a camera moves, the optical flow of the motion of the object or the motion of the camera is reflected on two continuous images taken. On a two-dimensional image, optical flow may be represented using 2D motion vectors, representing movement from a first frame to a second inter-frame point. The set of motion vectors for each pixel in the two images is called the optical flow field. The optical flow field has wide application in the fields of video image motion 3D restoration, image compression, image enhancement and the like. The adjacent frames of the image sequence refer to the previous K frames of the current frame and the next K frames of the current frame, 2K frames are taken as the adjacent frames, K is a positive integer, and the value range is [1,20].
To compute the warped image from the motion information of the target object, a deep-convolutional-network technique, FlowNet (see references [3, 4]), is used here to compute the optical flow field (Optical Flow) between the current frame and an adjacent frame. For input images I_j and I_i, the optical flow field between the two images is predicted by equation (1); the specific network architecture can be FlowNet2-s or FlowNet2 from reference [4]. Here θ denotes the parameters to be learned in the deep convolutional network; adjusting θ gives the network the ability to predict the optical flow field.
M_{j→i} = OpticalFlowNet(θ, I_j, I_i)    (1)
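As a concrete illustration of equation (1), the sketch below predicts a dense optical flow field with a pretrained RAFT model from torchvision standing in for the FlowNet2 network named in the text; the function name optical_flow_field and the choice of RAFT are assumptions for illustration only, and PyTorch is used here for brevity even though the prototype described later in this document is Keras-based.

```python
# Minimal sketch of equation (1), assuming PyTorch/torchvision, with RAFT as a
# stand-in for the learned OpticalFlowNet(theta, I_j, I_i) of the text.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
flow_net = raft_large(weights=weights).eval()   # the learned parameters theta
preprocess = weights.transforms()

@torch.no_grad()
def optical_flow_field(frame_i: torch.Tensor, frame_j: torch.Tensor) -> torch.Tensor:
    """Predict M_{j->i}: per-pixel (dx, dy) offsets mapping frame-i positions into frame j.

    frame_i, frame_j: (B, 3, H, W) float tensors in [0, 1]; H and W should be
    multiples of 8 for RAFT.
    """
    img_i, img_j = preprocess(frame_i, frame_j)
    # RAFT returns a list of iteratively refined flow fields; keep the final one.
    return flow_net(img_i, img_j)[-1]            # (B, 2, H, W)
```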
Step 2: calculate the warped image of each adjacent frame relative to the current frame according to the motion vectors of the optical flow field:
The optical flow field between adjacent images, M_{j→i} = OpticalFlowNet(I_j, I_i), is computed to obtain the motion offsets of the pixels of image I_j relative to image I_i. The warped image is then obtained by bilinear-interpolation image warping (Image Warping) according to these per-pixel motion offsets, as shown in equation (2), where BiLinearWarp denotes the bilinear interpolation used to generate the warped image I^warp_{j→i}:
I^warp_{j→i} = BiLinearWarp(I_j, M_{j→i})    (2)
The bilinear interpolation is illustrated in fig. 3. For a pixel (x, y) of the predicted image I_pred, the optical flow field M_{j→i} gives the corresponding point (x+Δx, y+Δy) in the original image I. This point generally does not fall on an integer pixel position but inside the rectangle enclosed by four real pixels Pa, Pb, Pc, Pd, as shown in fig. 3; its color value is therefore computed from the four neighboring pixel values by bilinear interpolation:
I_pred(x, y) = w_a·Pa + w_b·Pb + w_c·Pc + w_d·Pd,  where w_a, w_b, w_c, w_d are the standard bilinear weights determined by the fractional position of (x+Δx, y+Δy) inside the rectangle and sum to 1.
Although bilinear interpolation is used here to compute the warped image, other spatial interpolation transforms such as nearest-neighbor interpolation, Lanczos interpolation, and bicubic interpolation also fall within the scope of the invention.
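A minimal sketch of the BiLinearWarp operation of equation (2), implemented with torch.nn.functional.grid_sample; the function name bilinear_warp and the tensor layout are assumptions, and the other interpolation modes mentioned above can be obtained by changing the mode argument.

```python
import torch
import torch.nn.functional as F

def bilinear_warp(frame_j: torch.Tensor, flow_i_to_j: torch.Tensor) -> torch.Tensor:
    """Backward-warp frame j onto the current frame i's pixel grid (equation (2)).

    frame_j:      (B, C, H, W) image (or feature map) of an adjacent frame.
    flow_i_to_j:  (B, 2, H, W) per-pixel offsets (dx, dy); pixel (x, y) of the
                  warped result is sampled from frame_j at (x + dx, y + dy).
    """
    B, _, H, W = frame_j.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=frame_j.dtype, device=frame_j.device),
        torch.arange(W, dtype=frame_j.dtype, device=frame_j.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W), (x, y) order
    coords = base + flow_i_to_j                        # sampling positions in frame j
    # grid_sample expects coordinates normalised to [-1, 1], shape (B, H, W, 2)
    grid_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(frame_j, grid, mode="bilinear", align_corners=True)
```

Because the sampling is bilinear, the warp is differentiable with respect to both the image values and the flow field, which is exactly the property relied on later for end-to-end training.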
Step 3: input each warped image into a convolutional neural network to compute a convolution feature map:
A convolution feature map is extracted from each warped image: the motion-compensation-based warped image of an adjacent frame relative to the current frame is computed and then fed into a convolutional neural network to obtain a convolution feature map (Feature Map), as shown in equation (3). Because the warped image uses the motion information provided by the optical flow field to motion-compensate (Motion Compensation) the pixels of moving objects, extracting features from warped images rather than from the original frames means that a given moving object generally appears at the same position in the warped versions of the preceding and following frames; its convolution features therefore also appear at the same position, which lays the foundation for the subsequent feature fusion.
F_j = ConvNet(I^warp_{j→i})    (3)
Equation (3) extracts the convolution feature map F_j from the warped image I^warp_{j→i} with the convolutional neural network ConvNet. On this basis, as shown in fig. 4, the convolution feature maps {F_j} are fused to compute the fused convolution feature map.
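As a sketch of equation (3), the shared ConvNet can be any backbone that reduces the spatial resolution by a factor of 16; the truncated torchvision ResNet-50 below (through layer3) is an assumption that matches the 1/16 reduction mentioned later, not a backbone prescribed by the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Shared backbone: ResNet-50 truncated after layer3 yields stride-16 feature maps
# with 1024 channels (an assumption; any shared ConvNet fits equation (3)).
backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-3]).eval()

@torch.no_grad()
def conv_feature_map(warped_frame: torch.Tensor) -> torch.Tensor:
    """Equation (3): F_j = ConvNet(warped image), shape (B, 1024, H/16, W/16)."""
    return backbone(warped_frame)
```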
Step 4: after the convolution feature maps of the current frame and the adjacent frames are obtained, fuse the convolution feature maps to obtain the fused convolution feature map:
different methods can be used for the fusion algorithm of the convolution feature map, and the fusion algorithm of two convolution features is listed below.
(1) Convolution feature map fusion algorithm 1
The fused convolution feature map is computed by equation (4), where the weights must satisfy Σ_{j=i-K}^{i+K} w_j = 1. Regarding the weight w_j: in general, the closer a frame is to the current frame, the stronger the correlation between its targets and those of the current frame, and the larger the corresponding weight w_j; the farther a frame is from the current frame, the weaker that correlation and the smaller w_j. The weights w_j can be set by normalizing a Gaussian(0, σ) function; because the input warped images have already been motion-compensated, the detection result is insensitive to the Gaussian standard deviation σ.
F_fused = Σ_{j=i-K}^{i+K} w_j · F_j    (4)
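A minimal sketch of fusion algorithm 1 / equation (4): the 2K+1 feature maps are weighted with a normalized Gaussian(0, σ) over the frame offset and summed. The helper name gaussian_fuse and the default σ are assumptions for illustration.

```python
import torch

def gaussian_fuse(feature_maps: list, sigma: float = 1.0) -> torch.Tensor:
    """Fusion algorithm 1 (equation (4)).

    feature_maps: 2K+1 convolution feature maps ordered [F_{i-K}, ..., F_i, ..., F_{i+K}],
    each of shape (B, C, h, w), all computed from motion-compensated warped images.
    """
    K = (len(feature_maps) - 1) // 2
    offsets = torch.arange(-K, K + 1, dtype=torch.float32)
    w = torch.exp(-0.5 * (offsets / sigma) ** 2)   # Gaussian(0, sigma) over the frame offset
    w = w / w.sum()                                # normalise so the weights sum to 1
    return sum(w_j * F_j for w_j, F_j in zip(w, feature_maps))
```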
(2) Convolution feature map fusion algorithm 2
In fusion algorithm 1 every frame is weighted with a single per-frame weight, which is not appropriate for the whole frame when part of it is occluded or strongly deformed. For example, a static background is insensitive to the weight coefficient, but for a partially occluded object in a frame the occluded part should receive a lower weight, so that the un-occluded object information from the preceding and following frames is fully used during fusion. The second fusion algorithm therefore assigns weights at the level of convolution feature map units. Assume the convolution feature map F_j is a tensor of shape [w, h, depth], where w and h are the width and height of the spatial domain of the feature map (for a ResNet-50 residual network they may be 1/16 of the original image width and height) and depth is the number of channels; a convolution feature map unit is the 1x1xdepth vector at a given spatial position, for example the vector at position p in fig. 5. For position p of the j-th frame's feature map, the weight is set by the coefficient weightMap[j][p]; different positions p receive different weights weightMap[j][p] when computing the fused feature map, and equation (5) must hold:
Σ_{j=i-K}^{i+K} weightMap[j][p] = 1,  for every position p    (5)
In this embodiment the weights are computed as follows. The convolution feature map F_j, a tensor of shape [w, h, depth], is transformed by an embedding convolutional neural network with the following structure:
(1) a 1x1xnc convolution for dimensionality reduction;
(2) a 3x3xnc convolution;
(3) a 1x1xdepth convolution that restores the original channel dimension.
Here nc is the reduced channel count of the embedding network, chosen as a power of two (e.g. 512), so that the embedding network's weights can be learned with the usual deep-network learning algorithms. The embedding network transforms the original convolution feature map F_j into a transformed feature map that is better suited for measuring the similarity between convolution feature map units of adjacent frames; this similarity is denoted SimMap[j; i], as shown in fig. 5.
The similarity is defined according to the following principles: 1) if two units come from the same part of the same target, they present similar colors to the camera and have similar convolution feature values, and are given a high similarity value; if they come from different targets or from different parts of the same target, the colors generally differ considerably, the convolution feature values differ as well, and a low similarity value is given; 2) the similarity value is symmetric. The invention does not restrict the similarity computation method; any method satisfying these principles may be used.
The similarity value can be measured, for example, by the cosine distance between the two vectors, the Euclidean distance, the Pearson correlation, or the Minkowski distance; the similarity measure includes, but is not limited to, these methods. For example, when j = i the similarity SimMap[j; i][p] should take its maximum value; with the cosine measure, SimMap[i; i][p] = 1. If a measure is used for which a smaller distance means a greater similarity, such as the Euclidean and Minkowski distances, its negative can be taken for the subsequent computation.
After SimMap has been computed, for each position p the similarity values can be normalized to [0, 1] with a SoftMax function to obtain the weightMap, as shown in equation (6):
weightMap[j][p] = SoftMax({SimMap[j; i][p]}),  i-K ≤ j ≤ i+K    (6)
After the weightMap has been calculated, the fused convolution feature map is computed from it, as shown in equation (7):
F_fused[p] = Σ_{j=i-K}^{i+K} weightMap[j][p] · F_j[p]    (7)
In a convolution feature map fused this way, feature map units that come from different object image regions (for example because of partial occlusion) receive lower weight coefficients and contribute less to the fused map, so the fused convolution feature map is less affected by occlusion, large target deformation, and the like.
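The sketch below illustrates fusion algorithm 2 under stated assumptions: the embedding network follows the 1x1 / 3x3 / 1x1 structure above, cosine similarity is used as the SimMap measure, and the SoftMax of equation (6) is taken over the 2K+1 frames at each spatial position. Names such as EmbeddingNet and adaptive_fuse are illustrative, not from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Embedding network for similarity measurement: 1x1xnc -> 3x3xnc -> 1x1xdepth."""
    def __init__(self, depth: int, nc: int = 512):
        super().__init__()
        # The activation choice is an assumption; the patent only gives the convolution shapes.
        self.net = nn.Sequential(
            nn.Conv2d(depth, nc, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(nc, nc, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(nc, depth, kernel_size=1),
        )

    def forward(self, x):
        return self.net(x)

def adaptive_fuse(feature_maps: list, embed: EmbeddingNet, current_index: int) -> torch.Tensor:
    """Fusion algorithm 2: per-unit weights from cosine similarity + SoftMax (eqs (5)-(7)).

    feature_maps: 2K+1 maps [F_{i-K}, ..., F_{i+K}], each (B, depth, h, w).
    current_index: index of the current frame i inside the list.
    """
    embedded = [F.normalize(embed(Fj), dim=1) for Fj in feature_maps]  # unit-length unit vectors
    e_i = embedded[current_index]
    # SimMap[j; i][p]: cosine similarity between embedded units of frame j and frame i
    sims = torch.stack([(e_j * e_i).sum(dim=1) for e_j in embedded], dim=0)  # (2K+1, B, h, w)
    weight_map = torch.softmax(sims, dim=0)          # equation (6); sums to 1 over frames (eq (5))
    stacked = torch.stack(feature_maps, dim=0)       # (2K+1, B, depth, h, w)
    return (weight_map.unsqueeze(2) * stacked).sum(dim=0)   # equation (7)
```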
Step 5: perform target detection and recognition on the fused convolution feature map and output the detection and recognition results. Target detection can reuse the conventional single-image pipeline that consumes a convolution feature map, for example the region proposal network, bounding-box regression network, and category classification network of Faster R-CNN in reference [1]. Detection can also be performed with a target-detection implementation similar to SSD (single shot multibox detector).
Detection algorithm summary based on fusion convolution feature map
The detection algorithm based on the fused convolution feature map is summarized below as Algorithm 1. In Algorithm 1, i denotes the current frame index, K the range of preceding and following adjacent frames, and j an adjacent frame. Algorithm 1 first computes the warped image of each adjacent frame relative to the current frame. Each warped image is then fed into the convolutional neural network to compute its convolution feature map (Feature Map). To cope with motion blur, large deformation, occlusion, and the like, a more robust convolution feature map is obtained by fusing the feature maps of the preceding and following frames; specifically, the per-frame feature maps are weighted and accumulated into a fused convolution feature map (Fusion Feature Map), and the fusion may be carried out per convolution feature map unit. The fused feature map is fed into the region proposal network (Region Proposal Network) to obtain ROI (Region of Interest) regions, which are finally passed to the classification network and the position regression network to produce the detection result. The pseudo code of the detection and recognition algorithm is shown below.
(Algorithm 1: pseudo code of detection based on the fused convolution feature map; rendered as an image in the original publication.)
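Because the pseudo code appears only as an image in the original publication, the sketch below re-expresses the steps described above: warp each adjacent frame to the current frame, extract a feature map per warped frame, fuse the maps, and hand the fused map to an RPN-plus-head detector. The helper names (optical_flow_field, bilinear_warp, conv_feature_map, gaussian_fuse, detection_head) refer to the illustrative sketches in this document, not to anything defined by the patent.

```python
import torch

def detect_frame(frames: list, i: int, K: int, detection_head):
    """Algorithm 1 sketch: detection on frame i using its 2K neighbouring frames.

    frames: list of (1, 3, H, W) tensors; detection_head: any callable mapping a fused
    convolution feature map to (boxes, classes), e.g. an RPN followed by an RoI head.
    Assumes frame i has K valid neighbours on each side, as in the body of the sequence.
    """
    feature_maps = []
    for j in range(i - K, i + K + 1):
        flow_i_to_j = optical_flow_field(frames[i], frames[j])   # equation (1)
        warped_j = bilinear_warp(frames[j], flow_i_to_j)         # equation (2)
        feature_maps.append(conv_feature_map(warped_j))          # equation (3)
    fused = gaussian_fuse(feature_maps)                          # equation (4), or adaptive_fuse
    return detection_head(fused)                                 # RPN -> RoI pooling -> cls/reg
```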
Training algorithm summary based on fusion convolution feature map
The training algorithm based on feature map fusion is differentiable, so the architecture can be trained end to end. Equation (2) is differentiable with respect to both the image pixels and the optical flow field; equation (3) is differentiable with respect to the pixels of each frame's warped image; equation (4) is differentiable with respect to the convolution feature maps. In the training phase, the number of adjacent frames is set to a small value (e.g. Ktrain = 2) because of GPU memory and similar constraints; the Ktrain frames can be randomly sampled from adjacent frames in a larger range before and after the current frame (e.g. K = 5), in the spirit of the dropout method used in neural network training.
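As a small illustration of that sampling scheme (a sketch under the assumption that Ktrain frames are drawn uniformly without replacement from the ±K window, excluding the current frame):

```python
import random

def sample_training_neighbours(i: int, K: int = 5, Ktrain: int = 2) -> list:
    """Randomly pick Ktrain neighbour indices from the 2K-frame window around frame i,
    in the spirit of dropout: each training step sees a different subset of neighbours."""
    window = [j for j in range(i - K, i + K + 1) if j != i]
    return sorted(random.sample(window, Ktrain))
```

The training method then comprises the following steps: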
1. preparing a dataset
The training data are selected and collected manually. Building the manually collected data set involves selecting a video, cropping images that contain the target, annotating the target position, and saving the image, target position, and target name information into a text file.
2. Training module design
The overall processing flow of the training module is shown in fig. 7; training is performed in the following six steps.
(1) Configure the network parameters. Because the network parameters are numerous and change little, a configuration class can be defined and stored in a file, which makes migration and reuse easy and avoids repeatedly defining the same parameters.
(2) Preprocess the data set. Object detection needs a relatively large amount of information: not only the category of each object but also its position. To make this information directly consumable by the network during training, a python script further processes the prepared images and labels: the image path, the annotated target position (the target's bounding box), and the target name are extracted into a text file, so that the training module obtains all data set information by reading this single file (see the sketch after this list). In addition, because validation is performed alongside training, the data set must be split into a training set and a validation set.
(3) Define the network. The network is the basis of the model; once the warped-image computation module and the feature fusion module are implemented, all networks are defined. As shown in fig. 8, the optical flow network and the warped-image computation module are defined first, then the shared convolution network shared_layers, the fusion algorithm of the deep convolution feature maps, the region proposal network RPN, and finally the fully connected networks for classification and bounding-box regression.
(4) Build the model. Once the networks are defined, Keras can be used to assemble the network architecture into a model and compile it in preparation for training.
(5) Train. The model is trained for a total of epoch rounds, each iterating over a certain amount of training data. The training strategy can be the four-step alternating training used for Faster R-CNN, or end-to-end training.
(6) Save the training results. Training produces a large amount of experimental data, of which the weight model file is critical, so the training process must save the weights to a file. In addition, data such as the validation-set accuracy recorded during training help analyze the training process, evaluate the results, and guide improvements to the network and further algorithm research.
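A minimal sketch of the annotation file produced in step (2) above; the one-line-per-box text format (image path, bounding box, class name, split) and the file name are assumptions, since the patent only states that this information is written to a text file.

```python
import csv
import random

def write_annotation_file(samples, out_path="annotations.txt", val_ratio=0.2):
    """samples: iterable of (image_path, (x1, y1, x2, y2), class_name).
    Writes one comma-separated line per annotated box and tags each line as train/val."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for image_path, (x1, y1, x2, y2), class_name in samples:
            split = "val" if random.random() < val_ratio else "train"
            writer.writerow([image_path, x1, y1, x2, y2, class_name, split])
```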
Model configuration
The model configuration defines some model parameters in advance, such as whether training augments the data set by translation, rotation, and so on, the number of RoIs, where the trained weight file is stored, and the name of the configuration file. Because a large number of attributes must be defined, a dedicated configuration file records these parameters.
Defining a loss function
The loss function is the basis of training optimization; defining a reasonable loss function greatly reduces the computation during training and yields better training results. Four loss functions are designed for the different training stages: the RPN regression loss, the RPN classification loss, the fully connected network regression loss, and the fully connected network classification loss.
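The patent names the four losses but not their exact form; the sketch below assumes the standard Faster R-CNN choices, binary/softmax cross-entropy for the classification branches and smooth L1 for the regression branches, purely for illustration.

```python
import torch
import torch.nn.functional as F

def rpn_cls_loss(objectness_logits, objectness_labels):
    # RPN classification loss: foreground vs. background (binary cross-entropy)
    return F.binary_cross_entropy_with_logits(objectness_logits, objectness_labels.float())

def rpn_reg_loss(pred_deltas, target_deltas, fg_mask):
    # RPN regression loss: smooth L1 on anchors matched to ground truth only
    return F.smooth_l1_loss(pred_deltas[fg_mask], target_deltas[fg_mask])

def head_cls_loss(class_logits, class_labels):
    # Fully connected network classification loss: softmax cross-entropy over classes
    return F.cross_entropy(class_logits, class_labels)

def head_reg_loss(pred_boxes, target_boxes, fg_mask):
    # Fully connected network regression loss: smooth L1 on foreground RoIs only
    return F.smooth_l1_loss(pred_boxes[fg_mask], target_boxes[fg_mask])
```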
Network architecture
The detection network consists of five parts: the warped-image computation part, the shared convolutional neural network, the deep convolution feature fusion part, the region proposal part, and the fully connected regression and classification networks. The deep convolution feature map fusion part fuses the feature maps of all warped images into a single, more robust deep convolution feature map that is less affected by occlusion, large target deformation, motion blur, and so on than the single-image feature map extracted by Faster R-CNN. Compared with the method of reference [2], a warped-image computation part is added, and the resulting convolution feature maps reflect the motion of the target object more accurately, so a higher-quality deep convolution feature map is obtained.
3. Detection module design
This module uses the trained weight file to detect the video or images uploaded by the user and to mark the name and position of each detected target object. The main steps are:
(1) Read in the video or pictures. Video and picture test sets are prepared separately; input videos and pictures containing various target objects can be detected.
(2) Load the model and configuration information. The model is the trained, detection-ready model produced by the training module and only needs to be loaded from its file; likewise, the configuration information is essentially fixed by the training module and only needs to be loaded from its file.
(3) Define the network. All networks must be defined first, following the network definition of fig. 8: the optical flow network and the warped-image computation module, the shared convolution network shared_layers, the fusion algorithm of the deep convolution feature maps, the region proposal network RPN, and finally the fully connected classification and bounding-box regression networks.
(4) Determine whether the input is a picture or a video. A picture is fed directly into the detection network to obtain the detection result; a video is first converted into individual frames with OpenCV, each frame is fed into the network for detection, and the detected frames are then reassembled into a video (see the sketch below).
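A sketch of step (4)'s video branch using OpenCV; detect_and_draw stands for the trained detection network plus box drawing and is an assumed placeholder, as is the mp4 codec choice.

```python
import cv2

def detect_video(in_path: str, out_path: str, detect_and_draw) -> None:
    """Split a video into frames, run detection on each frame, and write the annotated video."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(detect_and_draw(frame))   # frame with detected names and boxes drawn on it
    cap.release()
    writer.release()
```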
Convolution Feature Map (Feature Map) fusion module design
Based on the convolution feature map fusion algorithms above, the fusion module of the convolution feature map is designed as shown in fig. 9. First the optical flow field between each adjacent frame and the current frame is computed to obtain their per-pixel motion vectors; then the warped image of each adjacent frame is computed from the motion vectors of the optical flow field; the warped images are fed into the convolutional neural network to extract convolution feature maps; and finally the feature maps of the preceding and following frames are fused, where the concrete fusion can follow fusion algorithm 1 or fusion algorithm 2 above. The final convolution feature map combines global target information from multiple preceding and following frames, which addresses the reduced detection accuracy, or outright detection failure, caused by occlusion, deformation, and motion blur.
Beneficial effects of the technical solution of the invention
In tests on cartoon images, the temporal-extension method generally improves detection accuracy on cartoon images with large deformation. As shown in fig. 6, the left image shows the Faster R-CNN detection result for mickey_mouse (63%), and the right image shows the result of the temporal Faster R-CNN fusion (84%); in this image the cartoon character is strongly deformed compared with the preceding and following frames, and the temporal-extension-based detection method raises the detection accuracy from 63% to 84%.

Claims (4)

1. A method for detecting and identifying an image sequence object, the method comprising the steps of:
s1, inputting an image sequence to be processed, intercepting an image containing a target object, and calculating optical flow fields of a current frame and an adjacent frame of the image sequence;
s2: calculating a winding image of the adjacent frame relative to the current frame according to the motion vector of the optical flow field:
the optical flow field between adjacent images is calculated,
Figure QLYQS_1
to obtain an image->
Figure QLYQS_2
Relative to the image
Figure QLYQS_3
Motion offsets between pixels of (a); the wrapped image is obtained by a bilinear interpolation image wrapping transform based on the motion offset between pixels, as shown in equation (2), where BiLinearWarp is a bilinear interpolation transform used to generate the transformed wrapped image>
Figure QLYQS_4
Figure QLYQS_5
(2);
Wherein: the bilinear interpolation may be replaced by either nearest neighbor interpolation or lansos interpolation or bicubic interpolation;
s3, respectively inputting the winding images of the current frame and the adjacent frames into a convolutional neural network to calculate a convolutional feature map;
s4, after the convolution feature maps of the current frame and the adjacent frames are obtained, fusion of the convolution feature maps is carried out, and the fused convolution feature maps are obtained:
s5, target detection and identification are carried out according to the fused convolution characteristic map, and detection and identification results are output.
2. The method for detecting and identifying an image sequence object according to claim 1, wherein: the adjacent frames of the image sequence refer to the previous K frames of the current frame and the next K frames of the current frame, 2K frames are taken as the adjacent frames, K is a positive integer, and the value range is [1,20].
3. The method for detecting and identifying an image sequence object according to claim 1, wherein: the fusion algorithm of the convolution feature maps in step S4 includes two kinds, one of which is: the fused convolution feature map is calculated through formula (4), where the weights need to satisfy Σ_{j=i-K}^{i+K} w_j = 1; the weight w_j is set according to a normalized Gaussian(0, σ) function:
F_fused = Σ_{j=i-K}^{i+K} w_j · F_j    (4)
K is a positive integer with value range [1, 20]; F_j is a convolution feature map; σ is the standard deviation.
4. The method for detecting and identifying an image sequence object according to claim 1, wherein: the second fusion algorithm of the convolution feature map in step S4 provides weights at the level of convolution feature map units. Let the convolution feature map F_j be a tensor of shape [w, h, depth], where w and h represent the width and height of the spatial domain of the convolution feature map and depth represents the number of channels; a convolution feature map unit refers to the 1x1xdepth vector at a given spatial position. For position p of the convolution feature map of the j-th frame, the coefficient weightMap[j][p] is used to set the weight; different weights weightMap[j][p] are set for different convolution feature map positions p to calculate the fused convolution feature map, where equation (5) holds:
Σ_{j=i-K}^{i+K} weightMap[j][p] = 1    (5)
K is a positive integer with value range [1, 20];
the convolution feature map F_j, a tensor of shape [w, h, depth], is transformed through an embedding convolutional neural network with the following structure:
(1) a 1x1xnc convolution for dimensionality reduction;
(2) a 3x3xnc convolution;
(3) a 1x1xdepth convolution that restores the original channel dimension;
where nc represents the reduced channel count of the embedding network and takes a value of 2^n, so that the weights of the embedding network can be learned by the learning algorithm of the deep neural network; the embedding network transforms the original convolution feature map F_j into a transformed convolution feature map that is better suited for measuring the similarity between convolution feature map units of adjacent frames; the similarity between the convolution feature maps of the adjacent frame j and the current frame i is denoted SimMap[j; i], and the similarity may be measured using the cosine distance between the two vectors, the Pearson correlation, the negated Euclidean distance, or the negated Minkowski distance;
after computing SimMap, for each position p, the SimMap similarity values can be normalized to [0, 1] using a SoftMax function to obtain the weightMap, as shown in equation (6):
weightMap[j][p] = SoftMax({SimMap[j; i][p]}),  i-K ≤ j ≤ i+K    (6)
after the weightMap is calculated, the fused convolution feature map can be calculated from it, as shown in equation (7):
F_fused[p] = Σ_{j=i-K}^{i+K} weightMap[j][p] · F_j[p]    (7).
CN201811080439.4A 2018-09-17 2018-09-17 Image sequence target detection and identification method Active CN109190581B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811080439.4A CN109190581B (en) 2018-09-17 2018-09-17 Image sequence target detection and identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811080439.4A CN109190581B (en) 2018-09-17 2018-09-17 Image sequence target detection and identification method

Publications (2)

Publication Number Publication Date
CN109190581A CN109190581A (en) 2019-01-11
CN109190581B true CN109190581B (en) 2023-05-30

Family

ID=64911361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811080439.4A Active CN109190581B (en) 2018-09-17 2018-09-17 Image sequence target detection and identification method

Country Status (1)

Country Link
CN (1) CN109190581B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364466B (en) * 2018-02-11 2021-01-26 金陵科技学院 Traffic flow statistical method based on unmanned aerial vehicle traffic video
CN109919051A (en) * 2019-02-22 2019-06-21 同济大学 A kind of convolutional neural networks accelerated method for video image processing
CN109919874B (en) * 2019-03-07 2023-06-02 腾讯科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN111753821A (en) * 2019-03-27 2020-10-09 杭州海康威视数字技术股份有限公司 Text detection method and device
CN109978465B (en) * 2019-03-29 2021-08-03 江苏满运软件科技有限公司 Goods source recommendation method and device, electronic equipment and storage medium
CN110147733B (en) * 2019-04-16 2020-04-14 北京航空航天大学 Cross-domain large-range scene generation method
CN110070511B (en) * 2019-04-30 2022-01-28 北京市商汤科技开发有限公司 Image processing method and device, electronic device and storage medium
CN110728270A (en) * 2019-12-17 2020-01-24 北京影谱科技股份有限公司 Method, device and equipment for removing video character and computer readable storage medium
CN111160229B (en) * 2019-12-26 2024-04-02 北京工业大学 SSD network-based video target detection method and device
CN111179246B (en) * 2019-12-27 2021-01-29 中国科学院上海微系统与信息技术研究所 Pixel displacement confirming method and device, electronic equipment and storage medium
CN111402292B (en) * 2020-03-10 2023-04-07 南昌航空大学 Image sequence optical flow calculation method based on characteristic deformation error occlusion detection
CN113298723A (en) * 2020-07-08 2021-08-24 阿里巴巴集团控股有限公司 Video processing method and device, electronic equipment and computer storage medium
CN112241180B (en) * 2020-10-22 2021-08-17 北京航空航天大学 Visual processing method for landing guidance of unmanned aerial vehicle mobile platform
CN112307978B (en) * 2020-10-30 2022-05-24 腾讯科技(深圳)有限公司 Target detection method and device, electronic equipment and readable storage medium
CN112581301B (en) * 2020-12-17 2023-12-29 塔里木大学 Detection and early warning method and system for residual quantity of farmland residual film based on deep learning
CN113111842B (en) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10181195B2 (en) * 2015-12-28 2019-01-15 Facebook, Inc. Systems and methods for determining optical flow
US9858496B2 (en) * 2016-01-20 2018-01-02 Microsoft Technology Licensing, Llc Object detection and classification in images

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104205169A (en) * 2011-12-21 2014-12-10 Pierre and Marie Curie University (University of Paris VI) Method of estimating optical flow on the basis of an asynchronous light sensor
CN107305635A (en) * 2016-04-15 2017-10-31 株式会社理光 Object identifying method, object recognition equipment and classifier training method
CN107481185A (en) * 2017-08-24 2017-12-15 深圳市唯特视科技有限公司 A kind of style conversion method based on video image optimization
CN107886120A (en) * 2017-11-03 2018-04-06 北京清瑞维航技术发展有限公司 Method and apparatus for target detection tracking
CN108062531A (en) * 2017-12-25 2018-05-22 南京信息工程大学 A kind of video object detection method that convolutional neural networks are returned based on cascade
CN108256496A (en) * 2018-02-01 2018-07-06 江南大学 A kind of stockyard smog detection method based on video

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Philipp Fischer et al. FlowNet: Learning Optical Flow with Convolutional Networks. arXiv, 2015, 1-13. *
Xizhou Zhu et al. Deep Feature Flow for Video Recognition. arXiv, 2016, 1-14. *
Xizhou Zhu et al. Flow-Guided Feature Aggregation for Video Object Detection. arXiv, 2017, 1-10. *
Wang Zhenglai et al. Optical flow detection method for moving targets based on deep convolutional neural networks. Opto-Electronic Engineering, 2018, Vol. 45, No. 8, 180027-1 to 180027-10. *

Also Published As

Publication number Publication date
CN109190581A (en) 2019-01-11

Similar Documents

Publication Publication Date Title
CN109190581B (en) Image sequence target detection and identification method
Wang et al. Joint filtering of intensity images and neuromorphic events for high-resolution noise-robust imaging
Zhang et al. Deep image deblurring: A survey
Nguyen et al. Super-resolution for biometrics: A comprehensive survey
Aleotti et al. Generative adversarial networks for unsupervised monocular depth prediction
Su et al. Deep video deblurring for hand-held cameras
Uittenbogaard et al. Privacy protection in street-view panoramas using depth and multi-view imagery
Song et al. Joint face hallucination and deblurring via structure generation and detail enhancement
Jinno et al. Multiple exposure fusion for high dynamic range image acquisition
Gajjar et al. New learning based super-resolution: use of DWT and IGMRF prior
Zheng et al. Learning Cross-scale Correspondence and Patch-based Synthesis for Reference-based Super-Resolution.
Yang et al. Depth recovery using an adaptive color-guided auto-regressive model
Younus et al. Effective and fast deepfake detection method based on haar wavelet transform
Liu et al. Satellite video super-resolution based on adaptively spatiotemporal neighbors and nonlocal similarity regularization
US11651581B2 (en) System and method for correspondence map determination
Su et al. Super-resolution without dense flow
Wu et al. Single-shot face anti-spoofing for dual pixel camera
Kim et al. High-quality depth map up-sampling robust to edge noise of range sensors
Guarnieri et al. Perspective registration and multi-frame super-resolution of license plates in surveillance videos
Liu et al. Depth-guided sparse structure-from-motion for movies and tv shows
Liu et al. Deep joint estimation network for satellite video super-resolution with multiple degradations
KR101921608B1 (en) Apparatus and method for generating depth information
Liu et al. Unsupervised optical flow estimation for differently exposed images in LDR domain
CN113421186A (en) Apparatus and method for unsupervised video super-resolution using a generation countermeasure network
Thapa et al. Learning to Remove Refractive Distortions from Underwater Images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant