CN109829398B - Target detection method in video based on three-dimensional convolution network - Google Patents

Target detection method in video based on three-dimensional convolution network

Info

Publication number
CN109829398B
CN109829398B (application CN201910041920.0A)
Authority
CN
China
Prior art keywords
network
target
dimensional
dimensional convolution
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910041920.0A
Other languages
Chinese (zh)
Other versions
CN109829398A (en)
Inventor
王田
李玮匡
单光存
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201910041920.0A priority Critical patent/CN109829398B/en
Publication of CN109829398A publication Critical patent/CN109829398A/en
Application granted granted Critical
Publication of CN109829398B publication Critical patent/CN109829398B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention relates to a target detection method in video based on a three-dimensional convolution network, which comprises the following steps: performing fusion training of the whole network with a cross-training method; extracting features with a three-dimensional convolution network so that context information from the preceding and following frames is fused; generating candidate regions with a candidate region generation network; pooling the candidate regions to a standard size with a region standard pooling method; classifying each candidate region and applying regression fine-tuning to its bounding box; and filtering redundant detection results with non-maximum suppression. The detection method detects targets with a two-stage detection framework based on a candidate region generation network; in the feature extraction process, to make full use of the temporal information of the image sequence in the video, several frames before and after the frame to be detected are stacked into a cube and features are extracted with a three-dimensional convolution network, thereby achieving accurate target detection in video.

Description

Target detection method in video based on three-dimensional convolution network
Technical Field
The invention relates to a processing technology of an image sequence in a video, in particular to a target detection method in the video based on a three-dimensional convolution network.
Background
Target detection in video is an important task in computer vision and is widely applied in fields such as autonomous driving and the visual navigation of unmanned aerial vehicles. Target detection in video requires that the bounding box coordinates of each target and the predicted target class be given in every frame of the image sequence in the video. Most existing target detection methods operate directly on single-frame images; if these methods are applied directly to detect targets in video, the temporal information of the image sequence cannot be used and detection precision drops. The three-dimensional convolution network is a common network for processing image sequences in video; compared with a two-dimensional convolution network it has an additional time dimension and can effectively extract the temporal information of an image sequence. To extract the temporal information of the image sequence in a video and obtain accurate detection results, it is therefore significant to study a target detection method in video based on a three-dimensional convolution network.
Disclosure of Invention
The problem addressed by the invention: to overcome the defects of the prior art, a high-precision target detection method in video based on a three-dimensional convolution network is provided that fully mines and utilizes the temporal information of the image sequence in the video, thereby improving detection precision.
The technical scheme provided by the invention is as follows: a target detection method in a video based on a three-dimensional convolution network is realized by the following steps:
step 1, reading the videos of the training samples and the corresponding labels from a database, decomposing each training video into a continuous sequence of N' frames, and stacking each frame of the image sequence with an equal number of preceding and following frames to obtain N cube structures, where N = N';
step 2, constructing a three-dimensional convolution feature extraction network, a candidate area generation network and a detection network, using the N cube structures and the corresponding labels obtained in the step 1, and performing fusion training on the three-dimensional convolution feature extraction network, the candidate area generation network and the detection network by using a cross training method to obtain the three-dimensional convolution feature extraction network, the candidate area generation network and the detection network which can be used for target detection in the video;
step 3, reading the video to be detected, decomposing it into a continuous sequence of M' frames, and stacking each frame with several preceding and following frames to obtain M cube structures, where M = M';
step 4, taking one of the M cube structures obtained in step 3 and extracting its features with the three-dimensional convolution feature extraction network to obtain the corresponding feature map;
step 5, inputting the feature map obtained in step 4 into the candidate region generation network and predicting candidate regions that may contain a target, obtaining the coordinates x_p, y_p, w_p, h_p of each candidate region and the probabilities P_is and P_not, where P_is is the probability that a target is present, P_not is the probability that no target is present, x_p and y_p are the horizontal and vertical coordinates of the center point of the candidate region, and w_p and h_p are the width and height of the candidate region;
step 6, setting a threshold P_threshold on the probability P_is that a target is present, and mapping the candidate regions whose P_is exceeds the set threshold P_threshold onto the feature map obtained in step 4;
step 7, performing region standard pooling on the regions mapped onto the feature map in step 6, pooling candidate regions of different sizes into fixed-size feature maps;
step 8, for each fixed-size feature map obtained in step 7, using the detection network to classify it and to apply regression fine-tuning to the bounding box, obtaining the classification category of the target, the probability P that the target belongs to that category, and the coordinates x, y, w, h of the target bounding box, where P is the probability that the target belongs to the category, x and y are the horizontal and vertical coordinates of the center point of the target bounding box, and w and h are the width and height of the target bounding box;
step 9, filtering detected targets with a high degree of overlap by non-maximum suppression: for each class of detected targets, calculating the ratio of the area of the intersection of the regions where the targets are located to the area of their union; when this ratio exceeds the specified threshold IOU_threshold, keeping only the detection result with the largest probability P of belonging to the category and filtering out the other detection results (a code sketch of this filtering follows the step list);
and step 10, repeating the processes from step 4 to step 9 for the M cubic structures obtained in step 3, and respectively detecting to obtain a detection result of each frame of the image sequence in the video.
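As noted in step 9, the following is a minimal sketch of the per-class non-maximum suppression used to filter overlapping detections. It is an illustration rather than the patent's implementation: boxes are assumed to be (x, y, w, h) tuples in center format, scores are the class probabilities P, and the function name nms and the default IoU threshold are hypothetical.

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box and drop any box whose IoU with a kept box exceeds the threshold."""
    def iou(a, b):
        # convert center-format (x, y, w, h) to corner coordinates
        ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
        bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
        iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
        ih = max(0.0, min(ay1, by1) - max(ay0, by0))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)  # highest probability first
    keep = []
    for k in order:
        if all(iou(boxes[k], boxes[j]) <= iou_threshold for j in keep):
            keep.append(k)
    return keep  # indices of the detections that survive suppression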
In step 1, the method for obtaining the cubic structure is as follows:
decomposing the video of a training sample into a continuous sequence of N' frames; for each frame of the N'-frame sequence, taking its l preceding frames and l following frames so as to capture a certain amount of temporal context, and stacking these 2l+1 frames to form a cube structure of size W × H × (2l+1), where W is the image width, H is the image height, and 2l+1 is the number of stacked frames; when fewer than l forward or backward frames are available at the beginning or end of the image sequence, zero padding is used so that the cube size remains W × H × (2l+1); this yields N cube structures of size W × H × (2l+1), with N = N'.
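A minimal numpy sketch of the cube construction described above, under the assumption that the video has already been decomposed into a list of single-channel frames of shape (H, W); the function name build_cubes, the default l = 2 and the temporal-axis-first layout are illustrative choices, not taken from the patent.

import numpy as np

def build_cubes(frames, l=2):
    """Stack each frame with its l preceding and l following frames; zero-pad at the sequence ends."""
    n = len(frames)
    h, w = frames[0].shape
    cubes = []
    for t in range(n):
        planes = []
        for k in range(t - l, t + l + 1):
            if 0 <= k < n:
                planes.append(frames[k])
            else:
                planes.append(np.zeros((h, w), dtype=frames[0].dtype))  # zero padding at the boundaries
        cubes.append(np.stack(planes, axis=0))  # shape (2l+1, H, W), i.e. a W x H x (2l+1) cube
    return cubes  # N cubes, N = N'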
The process in step 2 is as follows:
step 21, stacking several three-dimensional convolution layers and three-dimensional pooling layers to construct the three-dimensional convolution feature extraction network, pre-training this network on a video classification task with the Sports-1M database as training samples, and taking the resulting weights as the initial weights of the three-dimensional convolution feature extraction network;
step 22, constructing a candidate area generation network by using the two-dimensional convolution layer and the full-connection layer, and using the randomly initialized weight as an initial weight of the candidate area generation network;
step 23, constructing a detection network, wherein the detection network is composed of a classification sub-network and a regression sub-network, the structures of the classification sub-network and the regression sub-network are all full connection layers, and the weight value of random initialization is used as an initial weight value;
step 24, using the N cube structures and corresponding labels obtained in step 1 to train the candidate region generation network obtained in step 22 and the three-dimensional convolution feature extraction network obtained in step 21, with training loss function L_rpn = L_P + L_reg, where L_P is the cross entropy between the target-presence probability output by the candidate region generation network and the ground-truth value in the label, and L_reg is the sum of squared differences between the coordinate offsets output by the candidate region generation network and the coordinate offsets of the target region in the label;
step 25, training the detection network obtained in step 23 and the three-dimensional convolution feature extraction network obtained in step 21, the training loss function being the weighted sum of the classification loss and the coordinate regression loss of the detection network output;
step 26, repeat steps 24 and 25 several times until the loss function in steps 24 and 25 stabilizes.
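A minimal sketch of the alternating (cross) training loop of steps 24-26. The callables rpn_step and det_step are hypothetical stand-ins for one training pass of the candidate region generation network plus feature network (step 24) and of the detection network plus feature network (step 25); the tolerance-based stopping rule is one assumed reading of "until the loss functions stabilize".

def cross_train(rpn_step, det_step, max_rounds=10000, tol=1e-4):
    """Alternate the two training stages until the combined loss stops changing."""
    prev = float("inf")
    for _ in range(max_rounds):
        l_rpn = rpn_step()   # step 24: loss L_rpn = L_P + L_reg
        l_det = det_step()   # step 25: weighted classification + coordinate regression loss
        total = l_rpn + l_det
        if abs(prev - total) < tol:  # step 26: stop once the losses have stabilized
            break
        prev = total
    return prev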
In the step 4, the structure of the three-dimensional convolution feature extraction network is as follows:
the overall structure of the three-dimensional convolution feature extraction network consists of several nested three-dimensional convolution layers and three-dimensional pooling layers; the convolution kernel of a three-dimensional convolution is a tensor with three dimensions (length, width and height); in the output feature map, the response output at spatial coordinates (a, b, c) is calculated by:
H_{abc} = f\left(\sum_{i=0}^{s_w-1}\sum_{j=0}^{s_h-1}\sum_{g=0}^{s_l-1} W_{ijg}\, X_{(a+i)(b+j)(c+g)} + v\right)
in the above formula, W_ijg is the weight of the convolution kernel at position (i, j, g), X_(a+i)(b+j)(c+g) is the value of the input cube at position (a+i, b+j, c+g), v is the bias term, s_w, s_h and s_l are the width, height and length of the three-dimensional convolution kernel, H_abc is the response output at spatial coordinates (a, b, c), and f is the activation function;
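A direct (naive) numpy implementation of the response formula above for a single input cube and a single kernel; a ReLU is assumed for the activation f purely for illustration, and in practice an optimized three-dimensional convolution layer would be used instead of explicit loops.

import numpy as np

def conv3d_response(X, W, v=0.0, f=lambda t: np.maximum(t, 0.0)):
    """X: input cube; W: kernel of shape (s_w, s_h, s_l); v: bias; f: activation (ReLU assumed here)."""
    sw, sh, sl = W.shape
    A, B, C = X.shape[0] - sw + 1, X.shape[1] - sh + 1, X.shape[2] - sl + 1
    H = np.empty((A, B, C))
    for a in range(A):
        for b in range(B):
            for c in range(C):
                # H_abc = f( sum_{i,j,g} W_ijg * X_(a+i)(b+j)(c+g) + v )
                H[a, b, c] = f(np.sum(W * X[a:a + sw, b:b + sh, c:c + sl]) + v)
    return H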
in step 5, the process of generating the candidate region is as follows:
step 51, for the feature map obtained by the three-dimensional convolution feature extraction network in the step 4, sliding on the feature map by using a two-dimensional convolution kernel with the size of 3 × 3, performing convolution calculation, and obtaining a 512-dimensional vector at each position where the convolution kernel slides;
step 52, setting 9 anchor boxes as references at each sliding position of the convolution kernel, with aspect ratios of 1:2, 1:1 and 2:1 and with three area sizes of 128², 256² and 512² pixels, the center point of each anchor box being the center of the sliding window;
step 53, passing the 512-dimensional vector obtained at each sliding position in step 51 through a fully connected network to output nine 6-dimensional vectors, which represent the offsets d_x, d_y, d_h, d_w of the center point coordinates, height and width of the candidate region with respect to the anchor boxes set in step 52 and the probabilities P_is, P_not that a target is present or absent, where d_x = (x_p - x_a)/w_a, d_y = (y_p - y_a)/h_a, d_h = log(h_p/h_a), d_w = log(w_p/w_a); x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region, x_a, y_a, h_a, w_a are the center point coordinates, height and width of the anchor box, and P_is, P_not are normalized with a softmax function to represent the probability that a target is or is not present;
step 54, from the offsets d_x, d_y, d_h, d_w obtained in step 53 and the center point coordinates, height and width x_a, y_a, h_a, w_a of the anchor boxes set in step 52, calculating the actual center point coordinates, width and height x_p, y_p, w_p, h_p of the generated candidate region.
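A small sketch of step 54, inverting the offset parameterization d_x = (x_p - x_a)/w_a, d_y = (y_p - y_a)/h_a, d_h = log(h_p/h_a), d_w = log(w_p/w_a) to recover the candidate region from an anchor box; the function and argument names are illustrative.

import math

def decode_candidate(offsets, anchor):
    """offsets = (d_x, d_y, d_h, d_w); anchor = (x_a, y_a, h_a, w_a); returns (x_p, y_p, w_p, h_p)."""
    dx, dy, dh, dw = offsets
    xa, ya, ha, wa = anchor
    xp = dx * wa + xa
    yp = dy * ha + ya
    hp = ha * math.exp(dh)
    wp = wa * math.exp(dw)
    return xp, yp, wp, hp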
In step 7, the region standard pooling process is as follows:
step 71, denoting the size of the region to be pooled as m × n, dividing the region to be pooled into 7 × 7 small cells of size approximately m/7 × n/7, and rounding approximately when m/7 or n/7 is not an integer;
step 72, within each small cell divided in step 71, max-pooling the features in the cell into a 1 × 1 value, so that feature regions of different sizes are pooled into fixed-size 7 × 7 feature maps;
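A minimal single-channel numpy sketch of the region standard pooling of steps 71-72: the m × n region is split into an approximately even 7 × 7 grid and each cell is max-pooled, so regions of any size map to a fixed 7 × 7 output. The integer cell boundaries are one reasonable reading of "rounding approximately" and are an assumption of this sketch.

import numpy as np

def region_standard_pool(region, out_size=7):
    """region: 2D feature region of size m x n; returns a fixed out_size x out_size map."""
    m, n = region.shape
    pooled = np.zeros((out_size, out_size), dtype=region.dtype)
    for i in range(out_size):
        r0 = (i * m) // out_size
        r1 = max(-(-((i + 1) * m) // out_size), r0 + 1)   # ceiling division, at least one row per cell
        for j in range(out_size):
            c0 = (j * n) // out_size
            c1 = max(-(-((j + 1) * n) // out_size), c0 + 1)
            pooled[i, j] = region[r0:r1, c0:c1].max()      # max pooling inside the cell
    return pooled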
in step 8, the process of classifying the candidate region and fine-tuning the bounding box by the detection network is as follows:
step 81, flattening the fixed-size feature maps obtained in step 7 into one-dimensional vectors and inputting them into the classification sub-network and the regression sub-network respectively;
step 82, the classification sub-network outputs an (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} through two fully connected layers, where p_1, p_2, ..., p_n represent the probabilities that the candidate region belongs to each of the n target classes, p_{n+1} represents the probability that the candidate region belongs to the background, the network output layer uses a softmax function as the activation function, and n is the number of target categories to be detected;
step 83, the regression sub-network outputs a 4-dimensional vector t_x, t_y, t_w, t_h through two fully connected layers, representing the offset of the target bounding box with respect to the candidate region: t_x = (x - x_p)/w_a, t_y = (y - y_p)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), where x, y are the horizontal and vertical coordinates of the center point of the target bounding box, w, h are the width and height of the target bounding box, and x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region;
step 84, finding the maximum of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} obtained in step 82; if the maximum is p_{n+1}, the candidate region is background and no output is produced; otherwise, the category of the target is decided by the maximum value, the bounding box coordinates x, y, w, h of the target are calculated from {t_x, t_y, t_w, t_h}, and the maximum value of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} is taken as the probability P that the target belongs to that category.
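A sketch of the decision logic of step 84 combined with the inverse of the step-83 parameterization t_x = (x - x_p)/w_a, t_y = (y - y_p)/h_a, t_w = log(w/w_a), t_h = log(h/h_a). Passing the anchor width and height (w_a, h_a) alongside the candidate box follows the formulas as written; all names are illustrative.

import math

def decode_detection(probs, t, candidate, anchor_size):
    """probs: n+1 class probabilities (last = background); t = (t_x, t_y, t_w, t_h); candidate = (x_p, y_p, w_p, h_p)."""
    cls = max(range(len(probs)), key=lambda k: probs[k])
    if cls == len(probs) - 1:
        return None                      # background: the candidate region is not output
    tx, ty, tw, th = t
    xp, yp = candidate[0], candidate[1]
    wa, ha = anchor_size
    x = tx * wa + xp                     # invert the bounding-box regression
    y = ty * ha + yp
    w = wa * math.exp(tw)
    h = ha * math.exp(th)
    return cls, probs[cls], (x, y, w, h)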
In summary, the method for accurate target detection in video based on three-dimensional convolution according to the present invention comprises: performing fusion training of the whole network with a cross-training method; extracting features with a three-dimensional convolution network so that context information from the preceding and following frames is fused; generating candidate regions with a candidate region generation network; pooling the candidate regions to a standard size with a region standard pooling method; classifying each candidate region and applying regression fine-tuning to its bounding box; and filtering redundant detection results with non-maximum suppression. The detection method detects targets with a two-stage detection framework based on a candidate region generation network; in the feature extraction process, to make full use of the temporal information of the image sequence in the video, several frames before and after the frame to be detected are stacked into a cube and features are extracted with a three-dimensional convolution network, thereby achieving accurate target detection in video.
Compared with the prior art, the invention has the following advantages: unlike a single-frame image, the image sequence in a video has temporal ordering between consecutive frames and therefore contains rich temporal information. The method introduces a three-dimensional convolution network into target detection in video to extract the temporal information of the continuous frame sequence, fully mining and utilizing the temporal features between consecutive frames and thereby achieving accurate target detection in video.
Drawings
Fig. 1 is a schematic flow chart of the implementation of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention relates to a target detection method in video based on a three-dimensional convolution network, which comprises: performing fusion training of the whole network with a cross-training method; extracting features with a three-dimensional convolution network so that context information from the preceding and following frames is fused; generating candidate regions with a candidate region generation network; pooling the candidate regions to a standard size with a region standard pooling method; classifying each candidate region and applying regression fine-tuning to its bounding box; and filtering redundant detection results with non-maximum suppression. The detection method detects targets with a two-stage detection framework based on a candidate region generation network; in the feature extraction process, to make full use of the temporal information of the image sequence in the video, several frames before and after the frame to be detected are stacked into a cube and features are extracted with a three-dimensional convolution network, thereby achieving accurate target detection in video.
As shown in fig. 1, the present invention specifically implements the following steps:
step 1, reading the videos of the training samples and the corresponding labels from a database, decomposing each training video into a continuous sequence of N' frames, and stacking each frame of the image sequence with an equal number of preceding and following frames to obtain N cube structures, where N = N';
step 2, constructing a three-dimensional convolution feature extraction network, a candidate region generation network and a detection network, and, using the N cube structures and the corresponding labels obtained in step 1, performing fusion training on these three networks with a cross-training method to obtain a three-dimensional convolution feature extraction network, candidate region generation network and detection network that can be used for target detection in the video;
step 3, reading the video to be detected, decomposing it into a continuous sequence of M' frames, and stacking each frame with several preceding and following frames to obtain M cube structures, where M = M';
step 4, selecting one of the M cube structures obtained in step 3 and extracting its features with the three-dimensional convolution feature extraction network to obtain the corresponding feature map;
step 5, inputting the feature map obtained in step 4 into the candidate region generation network and predicting candidate regions that may contain a target, obtaining the coordinates x_p, y_p, w_p, h_p of each candidate region and the probabilities P_is and P_not, where P_is is the probability that a target is present, P_not is the probability that no target is present, x_p and y_p are the horizontal and vertical coordinates of the center point of the candidate region, and w_p and h_p are the width and height of the candidate region;
step 6, setting a threshold P_threshold on the probability P_is that a target is present, and mapping the candidate regions whose P_is exceeds the set threshold P_threshold onto the feature map of step 4;
step 7, performing region standard pooling on the regions mapped onto the feature map in step 6, pooling candidate regions of different sizes into fixed-size feature maps;
step 8, for each fixed-size feature map obtained in step 7, using the detection network to classify it and to apply regression fine-tuning to the bounding box, obtaining the classification category of the target, the probability P that the target belongs to that category, and the coordinates x, y, w, h of the target bounding box, where P is the probability that the target belongs to the category, x and y are the horizontal and vertical coordinates of the center point of the target bounding box, and w and h are the width and height of the target bounding box;
step 9, filtering detected targets with a high degree of overlap by non-maximum suppression: for each class of detected targets, calculating the ratio of the area of the intersection of the regions where the targets are located to the area of their union; when this ratio exceeds the specified threshold IOU_threshold, keeping only the detection result with the largest probability P of belonging to the category and filtering out the other detection results;
and step 10, repeating the processes from step 4 to step 9 for the M cubic structures obtained in step 3, and respectively detecting to obtain a detection result of each frame of the image sequence in the video.
In step 1, the method for obtaining the cubic structure is as follows:
decomposing the video of a training sample into a continuous sequence of N' frames; for each frame of the N'-frame sequence, taking its 2 preceding frames and 2 following frames so as to capture a certain amount of temporal context, and stacking these 5 frames to form a cube structure of size W × H × 5, where W is the image width, H is the image height, and 5 is the number of stacked frames; when fewer than 2 forward or backward frames are available at the beginning or end of the image sequence, zero padding is used so that the cube size remains W × H × 5, yielding N cube structures of size W × H × 5, with N = N';
the process in step 2 is as follows:
step 21, stacking 5 three-dimensional convolution layers and three-dimensional pooling layers to construct the three-dimensional convolution feature extraction network, pre-training this network on a video classification task with the Sports-1M database as training samples, and taking the resulting weights as the initial weights of the three-dimensional convolution feature extraction network;
step 22, constructing a candidate area generation network by using a layer of two-dimensional convolution layer and two layers of full connection layers, and using a randomly initialized weight as an initial weight of the candidate area generation network;
step 23, constructing a detection network, wherein the detection network consists of a classification sub-network and a regression sub-network, the classification sub-network and the regression sub-network are both two fully-connected layers, and the randomly initialized weight is used as an initial weight;
step 24, using the N cube structures and corresponding labels obtained in step 1 to train the candidate region generation network obtained in step 22 and the three-dimensional convolution feature extraction network obtained in step 21, with training loss function L_rpn = L_P + L_reg, where L_P is the cross entropy between the target-presence probability output by the candidate region generation network and the ground-truth value in the label, and L_reg is the sum of squared differences between the coordinate offsets output by the candidate region generation network and the coordinate offsets of the target region in the label;
step 25, training the detection network obtained in step 23 and the three-dimensional convolution feature extraction network obtained in step 21, the training loss function being the weighted sum of the classification loss and the coordinate regression loss of the detection network output;
step 26, repeating step 24 and step 25 for 10000 times in total until the loss function in step 24 and step 25 is stable;
in the step 4, the structure of the three-dimensional convolution feature extraction network is as follows:
the overall structure of the three-dimensional convolution feature extraction network consists of 5 nested three-dimensional convolution layers and 5 three-dimensional pooling layers; the convolution kernel of a three-dimensional convolution is a tensor with three dimensions (length, width and height); in the output feature map, the response output at spatial coordinates (a, b, c) is calculated by:
H_{abc} = f\left(\sum_{i=0}^{s_w-1}\sum_{j=0}^{s_h-1}\sum_{g=0}^{s_l-1} W_{ijg}\, X_{(a+i)(b+j)(c+g)} + v\right)
in the above formula, W_ijg is the weight of the convolution kernel at position (i, j, g), X_(a+i)(b+j)(c+g) is the value of the input cube at position (a+i, b+j, c+g), v is the bias term, s_w, s_h and s_l are the width, height and length of the three-dimensional convolution kernel, H_abc is the response output at spatial coordinates (a, b, c), and f is the activation function;
in step 5, the process of generating the candidate region is as follows:
step 51, for the feature map obtained by the three-dimensional convolution feature extraction network in the step 4, sliding on the feature map by using a two-dimensional convolution kernel with the size of 3 × 3, performing convolution calculation, and obtaining a 512-dimensional vector at each position where the convolution kernel slides;
step 52, setting 9 anchor boxes as references at each sliding position of the convolution kernel, with aspect ratios of 1:2, 1:1 and 2:1 and with three area sizes of 128², 256² and 512² pixels, the center point of each anchor box being the center of the sliding window;
step 53, passing the 512-dimensional vector obtained at each sliding position in step 51 through a fully connected network to output nine 6-dimensional vectors, which represent the offsets d_x, d_y, d_h, d_w of the center point coordinates, height and width of the candidate region with respect to the anchor boxes set in step 52 and the probabilities P_is, P_not that a target is present or absent, where d_x = (x_p - x_a)/w_a, d_y = (y_p - y_a)/h_a, d_h = log(h_p/h_a), d_w = log(w_p/w_a); x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region, x_a, y_a, h_a, w_a are the center point coordinates, height and width of the anchor box, and P_is, P_not are normalized with a softmax function to represent the probability that a target is or is not present;
step 54, from the offsets d_x, d_y, d_h, d_w obtained in step 53 and the center point coordinates, height and width x_a, y_a, h_a, w_a of the anchor boxes set in step 52, calculating the actual center point coordinates, width and height x_p, y_p, w_p, h_p of the generated candidate region;
In step 7, the regional standard pooling process is as follows:
step 71, denoting the size of the region to be pooled as m × n, dividing the region to be pooled into 7 × 7 small cells of size approximately m/7 × n/7, and rounding approximately when m/7 or n/7 is not an integer;
step 72, within each small cell divided in step 71, max-pooling the features in the cell into a 1 × 1 value, so that feature regions of different sizes are pooled into fixed-size 7 × 7 feature maps;
in step 8, the process of classifying the candidate region and fine-tuning the bounding box by the detection network is as follows:
step 81, flattening the fixed-size feature maps obtained in step 7 into one-dimensional vectors and inputting them into the classification sub-network and the bounding box regression sub-network respectively;
step 82, the classification sub-network outputs an (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} through two fully connected layers, where p_1, p_2, ..., p_n represent the probabilities that the candidate region belongs to each of the n target classes, p_{n+1} represents the probability that the candidate region belongs to the background, the network output layer uses a softmax function as the activation function, and n is the number of target categories to be detected;
step 83, the regression sub-network outputs a 4-dimensional vector t_x, t_y, t_w, t_h through two fully connected layers, representing the offset of the target bounding box with respect to the candidate region: t_x = (x - x_p)/w_a, t_y = (y - y_p)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), where x, y are the horizontal and vertical coordinates of the center point of the target bounding box, w, h are the width and height of the target bounding box, and x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region;
step 84, finding the maximum of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} obtained in step 82; if the maximum is p_{n+1}, the candidate region is background and no output is produced; otherwise, the category of the target is decided by the maximum value, the bounding box coordinates x, y, w, h of the target are calculated from {t_x, t_y, t_w, t_h}, and the maximum value of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} is taken as the probability P that the target belongs to that category.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A target detection method in a video based on a three-dimensional convolution network is characterized by comprising the following steps:
step 1, reading the videos of the training samples and the corresponding labels from a database, decomposing each training video into a continuous sequence of N' frames, and stacking each frame of the image sequence with an equal number of preceding and following frames to obtain N cube structures, where N = N';
step 2, constructing a three-dimensional convolution feature extraction network, a candidate area generation network and a detection network, using the N cube structures and the corresponding labels obtained in the step 1, and performing fusion training on the three-dimensional convolution feature extraction network, the candidate area generation network and the detection network by using a cross training method to obtain the three-dimensional convolution feature extraction network, the candidate area generation network and the detection network for target detection in the video;
step 3, reading the video to be detected, decomposing it into a continuous sequence of M' frames, and stacking each frame with several preceding and following frames to obtain M cube structures, where M = M';
step 4, taking one of the M cube structures obtained in step 3 and extracting its features with the three-dimensional convolution feature extraction network to obtain the corresponding feature map;
step 5, inputting the feature map obtained in step 4 into the candidate region generation network and predicting candidate regions that may contain a target, obtaining the coordinates x_p, y_p, w_p, h_p of each candidate region and the probabilities P_is and P_not, where P_is is the probability that a target is present, P_not is the probability that no target is present, x_p and y_p are the horizontal and vertical coordinates of the center point of the candidate region, and w_p and h_p are the width and height of the candidate region;
step 6, setting a threshold P_threshold on the probability P_is that a target is present, and mapping the candidate regions whose P_is exceeds the set threshold P_threshold onto the feature map obtained in step 4;
step 7, performing region standard pooling on the regions mapped onto the feature map in step 6, pooling candidate regions of different sizes into fixed-size feature maps;
step 8, for each fixed-size feature map obtained in step 7, using the detection network to classify it and to apply regression fine-tuning to the bounding box, obtaining the classification category of the target, the probability P that the target belongs to that category, and the coordinates x, y, w, h of the target bounding box, where P is the probability that the target belongs to the category, x and y are the horizontal and vertical coordinates of the center point of the target bounding box, and w and h are the width and height of the target bounding box;
step 9, filtering detected targets with a high degree of overlap by non-maximum suppression: for each class of detected targets, calculating the ratio of the area of the intersection of the regions where the targets are located to the area of their union; when this ratio exceeds the specified threshold IOU_threshold, keeping only the detection result with the largest probability P of belonging to the category and filtering out the other detection results;
step 10, repeating the processes from step 4 to step 9 for the M cubic structures obtained in step 3, and respectively detecting to obtain a detection result of each frame of the image sequence in the video;
the process of the step 2 is as follows:
step 21, stacking several three-dimensional convolution layers and three-dimensional pooling layers to construct the three-dimensional convolution feature extraction network, pre-training this network on a video classification task with the Sports-1M database as training samples, and taking the resulting weights as the initial weights of the three-dimensional convolution feature extraction network;
step 22, constructing a candidate area generation network by using the two-dimensional convolution layer and the full-connection layer, and using the randomly initialized weight as an initial weight of the candidate area generation network;
step 23, constructing a detection network, wherein the detection network is composed of a classification sub-network and a regression sub-network, the structures of the classification sub-network and the regression sub-network are all full connection layers, and the weight value of random initialization is used as an initial weight value;
step 24, using the N cube structures and corresponding labels obtained in step 1 to train the candidate region generation network obtained in step 22 and the three-dimensional convolution feature extraction network obtained in step 21, with training loss function L_rpn = L_P + L_reg, where L_P is the cross entropy between the target-presence probability output by the candidate region generation network and the ground-truth value in the label, and L_reg is the sum of squared differences between the coordinate offsets output by the candidate region generation network and the coordinate offsets of the target region in the label;
step 25, training the detection network obtained in step 23 and the three-dimensional convolution feature extraction network obtained in step 21, the training loss function being the weighted sum of the classification loss and the coordinate regression loss of the detection network output;
step 26, repeating step 24 and step 25 several times until the loss functions in step 24 and step 25 are stable;
in step 5, the process of generating the candidate region is as follows:
step 51, for the feature map obtained by the three-dimensional convolution feature extraction network in the step 4, sliding on the feature map by using a two-dimensional convolution kernel with the size of 3 × 3, performing convolution calculation, and obtaining a 512-dimensional vector at each position where the convolution kernel slides;
step 52, setting 9 anchor boxes as references at each sliding position of the convolution kernel, with aspect ratios of 1:2, 1:1 and 2:1 and with three area sizes of 128², 256² and 512² pixels, the center point of each anchor box being the center of the sliding window;
step 53, passing the 512-dimensional vector obtained at each sliding position in step 51 through a fully connected network to output nine 6-dimensional vectors, which represent the offsets d_x, d_y, d_h, d_w of the center point coordinates, height and width of the candidate region with respect to the anchor boxes set in step 52 and the probabilities P_is, P_not that a target is present or absent, where d_x = (x_p - x_a)/w_a, d_y = (y_p - y_a)/h_a, d_h = log(h_p/h_a), d_w = log(w_p/w_a); x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region, x_a, y_a, h_a, w_a are the center point coordinates, height and width of the anchor box, and P_is, P_not are normalized with a softmax function to represent the probability that a target is or is not present;
step 54, from the offsets d_x, d_y, d_h, d_w obtained in step 53 and the center point coordinates, height and width x_a, y_a, h_a, w_a of the anchor boxes set in step 52, calculating the actual center point coordinates, width and height x_p, y_p, w_p, h_p of the generated candidate region.
2. The method for detecting the target in the video based on the three-dimensional convolutional network as claimed in claim 1, wherein: in step 1, the method for obtaining the cubic structure is as follows:
decomposing the video of a training sample into a continuous sequence of N' frames; for each frame of the N'-frame sequence, taking its l preceding frames and l following frames so as to capture a certain amount of temporal context, and stacking these 2l+1 frames to form a cube structure of size W × H × (2l+1), where W is the image width, H is the image height, and 2l+1 is the number of stacked frames; when fewer than l forward or backward frames are available at the beginning or end of the image sequence, zero padding is used so that the cube size remains W × H × (2l+1); this yields N cube structures of size W × H × (2l+1), with N = N'.
3. The method for detecting the target in the video based on the three-dimensional convolutional network as claimed in claim 1, wherein: in the step 4, the structure of the three-dimensional convolution feature extraction network is as follows:
the overall structure of the three-dimensional convolution feature extraction network consists of several nested three-dimensional convolution layers and three-dimensional pooling layers; the convolution kernel of a three-dimensional convolution is a tensor with three dimensions (length, width and height); in the output feature map, the response output at spatial coordinates (a, b, c) is calculated by:
H_{abc} = f\left(\sum_{i=0}^{s_w-1}\sum_{j=0}^{s_h-1}\sum_{g=0}^{s_l-1} W_{ijg}\, X_{(a+i)(b+j)(c+g)} + v\right)
in the above formula, W_ijg is the weight of the convolution kernel at position (i, j, g), X_(a+i)(b+j)(c+g) is the value of the input cube at position (a+i, b+j, c+g), v is the bias term, s_w, s_h and s_l are the width, height and length of the three-dimensional convolution kernel, H_abc is the response output at spatial coordinates (a, b, c), and f is the activation function.
4. The method for detecting the target in the video based on the three-dimensional convolutional network as claimed in claim 1, wherein: in step 7, the regional standard pooling process is as follows:
step 71, denoting the size of the region to be pooled as m × n, dividing the region to be pooled into 7 × 7 small cells of size approximately m/7 × n/7, and rounding approximately when m/7 or n/7 is not an integer;
step 72, within each small cell divided in step 71, max-pooling the features in the cell into a 1 × 1 value, so that feature regions of different sizes are pooled into fixed-size 7 × 7 feature maps.
5. The method for detecting the target in the video based on the three-dimensional convolutional network as claimed in claim 1, wherein: in step 8, the process of classifying the candidate region and fine-tuning the bounding box by the detection network is as follows:
step 81, flattening the fixed-size feature maps obtained in step 7 into one-dimensional vectors and inputting them into the classification sub-network and the regression sub-network respectively;
step 82, the classification sub-network outputs an (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} through two fully connected layers, where p_1, p_2, ..., p_n represent the probabilities that the candidate region belongs to each of the n target classes, p_{n+1} represents the probability that the candidate region belongs to the background, the network output layer uses a softmax function as the activation function, and n is the number of target categories to be detected;
step 83, the bounding box regression sub-network outputs a 4-dimensional vector t_x, t_y, t_w, t_h through two fully connected layers, representing the offset of the target bounding box with respect to the candidate region: t_x = (x - x_p)/w_a, t_y = (y - y_p)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), where x, y are the horizontal and vertical coordinates of the center point of the target bounding box, w, h are the width and height of the target bounding box, and x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region;
step 84, finding the maximum of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} obtained in step 82; if the maximum is p_{n+1}, the candidate region is background and no output is produced; otherwise, the category of the target is decided by the maximum value, the bounding box coordinates x, y, w, h of the target are calculated from {t_x, t_y, t_w, t_h}, and the maximum value of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} is taken as the probability P that the target belongs to that category.
CN201910041920.0A 2019-01-16 2019-01-16 Target detection method in video based on three-dimensional convolution network Active CN109829398B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910041920.0A CN109829398B (en) 2019-01-16 2019-01-16 Target detection method in video based on three-dimensional convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910041920.0A CN109829398B (en) 2019-01-16 2019-01-16 Target detection method in video based on three-dimensional convolution network

Publications (2)

Publication Number Publication Date
CN109829398A CN109829398A (en) 2019-05-31
CN109829398B true CN109829398B (en) 2020-03-31

Family

ID=66860338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910041920.0A Active CN109829398B (en) 2019-01-16 2019-01-16 Target detection method in video based on three-dimensional convolution network

Country Status (1)

Country Link
CN (1) CN109829398B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287826B (en) * 2019-06-11 2021-09-17 北京工业大学 Video target detection method based on attention mechanism
CN110264457B (en) * 2019-06-20 2020-12-15 浙江大学 Welding seam autonomous identification method based on rotating area candidate network
CN110334752B (en) * 2019-06-26 2022-11-08 电子科技大学 Irregular-shape object detection method based on trapezoidal convolution
CN110473284B (en) * 2019-07-29 2021-02-12 电子科技大学 Moving object three-dimensional model reconstruction method based on deep learning
CN110533691B (en) * 2019-08-15 2021-10-22 合肥工业大学 Target tracking method, device and storage medium based on multiple classifiers
CN111199199B (en) * 2019-12-27 2023-05-05 同济大学 Action recognition method based on self-adaptive context area selection
CN111160255B (en) * 2019-12-30 2022-07-29 成都数之联科技股份有限公司 Fishing behavior identification method and system based on three-dimensional convolution network
CN111144376B (en) * 2019-12-31 2023-12-05 华南理工大学 Video target detection feature extraction method
CN111310609B (en) * 2020-01-22 2023-04-07 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN111178344B (en) * 2020-04-15 2020-07-17 中国人民解放军国防科技大学 Multi-scale time sequence behavior identification method
CN111624659B (en) * 2020-06-05 2022-07-01 中油奥博(成都)科技有限公司 Time-varying band-pass filtering method and device for seismic data
CN112016569A (en) * 2020-07-24 2020-12-01 驭势科技(南京)有限公司 Target detection method, network, device and storage medium based on attention mechanism
CN112215123B (en) * 2020-10-09 2022-10-25 腾讯科技(深圳)有限公司 Target detection method, device and storage medium
CN112613428B (en) * 2020-12-28 2024-03-22 易采天成(郑州)信息技术有限公司 Resnet-3D convolution cattle video target detection method based on balance loss
CN112733747A (en) * 2021-01-14 2021-04-30 哈尔滨市科佳通用机电股份有限公司 Identification method, system and device for relieving falling fault of valve pull rod
CN115082713B (en) * 2022-08-24 2022-11-25 中国科学院自动化研究所 Method, system and equipment for extracting target detection frame by introducing space contrast information

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975941A (en) * 2016-05-31 2016-09-28 电子科技大学 Multidirectional vehicle model detection recognition system based on deep learning
CN107145889A (en) * 2017-04-14 2017-09-08 中国人民解放军国防科学技术大学 Target identification method based on double CNN networks with RoI ponds
CN107506740A (en) * 2017-09-04 2017-12-22 北京航空航天大学 A kind of Human bodys' response method based on Three dimensional convolution neutral net and transfer learning model
CN107808150A (en) * 2017-11-20 2018-03-16 珠海习悦信息技术有限公司 The recognition methods of human body video actions, device, storage medium and processor
CN108537286A (en) * 2018-04-18 2018-09-14 北京航空航天大学 A kind of accurate recognition methods of complex target based on key area detection
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107527363B (en) * 2016-06-20 2022-01-25 青岛海尔智能技术研发有限公司 Refrigerating device storage management system and refrigerating device
CN106203283A (en) * 2016-06-30 2016-12-07 重庆理工大学 Based on Three dimensional convolution deep neural network and the action identification method of deep video
US10366292B2 (en) * 2016-11-03 2019-07-30 Nec Corporation Translating video to language using adaptive spatiotemporal convolution feature representation with dynamic abstraction
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975941A (en) * 2016-05-31 2016-09-28 电子科技大学 Multidirectional vehicle model detection recognition system based on deep learning
CN107145889A (en) * 2017-04-14 2017-09-08 中国人民解放军国防科学技术大学 Target identification method based on double CNN networks with RoI ponds
CN107506740A (en) * 2017-09-04 2017-12-22 北京航空航天大学 A kind of Human bodys' response method based on Three dimensional convolution neutral net and transfer learning model
CN107808150A (en) * 2017-11-20 2018-03-16 珠海习悦信息技术有限公司 The recognition methods of human body video actions, device, storage medium and processor
CN108537286A (en) * 2018-04-18 2018-09-14 北京航空航天大学 A kind of accurate recognition methods of complex target based on key area detection
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Using Gabor Filter in 3D Convolutional Neural Networks for Human Action Recognition; Li, Jiakun et al.; Proceedings of the 36th Chinese Control Conference (CCC 2017); 2017-07-28; pp. 11139-11144 *
A Violence Video Detection Method Based on a Three-Dimensional Convolutional Network (一种基于三维卷积网络的暴力视频检测方法); Song Wei et al.; 《技术研究》 (Technology Research); 2017-12-31, No. 12; pp. 54-60 *

Also Published As

Publication number Publication date
CN109829398A (en) 2019-05-31

Similar Documents

Publication Publication Date Title
CN109829398B (en) Target detection method in video based on three-dimensional convolution network
CN108510467B (en) SAR image target identification method based on depth deformable convolution neural network
US9846946B2 (en) Objection recognition in a 3D scene
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN112084869B (en) Compact quadrilateral representation-based building target detection method
CN111462200A (en) Cross-video pedestrian positioning and tracking method, system and equipment
CN107767400B (en) Remote sensing image sequence moving target detection method based on hierarchical significance analysis
EP2874097A2 (en) Automatic scene parsing
CN110309842B (en) Object detection method and device based on convolutional neural network
JP6397379B2 (en) CHANGE AREA DETECTION DEVICE, METHOD, AND PROGRAM
CN112435338B (en) Method and device for acquiring position of interest point of electronic map and electronic equipment
JP5833507B2 (en) Image processing device
JP6095817B1 (en) Object detection device
CN108229416A (en) Robot SLAM methods based on semantic segmentation technology
CN110929649B (en) Network and difficult sample mining method for small target detection
CN114972968A (en) Tray identification and pose estimation method based on multiple neural networks
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN111882586A (en) Multi-actor target tracking method oriented to theater environment
CN111323024A (en) Positioning method and device, equipment and storage medium
CN114926747A (en) Remote sensing image directional target detection method based on multi-feature aggregation and interaction
Ferguson et al. A 2d-3d object detection system for updating building information models with mobile robots
CN112396036A (en) Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction
CN116109950A (en) Low-airspace anti-unmanned aerial vehicle visual detection, identification and tracking method
CN112926426A (en) Ship identification method, system, equipment and storage medium based on monitoring video
CN108256444B (en) Target detection method for vehicle-mounted vision system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant