CN109829398B - Target detection method in video based on three-dimensional convolution network - Google Patents

Target detection method in video based on three-dimensional convolution network

Info

Publication number
CN109829398B
CN109829398B (application CN201910041920.0A)
Authority
CN
China
Prior art keywords
network
target
dimensional
dimensional convolution
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910041920.0A
Other languages
Chinese (zh)
Other versions
CN109829398A (en)
Inventor
王田
李玮匡
单光存
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201910041920.0A priority Critical patent/CN109829398B/en
Publication of CN109829398A publication Critical patent/CN109829398A/en
Application granted granted Critical
Publication of CN109829398B publication Critical patent/CN109829398B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention relates to a target detection method in video based on a three-dimensional convolution network, which comprises the following steps: performing fusion training of the whole network with a cross-training method; extracting features with a three-dimensional convolution network so that context information from the preceding and following frames is fused; generating candidate regions with a candidate region generation network; pooling the candidate regions to a standard size with a region standard pooling method; classifying each candidate region and applying regression fine-tuning to its bounding box; and filtering redundant detection results with non-maximum suppression. The detection method detects targets with a two-stage detection framework based on a candidate region generation network; in the feature extraction process, to make full use of the temporal information of the image sequence in the video, several frames before and after the frame to be detected are stacked into a cube and features are extracted with a three-dimensional convolution network, thereby achieving accurate target detection in video.

Description

Target detection method in video based on three-dimensional convolution network
Technical Field
The invention relates to a processing technology of an image sequence in a video, in particular to a target detection method in the video based on a three-dimensional convolution network.
Background
Target detection in video is an important task in computer vision and is widely applied in fields such as autonomous driving and the visual navigation of unmanned aerial vehicles. Target detection in video requires that the bounding box coordinates of each target and the predicted target class be given in every frame of the image sequence in the video. Most existing target detection methods operate directly on single-frame images; if these methods are applied directly to detect targets in video, the temporal information of the image sequence cannot be used and detection precision drops. The three-dimensional convolution network is a common network for processing image sequences in video; compared with a two-dimensional convolution network it has an additional time dimension and can effectively extract the temporal information of an image sequence. To extract the temporal information of the image sequence in a video and obtain accurate detection results, it is therefore significant to study a target detection method in video based on a three-dimensional convolution network.
Disclosure of Invention
The problem addressed by the invention: to overcome the defects of the prior art, a high-precision target detection method in video based on a three-dimensional convolution network is provided that fully mines and utilizes the temporal information of the image sequence in the video, thereby improving detection precision.
The technical scheme provided by the invention is as follows: a target detection method in a video based on a three-dimensional convolution network is realized by the following steps:
step 1, reading the videos of the training samples and the corresponding labels from a database, decomposing each training video into a continuous sequence of N' frames, and stacking each frame of the image sequence with an equal number of preceding and following frames to obtain N cube structures, where N = N';
step 2, constructing a three-dimensional convolution feature extraction network, a candidate area generation network and a detection network, using the N cube structures and the corresponding labels obtained in the step 1, and performing fusion training on the three-dimensional convolution feature extraction network, the candidate area generation network and the detection network by using a cross training method to obtain the three-dimensional convolution feature extraction network, the candidate area generation network and the detection network which can be used for target detection in the video;
step 3, reading the video to be detected, decomposing it into a continuous sequence of M' frames, and stacking each frame with several preceding and following frames to obtain M cube structures, where M = M';
step 4, taking one of the M cube structures obtained in step 3 and extracting its features with the three-dimensional convolution feature extraction network to obtain the corresponding feature map;
step 5, inputting the feature map obtained in step 4 into the candidate region generation network and predicting candidate regions that may contain a target, obtaining the coordinates x_p, y_p, w_p, h_p of each candidate region and the probabilities P_is and P_not, where P_is is the probability that a target is present, P_not is the probability that no target is present, x_p and y_p are the horizontal and vertical coordinates of the center point of the candidate region, and w_p and h_p are the width and height of the candidate region;
step 6, setting a threshold P_threshold on the probability P_is that a target is present, and mapping the candidate regions whose P_is exceeds the set threshold P_threshold onto the feature map obtained in step 4;
step 7, performing region standard pooling on the regions mapped onto the feature map in step 6, pooling candidate regions of different sizes into fixed-size feature maps;
step 8, for each fixed-size feature map obtained in step 7, using the detection network to classify it and to apply regression fine-tuning to the bounding box, obtaining the classification category of the target, the probability P that the target belongs to that category, and the coordinates x, y, w, h of the target bounding box, where P is the probability that the target belongs to the category, x and y are the horizontal and vertical coordinates of the center point of the target bounding box, and w and h are the width and height of the target bounding box;
step 9, filtering detected targets with a high degree of overlap by non-maximum suppression: for each class of detected targets, calculating the ratio of the area of the intersection of the regions where the targets are located to the area of their union; when this ratio exceeds the specified threshold IOU_threshold, keeping only the detection result with the largest probability P of belonging to the category and filtering out the other detection results (a code sketch of this filtering follows the step list);
and step 10, repeating the processes from step 4 to step 9 for the M cubic structures obtained in step 3, and respectively detecting to obtain a detection result of each frame of the image sequence in the video.
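As noted in step 9, the following is a minimal sketch of the per-class non-maximum suppression used to filter overlapping detections. It is an illustration rather than the patent's implementation: boxes are assumed to be (x, y, w, h) tuples in center format, scores are the class probabilities P, and the function name nms and the default IoU threshold are hypothetical.

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box and drop any box whose IoU with a kept box exceeds the threshold."""
    def iou(a, b):
        # convert center-format (x, y, w, h) to corner coordinates
        ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
        bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
        iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
        ih = max(0.0, min(ay1, by1) - max(ay0, by0))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)  # highest probability first
    keep = []
    for k in order:
        if all(iou(boxes[k], boxes[j]) <= iou_threshold for j in keep):
            keep.append(k)
    return keep  # indices of the detections that survive suppression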
In step 1, the method for obtaining the cubic structure is as follows:
decomposing the video of a training sample into a continuous sequence of N' frames; for each frame of the N'-frame sequence, taking its l preceding frames and l following frames so as to capture a certain amount of temporal context, and stacking these 2l+1 frames to form a cube structure of size W × H × (2l+1), where W is the image width, H is the image height, and 2l+1 is the number of stacked frames; when fewer than l forward or backward frames are available at the beginning or end of the image sequence, zero padding is used so that the cube size remains W × H × (2l+1); this yields N cube structures of size W × H × (2l+1), with N = N'.
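A minimal numpy sketch of the cube construction described above, under the assumption that the video has already been decomposed into a list of single-channel frames of shape (H, W); the function name build_cubes, the default l = 2 and the temporal-axis-first layout are illustrative choices, not taken from the patent.

import numpy as np

def build_cubes(frames, l=2):
    """Stack each frame with its l preceding and l following frames; zero-pad at the sequence ends."""
    n = len(frames)
    h, w = frames[0].shape
    cubes = []
    for t in range(n):
        planes = []
        for k in range(t - l, t + l + 1):
            if 0 <= k < n:
                planes.append(frames[k])
            else:
                planes.append(np.zeros((h, w), dtype=frames[0].dtype))  # zero padding at the boundaries
        cubes.append(np.stack(planes, axis=0))  # shape (2l+1, H, W), i.e. a W x H x (2l+1) cube
    return cubes  # N cubes, N = N'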
The process in step 2 is as follows:
step 21, stacking several three-dimensional convolution layers and three-dimensional pooling layers to construct the three-dimensional convolution feature extraction network, pre-training this network on a video classification task with the Sports-1M database as training samples, and taking the resulting weights as the initial weights of the three-dimensional convolution feature extraction network;
step 22, constructing a candidate area generation network by using the two-dimensional convolution layer and the full-connection layer, and using the randomly initialized weight as an initial weight of the candidate area generation network;
step 23, constructing a detection network, wherein the detection network is composed of a classification sub-network and a regression sub-network, the structures of the classification sub-network and the regression sub-network are all full connection layers, and the weight value of random initialization is used as an initial weight value;
step 24, using the N cube structures and corresponding labels obtained in step 1 to train the candidate region generation network obtained in step 22 and the three-dimensional convolution feature extraction network obtained in step 21, with training loss function L_rpn = L_P + L_reg, where L_P is the cross entropy between the target-presence probability output by the candidate region generation network and the ground-truth value in the label, and L_reg is the sum of squared differences between the coordinate offsets output by the candidate region generation network and the coordinate offsets of the target region in the label;
step 25, training the detection network obtained in step 23 and the three-dimensional convolution feature extraction network obtained in step 21, the training loss function being the weighted sum of the classification loss and the coordinate regression loss of the detection network output;
step 26, repeat steps 24 and 25 several times until the loss function in steps 24 and 25 stabilizes.
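A minimal sketch of the alternating (cross) training loop of steps 24-26. The callables rpn_step and det_step are hypothetical stand-ins for one training pass of the candidate region generation network plus feature network (step 24) and of the detection network plus feature network (step 25); the tolerance-based stopping rule is one assumed reading of "until the loss functions stabilize".

def cross_train(rpn_step, det_step, max_rounds=10000, tol=1e-4):
    """Alternate the two training stages until the combined loss stops changing."""
    prev = float("inf")
    for _ in range(max_rounds):
        l_rpn = rpn_step()   # step 24: loss L_rpn = L_P + L_reg
        l_det = det_step()   # step 25: weighted classification + coordinate regression loss
        total = l_rpn + l_det
        if abs(prev - total) < tol:  # step 26: stop once the losses have stabilized
            break
        prev = total
    return prev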
In the step 4, the structure of the three-dimensional convolution feature extraction network is as follows:
the overall structure of the three-dimensional convolution feature extraction network consists of several nested three-dimensional convolution layers and three-dimensional pooling layers; the convolution kernel of a three-dimensional convolution is a tensor with three dimensions (length, width and height); in the output feature map, the response output at spatial coordinates (a, b, c) is calculated by:
H_{abc} = f\left(\sum_{i=0}^{s_w-1}\sum_{j=0}^{s_h-1}\sum_{g=0}^{s_l-1} W_{ijg}\, X_{(a+i)(b+j)(c+g)} + v\right)
in the above formula, W_ijg is the weight of the convolution kernel at position (i, j, g), X_(a+i)(b+j)(c+g) is the value of the input cube at position (a+i, b+j, c+g), v is the bias term, s_w, s_h and s_l are the width, height and length of the three-dimensional convolution kernel, H_abc is the response output at spatial coordinates (a, b, c), and f is the activation function;
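A direct (naive) numpy implementation of the response formula above for a single input cube and a single kernel; a ReLU is assumed for the activation f purely for illustration, and in practice an optimized three-dimensional convolution layer would be used instead of explicit loops.

import numpy as np

def conv3d_response(X, W, v=0.0, f=lambda t: np.maximum(t, 0.0)):
    """X: input cube; W: kernel of shape (s_w, s_h, s_l); v: bias; f: activation (ReLU assumed here)."""
    sw, sh, sl = W.shape
    A, B, C = X.shape[0] - sw + 1, X.shape[1] - sh + 1, X.shape[2] - sl + 1
    H = np.empty((A, B, C))
    for a in range(A):
        for b in range(B):
            for c in range(C):
                # H_abc = f( sum_{i,j,g} W_ijg * X_(a+i)(b+j)(c+g) + v )
                H[a, b, c] = f(np.sum(W * X[a:a + sw, b:b + sh, c:c + sl]) + v)
    return H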
in step 5, the process of generating the candidate region is as follows:
step 51, for the feature map obtained by the three-dimensional convolution feature extraction network in the step 4, sliding on the feature map by using a two-dimensional convolution kernel with the size of 3 × 3, performing convolution calculation, and obtaining a 512-dimensional vector at each position where the convolution kernel slides;
step 52, setting 9 anchor boxes as references at each sliding position of the convolution kernel, with aspect ratios of 1:2, 1:1 and 2:1 and with three area sizes of 128², 256² and 512² pixels, the center point of each anchor box being the center of the sliding window;
step 53, passing the 512-dimensional vector obtained at each sliding position in step 51 through a fully connected network to output nine 6-dimensional vectors, which represent the offsets d_x, d_y, d_h, d_w of the center point coordinates, height and width of the candidate region with respect to the anchor boxes set in step 52 and the probabilities P_is, P_not that a target is present or absent, where d_x = (x_p - x_a)/w_a, d_y = (y_p - y_a)/h_a, d_h = log(h_p/h_a), d_w = log(w_p/w_a); x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region, x_a, y_a, h_a, w_a are the center point coordinates, height and width of the anchor box, and P_is, P_not are normalized with a softmax function to represent the probability that a target is or is not present;
step 54, from the offsets d_x, d_y, d_h, d_w obtained in step 53 and the center point coordinates, height and width x_a, y_a, h_a, w_a of the anchor boxes set in step 52, calculating the actual center point coordinates, width and height x_p, y_p, w_p, h_p of the generated candidate region.
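A small sketch of step 54, inverting the offset parameterization d_x = (x_p - x_a)/w_a, d_y = (y_p - y_a)/h_a, d_h = log(h_p/h_a), d_w = log(w_p/w_a) to recover the candidate region from an anchor box; the function and argument names are illustrative.

import math

def decode_candidate(offsets, anchor):
    """offsets = (d_x, d_y, d_h, d_w); anchor = (x_a, y_a, h_a, w_a); returns (x_p, y_p, w_p, h_p)."""
    dx, dy, dh, dw = offsets
    xa, ya, ha, wa = anchor
    xp = dx * wa + xa
    yp = dy * ha + ya
    hp = ha * math.exp(dh)
    wp = wa * math.exp(dw)
    return xp, yp, wp, hp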
In step 7, the region standard pooling process is as follows:
step 71, denoting the size of the region to be pooled as m × n, dividing the region to be pooled into 7 × 7 small cells of size approximately m/7 × n/7, and rounding approximately when m/7 or n/7 is not an integer;
step 72, within each small cell divided in step 71, max-pooling the features in the cell into a 1 × 1 value, so that feature regions of different sizes are pooled into fixed-size 7 × 7 feature maps;
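A minimal single-channel numpy sketch of the region standard pooling of steps 71-72: the m × n region is split into an approximately even 7 × 7 grid and each cell is max-pooled, so regions of any size map to a fixed 7 × 7 output. The integer cell boundaries are one reasonable reading of "rounding approximately" and are an assumption of this sketch.

import numpy as np

def region_standard_pool(region, out_size=7):
    """region: 2D feature region of size m x n; returns a fixed out_size x out_size map."""
    m, n = region.shape
    pooled = np.zeros((out_size, out_size), dtype=region.dtype)
    for i in range(out_size):
        r0 = (i * m) // out_size
        r1 = max(-(-((i + 1) * m) // out_size), r0 + 1)   # ceiling division, at least one row per cell
        for j in range(out_size):
            c0 = (j * n) // out_size
            c1 = max(-(-((j + 1) * n) // out_size), c0 + 1)
            pooled[i, j] = region[r0:r1, c0:c1].max()      # max pooling inside the cell
    return pooled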
in step 8, the process of classifying the candidate region and fine-tuning the bounding box by the detection network is as follows:
step 81, flattening the fixed-size feature maps obtained in step 7 into one-dimensional vectors and inputting them into the classification sub-network and the regression sub-network respectively;
step 82, the classification sub-network outputs an (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} through two fully connected layers, where p_1, p_2, ..., p_n represent the probabilities that the candidate region belongs to each of the n target classes, p_{n+1} represents the probability that the candidate region belongs to the background, the network output layer uses a softmax function as the activation function, and n is the number of target categories to be detected;
step 83, the regression sub-network outputs a 4-dimensional vector t_x, t_y, t_w, t_h through two fully connected layers, representing the offset of the target bounding box with respect to the candidate region: t_x = (x - x_p)/w_a, t_y = (y - y_p)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), where x, y are the horizontal and vertical coordinates of the center point of the target bounding box, w, h are the width and height of the target bounding box, and x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region;
step 84, finding the maximum of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} obtained in step 82; if the maximum is p_{n+1}, the candidate region is background and no output is produced; otherwise, the category of the target is decided by the maximum value, the bounding box coordinates x, y, w, h of the target are calculated from {t_x, t_y, t_w, t_h}, and the maximum value of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} is taken as the probability P that the target belongs to that category.
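A sketch of the decision logic of step 84 combined with the inverse of the step-83 parameterization t_x = (x - x_p)/w_a, t_y = (y - y_p)/h_a, t_w = log(w/w_a), t_h = log(h/h_a). Passing the anchor width and height (w_a, h_a) alongside the candidate box follows the formulas as written; all names are illustrative.

import math

def decode_detection(probs, t, candidate, anchor_size):
    """probs: n+1 class probabilities (last = background); t = (t_x, t_y, t_w, t_h); candidate = (x_p, y_p, w_p, h_p)."""
    cls = max(range(len(probs)), key=lambda k: probs[k])
    if cls == len(probs) - 1:
        return None                      # background: the candidate region is not output
    tx, ty, tw, th = t
    xp, yp = candidate[0], candidate[1]
    wa, ha = anchor_size
    x = tx * wa + xp                     # invert the bounding-box regression
    y = ty * ha + yp
    w = wa * math.exp(tw)
    h = ha * math.exp(th)
    return cls, probs[cls], (x, y, w, h)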
In summary, the method for accurate target detection in video based on three-dimensional convolution according to the present invention comprises: performing fusion training of the whole network with a cross-training method; extracting features with a three-dimensional convolution network so that context information from the preceding and following frames is fused; generating candidate regions with a candidate region generation network; pooling the candidate regions to a standard size with a region standard pooling method; classifying each candidate region and applying regression fine-tuning to its bounding box; and filtering redundant detection results with non-maximum suppression. The detection method detects targets with a two-stage detection framework based on a candidate region generation network; in the feature extraction process, to make full use of the temporal information of the image sequence in the video, several frames before and after the frame to be detected are stacked into a cube and features are extracted with a three-dimensional convolution network, thereby achieving accurate target detection in video.
Compared with the prior art, the invention has the following advantages: unlike a single-frame image, the image sequence in a video has temporal ordering between consecutive frames and therefore contains rich temporal information. The method introduces a three-dimensional convolution network into target detection in video to extract the temporal information of the continuous frame sequence, fully mining and utilizing the temporal features between consecutive frames and thereby achieving accurate target detection in video.
Drawings
Fig. 1 is a schematic flow chart of the implementation of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention relates to a target detection method in video based on a three-dimensional convolution network, which comprises: performing fusion training of the whole network with a cross-training method; extracting features with a three-dimensional convolution network so that context information from the preceding and following frames is fused; generating candidate regions with a candidate region generation network; pooling the candidate regions to a standard size with a region standard pooling method; classifying each candidate region and applying regression fine-tuning to its bounding box; and filtering redundant detection results with non-maximum suppression. The detection method detects targets with a two-stage detection framework based on a candidate region generation network; in the feature extraction process, to make full use of the temporal information of the image sequence in the video, several frames before and after the frame to be detected are stacked into a cube and features are extracted with a three-dimensional convolution network, thereby achieving accurate target detection in video.
As shown in fig. 1, the present invention specifically implements the following steps:
step 1, reading the videos of the training samples and the corresponding labels from a database, decomposing each training video into a continuous sequence of N' frames, and stacking each frame of the image sequence with an equal number of preceding and following frames to obtain N cube structures, where N = N';
step 2, constructing a three-dimensional convolution feature extraction network, a candidate region generation network and a detection network, and, using the N cube structures and the corresponding labels obtained in step 1, performing fusion training on these three networks with a cross-training method to obtain a three-dimensional convolution feature extraction network, candidate region generation network and detection network that can be used for target detection in the video;
step 3, reading the video to be detected, decomposing it into a continuous sequence of M' frames, and stacking each frame with several preceding and following frames to obtain M cube structures, where M = M';
step 4, selecting one of the M cube structures obtained in step 3 and extracting its features with the three-dimensional convolution feature extraction network to obtain the corresponding feature map;
step 5, inputting the feature map obtained in step 4 into the candidate region generation network and predicting candidate regions that may contain a target, obtaining the coordinates x_p, y_p, w_p, h_p of each candidate region and the probabilities P_is and P_not, where P_is is the probability that a target is present, P_not is the probability that no target is present, x_p and y_p are the horizontal and vertical coordinates of the center point of the candidate region, and w_p and h_p are the width and height of the candidate region;
step 6, setting a threshold P_threshold on the probability P_is that a target is present, and mapping the candidate regions whose P_is exceeds the set threshold P_threshold onto the feature map of step 4;
step 7, performing region standard pooling on the regions mapped onto the feature map in step 6, pooling candidate regions of different sizes into fixed-size feature maps;
step 8, for each fixed-size feature map obtained in step 7, using the detection network to classify it and to apply regression fine-tuning to the bounding box, obtaining the classification category of the target, the probability P that the target belongs to that category, and the coordinates x, y, w, h of the target bounding box, where P is the probability that the target belongs to the category, x and y are the horizontal and vertical coordinates of the center point of the target bounding box, and w and h are the width and height of the target bounding box;
step 9, filtering detected targets with a high degree of overlap by non-maximum suppression: for each class of detected targets, calculating the ratio of the area of the intersection of the regions where the targets are located to the area of their union; when this ratio exceeds the specified threshold IOU_threshold, keeping only the detection result with the largest probability P of belonging to the category and filtering out the other detection results;
and step 10, repeating the processes from step 4 to step 9 for the M cubic structures obtained in step 3, and respectively detecting to obtain a detection result of each frame of the image sequence in the video.
In step 1, the method for obtaining the cubic structure is as follows:
decomposing the video of a training sample into a continuous sequence of N' frames; for each frame of the N'-frame sequence, taking its 2 preceding frames and 2 following frames so as to capture a certain amount of temporal context, and stacking these 5 frames to form a cube structure of size W × H × 5, where W is the image width, H is the image height, and 5 is the number of stacked frames; when fewer than 2 forward or backward frames are available at the beginning or end of the image sequence, zero padding is used so that the cube size remains W × H × 5, yielding N cube structures of size W × H × 5, with N = N';
the process in step 2 is as follows:
step 21, stacking 5 three-dimensional convolution layers and three-dimensional pooling layers to construct the three-dimensional convolution feature extraction network, pre-training this network on a video classification task with the Sports-1M database as training samples, and taking the resulting weights as the initial weights of the three-dimensional convolution feature extraction network;
step 22, constructing a candidate area generation network by using a layer of two-dimensional convolution layer and two layers of full connection layers, and using a randomly initialized weight as an initial weight of the candidate area generation network;
step 23, constructing a detection network, wherein the detection network consists of a classification sub-network and a regression sub-network, the classification sub-network and the regression sub-network are both two fully-connected layers, and the randomly initialized weight is used as an initial weight;
step 24, using the N cube structures and corresponding labels obtained in step 1 to train the candidate region generation network obtained in step 22 and the three-dimensional convolution feature extraction network obtained in step 21, with training loss function L_rpn = L_P + L_reg, where L_P is the cross entropy between the target-presence probability output by the candidate region generation network and the ground-truth value in the label, and L_reg is the sum of squared differences between the coordinate offsets output by the candidate region generation network and the coordinate offsets of the target region in the label;
step 25, training the detection network obtained in step 23 and the three-dimensional convolution feature extraction network obtained in step 21, the training loss function being the weighted sum of the classification loss and the coordinate regression loss of the detection network output;
step 26, repeating step 24 and step 25 for 10000 times in total until the loss function in step 24 and step 25 is stable;
in the step 4, the structure of the three-dimensional convolution feature extraction network is as follows:
the overall structure of the three-dimensional convolution feature extraction network consists of 5 nested three-dimensional convolution layers and 5 three-dimensional pooling layers; the convolution kernel of a three-dimensional convolution is a tensor with three dimensions (length, width and height); in the output feature map, the response output at spatial coordinates (a, b, c) is calculated by:
H_{abc} = f\left(\sum_{i=0}^{s_w-1}\sum_{j=0}^{s_h-1}\sum_{g=0}^{s_l-1} W_{ijg}\, X_{(a+i)(b+j)(c+g)} + v\right)
in the above formula, W_ijg is the weight of the convolution kernel at position (i, j, g), X_(a+i)(b+j)(c+g) is the value of the input cube at position (a+i, b+j, c+g), v is the bias term, s_w, s_h and s_l are the width, height and length of the three-dimensional convolution kernel, H_abc is the response output at spatial coordinates (a, b, c), and f is the activation function;
in step 5, the process of generating the candidate region is as follows:
step 51, for the feature map obtained by the three-dimensional convolution feature extraction network in the step 4, sliding on the feature map by using a two-dimensional convolution kernel with the size of 3 × 3, performing convolution calculation, and obtaining a 512-dimensional vector at each position where the convolution kernel slides;
step 52, setting 9 anchor boxes as references at each sliding position of the convolution kernel, with aspect ratios of 1:2, 1:1 and 2:1 and with three area sizes of 128², 256² and 512² pixels, the center point of each anchor box being the center of the sliding window;
step 53, passing the 512-dimensional vector obtained at each sliding position in step 51 through a fully connected network to output nine 6-dimensional vectors, which represent the offsets d_x, d_y, d_h, d_w of the center point coordinates, height and width of the candidate region with respect to the anchor boxes set in step 52 and the probabilities P_is, P_not that a target is present or absent, where d_x = (x_p - x_a)/w_a, d_y = (y_p - y_a)/h_a, d_h = log(h_p/h_a), d_w = log(w_p/w_a); x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region, x_a, y_a, h_a, w_a are the center point coordinates, height and width of the anchor box, and P_is, P_not are normalized with a softmax function to represent the probability that a target is or is not present;
step 54, from the offsets d_x, d_y, d_h, d_w obtained in step 53 and the center point coordinates, height and width x_a, y_a, h_a, w_a of the anchor boxes set in step 52, calculating the actual center point coordinates, width and height x_p, y_p, w_p, h_p of the generated candidate region;
In step 7, the regional standard pooling process is as follows:
step 71, denoting the size of the region to be pooled as m × n, dividing the region to be pooled into 7 × 7 small cells of size approximately m/7 × n/7, and rounding approximately when m/7 or n/7 is not an integer;
step 72, within each small cell divided in step 71, max-pooling the features in the cell into a 1 × 1 value, so that feature regions of different sizes are pooled into fixed-size 7 × 7 feature maps;
in step 8, the process of classifying the candidate region and fine-tuning the bounding box by the detection network is as follows:
step 81, flattening the fixed-size feature maps obtained in step 7 into one-dimensional vectors and inputting them into the classification sub-network and the bounding box regression sub-network respectively;
step 82, the classification sub-network outputs an (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} through two fully connected layers, where p_1, p_2, ..., p_n represent the probabilities that the candidate region belongs to each of the n target classes, p_{n+1} represents the probability that the candidate region belongs to the background, the network output layer uses a softmax function as the activation function, and n is the number of target categories to be detected;
step 83, the regression sub-network outputs a 4-dimensional vector t_x, t_y, t_w, t_h through two fully connected layers, representing the offset of the target bounding box with respect to the candidate region: t_x = (x - x_p)/w_a, t_y = (y - y_p)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), where x, y are the horizontal and vertical coordinates of the center point of the target bounding box, w, h are the width and height of the target bounding box, and x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region;
step 84, finding the maximum of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} obtained in step 82; if the maximum is p_{n+1}, the candidate region is background and no output is produced; otherwise, the category of the target is decided by the maximum value, the bounding box coordinates x, y, w, h of the target are calculated from {t_x, t_y, t_w, t_h}, and the maximum value of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} is taken as the probability P that the target belongs to that category.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A target detection method in a video based on a three-dimensional convolution network is characterized by comprising the following steps:
step 1, reading the videos of the training samples and the corresponding labels from a database, decomposing each training video into a continuous sequence of N' frames, and stacking each frame of the image sequence with an equal number of preceding and following frames to obtain N cube structures, where N = N';
step 2, constructing a three-dimensional convolution feature extraction network, a candidate area generation network and a detection network, using the N cube structures and the corresponding labels obtained in the step 1, and performing fusion training on the three-dimensional convolution feature extraction network, the candidate area generation network and the detection network by using a cross training method to obtain the three-dimensional convolution feature extraction network, the candidate area generation network and the detection network for target detection in the video;
step 3, reading the video to be detected, decomposing it into a continuous sequence of M' frames, and stacking each frame with several preceding and following frames to obtain M cube structures, where M = M';
step 4, taking one of the M cube structures obtained in step 3 and extracting its features with the three-dimensional convolution feature extraction network to obtain the corresponding feature map;
step 5, inputting the feature map obtained in step 4 into the candidate region generation network and predicting candidate regions that may contain a target, obtaining the coordinates x_p, y_p, w_p, h_p of each candidate region and the probabilities P_is and P_not, where P_is is the probability that a target is present, P_not is the probability that no target is present, x_p and y_p are the horizontal and vertical coordinates of the center point of the candidate region, and w_p and h_p are the width and height of the candidate region;
step 6, setting a threshold P_threshold on the probability P_is that a target is present, and mapping the candidate regions whose P_is exceeds the set threshold P_threshold onto the feature map obtained in step 4;
step 7, performing region standard pooling on the regions mapped onto the feature map in step 6, pooling candidate regions of different sizes into fixed-size feature maps;
step 8, for each fixed-size feature map obtained in step 7, using the detection network to classify it and to apply regression fine-tuning to the bounding box, obtaining the classification category of the target, the probability P that the target belongs to that category, and the coordinates x, y, w, h of the target bounding box, where P is the probability that the target belongs to the category, x and y are the horizontal and vertical coordinates of the center point of the target bounding box, and w and h are the width and height of the target bounding box;
step 9, filtering detected targets with a high degree of overlap by non-maximum suppression: for each class of detected targets, calculating the ratio of the area of the intersection of the regions where the targets are located to the area of their union; when this ratio exceeds the specified threshold IOU_threshold, keeping only the detection result with the largest probability P of belonging to the category and filtering out the other detection results;
step 10, repeating the processes from step 4 to step 9 for the M cubic structures obtained in step 3, and respectively detecting to obtain a detection result of each frame of the image sequence in the video;
the process of the step 2 is as follows:
step 21, stacking several three-dimensional convolution layers and three-dimensional pooling layers to construct the three-dimensional convolution feature extraction network, pre-training this network on a video classification task with the Sports-1M database as training samples, and taking the resulting weights as the initial weights of the three-dimensional convolution feature extraction network;
step 22, constructing a candidate area generation network by using the two-dimensional convolution layer and the full-connection layer, and using the randomly initialized weight as an initial weight of the candidate area generation network;
step 23, constructing a detection network, wherein the detection network is composed of a classification sub-network and a regression sub-network, the structures of the classification sub-network and the regression sub-network are all full connection layers, and the weight value of random initialization is used as an initial weight value;
step 24, using the N cube structures and corresponding labels obtained in step 1 to train the candidate region generation network obtained in step 22 and the three-dimensional convolution feature extraction network obtained in step 21, with training loss function L_rpn = L_P + L_reg, where L_P is the cross entropy between the target-presence probability output by the candidate region generation network and the ground-truth value in the label, and L_reg is the sum of squared differences between the coordinate offsets output by the candidate region generation network and the coordinate offsets of the target region in the label;
step 25, training the detection network obtained in step 23 and the three-dimensional convolution feature extraction network obtained in step 21, the training loss function being the weighted sum of the classification loss and the coordinate regression loss of the detection network output;
step 26, repeating step 24 and step 25 several times until the loss functions in step 24 and step 25 are stable;
in step 5, the process of generating the candidate region is as follows:
step 51, for the feature map obtained by the three-dimensional convolution feature extraction network in the step 4, sliding on the feature map by using a two-dimensional convolution kernel with the size of 3 × 3, performing convolution calculation, and obtaining a 512-dimensional vector at each position where the convolution kernel slides;
step 52, setting 9 anchor boxes as references at each sliding position of the convolution kernel, with aspect ratios of 1:2, 1:1 and 2:1 and with three area sizes of 128², 256² and 512² pixels, the center point of each anchor box being the center of the sliding window;
step 53, passing the 512-dimensional vector obtained at each sliding position in step 51 through a fully connected network to output nine 6-dimensional vectors, which represent the offsets d_x, d_y, d_h, d_w of the center point coordinates, height and width of the candidate region with respect to the anchor boxes set in step 52 and the probabilities P_is, P_not that a target is present or absent, where d_x = (x_p - x_a)/w_a, d_y = (y_p - y_a)/h_a, d_h = log(h_p/h_a), d_w = log(w_p/w_a); x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region, x_a, y_a, h_a, w_a are the center point coordinates, height and width of the anchor box, and P_is, P_not are normalized with a softmax function to represent the probability that a target is or is not present;
step 54, from the offsets d_x, d_y, d_h, d_w obtained in step 53 and the center point coordinates, height and width x_a, y_a, h_a, w_a of the anchor boxes set in step 52, calculating the actual center point coordinates, width and height x_p, y_p, w_p, h_p of the generated candidate region.
2. The method for detecting the target in the video based on the three-dimensional convolutional network as claimed in claim 1, wherein: in step 1, the method for obtaining the cubic structure is as follows:
decomposing the video of a training sample into a continuous sequence of N' frames; for each frame of the N'-frame sequence, taking its l preceding frames and l following frames so as to capture a certain amount of temporal context, and stacking these 2l+1 frames to form a cube structure of size W × H × (2l+1), where W is the image width, H is the image height, and 2l+1 is the number of stacked frames; when fewer than l forward or backward frames are available at the beginning or end of the image sequence, zero padding is used so that the cube size remains W × H × (2l+1); this yields N cube structures of size W × H × (2l+1), with N = N'.
3. The method for detecting the target in the video based on the three-dimensional convolutional network as claimed in claim 1, wherein: in the step 4, the structure of the three-dimensional convolution feature extraction network is as follows:
the overall structure of the three-dimensional convolution feature extraction network consists of several nested three-dimensional convolution layers and three-dimensional pooling layers; the convolution kernel of a three-dimensional convolution is a tensor with three dimensions (length, width and height); in the output feature map, the response output at spatial coordinates (a, b, c) is calculated by:
H_{abc} = f\left(\sum_{i=0}^{s_w-1}\sum_{j=0}^{s_h-1}\sum_{g=0}^{s_l-1} W_{ijg}\, X_{(a+i)(b+j)(c+g)} + v\right)
in the above formula, W_ijg is the weight of the convolution kernel at position (i, j, g), X_(a+i)(b+j)(c+g) is the value of the input cube at position (a+i, b+j, c+g), v is the bias term, s_w, s_h and s_l are the width, height and length of the three-dimensional convolution kernel, H_abc is the response output at spatial coordinates (a, b, c), and f is the activation function.
4. The method for detecting the target in the video based on the three-dimensional convolutional network as claimed in claim 1, wherein: in step 7, the regional standard pooling process is as follows:
step 71, denoting the size of the region to be pooled as m × n, dividing the region to be pooled into 7 × 7 small cells of size approximately m/7 × n/7, and rounding approximately when m/7 or n/7 is not an integer;
step 72, within each small cell divided in step 71, max-pooling the features in the cell into a 1 × 1 value, so that feature regions of different sizes are pooled into fixed-size 7 × 7 feature maps.
5. The method for detecting the target in the video based on the three-dimensional convolutional network as claimed in claim 1, wherein: in step 8, the process of classifying the candidate region and fine-tuning the bounding box by the detection network is as follows:
step 81, flattening the fixed-size feature maps obtained in step 7 into one-dimensional vectors and inputting them into the classification sub-network and the regression sub-network respectively;
step 82, the classification sub-network outputs an (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} through two fully connected layers, where p_1, p_2, ..., p_n represent the probabilities that the candidate region belongs to each of the n target classes, p_{n+1} represents the probability that the candidate region belongs to the background, the network output layer uses a softmax function as the activation function, and n is the number of target categories to be detected;
step 83, the bounding box regression sub-network outputs a 4-dimensional vector t_x, t_y, t_w, t_h through two fully connected layers, representing the offset of the target bounding box with respect to the candidate region: t_x = (x - x_p)/w_a, t_y = (y - y_p)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), where x, y are the horizontal and vertical coordinates of the center point of the target bounding box, w, h are the width and height of the target bounding box, and x_p, y_p, w_p, h_p are the center point coordinates, width and height of the candidate region;
step 84, finding the maximum of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} obtained in step 82; if the maximum is p_{n+1}, the candidate region is background and no output is produced; otherwise, the category of the target is decided by the maximum value, the bounding box coordinates x, y, w, h of the target are calculated from {t_x, t_y, t_w, t_h}, and the maximum value of the (n+1)-dimensional vector {p_1, p_2, ..., p_{n+1}} is taken as the probability P that the target belongs to that category.
CN201910041920.0A 2019-01-16 2019-01-16 Target detection method in video based on three-dimensional convolution network Active CN109829398B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910041920.0A CN109829398B (en) 2019-01-16 2019-01-16 Target detection method in video based on three-dimensional convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910041920.0A CN109829398B (en) 2019-01-16 2019-01-16 Target detection method in video based on three-dimensional convolution network

Publications (2)

Publication Number Publication Date
CN109829398A CN109829398A (en) 2019-05-31
CN109829398B true CN109829398B (en) 2020-03-31

Family

ID=66860338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910041920.0A Active CN109829398B (en) 2019-01-16 2019-01-16 Target detection method in video based on three-dimensional convolution network

Country Status (1)

Country Link
CN (1) CN109829398B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287826B (en) * 2019-06-11 2021-09-17 北京工业大学 Video target detection method based on attention mechanism
CN110264457B (en) * 2019-06-20 2020-12-15 浙江大学 Welding seam autonomous identification method based on rotating area candidate network
CN110334752B (en) * 2019-06-26 2022-11-08 电子科技大学 Irregular-shape object detection method based on trapezoidal convolution
CN110473284B (en) * 2019-07-29 2021-02-12 电子科技大学 Moving object three-dimensional model reconstruction method based on deep learning
CN110533691B (en) * 2019-08-15 2021-10-22 合肥工业大学 Target tracking method, device and storage medium based on multiple classifiers
CN111199199B (en) * 2019-12-27 2023-05-05 同济大学 Action recognition method based on self-adaptive context area selection
CN111160255B (en) * 2019-12-30 2022-07-29 成都数之联科技股份有限公司 Fishing behavior identification method and system based on three-dimensional convolution network
CN111144376B (en) * 2019-12-31 2023-12-05 华南理工大学 Video target detection feature extraction method
CN111310609B (en) * 2020-01-22 2023-04-07 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN111178344B (en) * 2020-04-15 2020-07-17 中国人民解放军国防科技大学 Multi-scale time sequence behavior identification method
CN111624659B (en) * 2020-06-05 2022-07-01 中油奥博(成都)科技有限公司 Time-varying band-pass filtering method and device for seismic data
CN112016569A (en) * 2020-07-24 2020-12-01 驭势科技(南京)有限公司 Target detection method, network, device and storage medium based on attention mechanism
CN112215123B (en) * 2020-10-09 2022-10-25 腾讯科技(深圳)有限公司 Target detection method, device and storage medium
CN112613428B (en) * 2020-12-28 2024-03-22 易采天成(郑州)信息技术有限公司 Resnet-3D convolution cattle video target detection method based on balance loss
CN112733747A (en) * 2021-01-14 2021-04-30 哈尔滨市科佳通用机电股份有限公司 Identification method, system and device for relieving falling fault of valve pull rod
CN115082713B (en) * 2022-08-24 2022-11-25 中国科学院自动化研究所 Method, system and equipment for extracting target detection frame by introducing space contrast information

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975941A (en) * 2016-05-31 2016-09-28 电子科技大学 Multidirectional vehicle model detection recognition system based on deep learning
CN107145889A (en) * 2017-04-14 2017-09-08 中国人民解放军国防科学技术大学 Target identification method based on double CNN networks with RoI ponds
CN107506740A (en) * 2017-09-04 2017-12-22 北京航空航天大学 A kind of Human bodys' response method based on Three dimensional convolution neutral net and transfer learning model
CN107808150A (en) * 2017-11-20 2018-03-16 珠海习悦信息技术有限公司 The recognition methods of human body video actions, device, storage medium and processor
CN108537286A (en) * 2018-04-18 2018-09-14 北京航空航天大学 A kind of accurate recognition methods of complex target based on key area detection
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107527363B (en) * 2016-06-20 2022-01-25 青岛海尔智能技术研发有限公司 Refrigerating device storage management system and refrigerating device
CN106203283A (en) * 2016-06-30 2016-12-07 重庆理工大学 Based on Three dimensional convolution deep neural network and the action identification method of deep video
US10366292B2 (en) * 2016-11-03 2019-07-30 Nec Corporation Translating video to language using adaptive spatiotemporal convolution feature representation with dynamic abstraction
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975941A (en) * 2016-05-31 2016-09-28 电子科技大学 Multidirectional vehicle model detection recognition system based on deep learning
CN107145889A (en) * 2017-04-14 2017-09-08 中国人民解放军国防科学技术大学 Target identification method based on double CNN networks with RoI ponds
CN107506740A (en) * 2017-09-04 2017-12-22 北京航空航天大学 A kind of Human bodys' response method based on Three dimensional convolution neutral net and transfer learning model
CN107808150A (en) * 2017-11-20 2018-03-16 珠海习悦信息技术有限公司 The recognition methods of human body video actions, device, storage medium and processor
CN108537286A (en) * 2018-04-18 2018-09-14 北京航空航天大学 A kind of accurate recognition methods of complex target based on key area detection
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Using Gabor Filter in 3D Convolutional Neural Networks for Human Action Recognition; Li, Jiakun et al.; Proceedings of the 36th Chinese Control Conference (CCC 2017); 2017-07-28; pp. 11139-11144 *
A Violence Video Detection Method Based on a Three-Dimensional Convolutional Network (一种基于三维卷积网络的暴力视频检测方法); Song Wei et al.; 《技术研究》 (Technology Research); 2017-12-31, No. 12; pp. 54-60 *

Also Published As

Publication number Publication date
CN109829398A (en) 2019-05-31

Similar Documents

Publication Publication Date Title
CN109829398B (en) Target detection method in video based on three-dimensional convolution network
CN108510467B (en) SAR image target identification method based on depth deformable convolution neural network
US9846946B2 (en) Objection recognition in a 3D scene
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN112084869B (en) Compact quadrilateral representation-based building target detection method
CN111462200A (en) Cross-video pedestrian positioning and tracking method, system and equipment
CN107767400B (en) Remote sensing image sequence moving target detection method based on hierarchical significance analysis
EP2874097A2 (en) Automatic scene parsing
CN110309842B (en) Object detection method and device based on convolutional neural network
JP6397379B2 (en) CHANGE AREA DETECTION DEVICE, METHOD, AND PROGRAM
CN112435338B (en) Method and device for acquiring position of interest point of electronic map and electronic equipment
JP5833507B2 (en) Image processing device
JP6095817B1 (en) Object detection device
CN108229416A (en) Robot SLAM methods based on semantic segmentation technology
CN110929649B (en) Network and difficult sample mining method for small target detection
CN114972968A (en) Tray identification and pose estimation method based on multiple neural networks
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN111882586A (en) Multi-actor target tracking method oriented to theater environment
CN111323024A (en) Positioning method and device, equipment and storage medium
CN114926747A (en) Remote sensing image directional target detection method based on multi-feature aggregation and interaction
Ferguson et al. A 2d-3d object detection system for updating building information models with mobile robots
CN112396036A (en) Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction
CN116109950A (en) Low-airspace anti-unmanned aerial vehicle visual detection, identification and tracking method
CN112926426A (en) Ship identification method, system, equipment and storage medium based on monitoring video
CN108256444B (en) Target detection method for vehicle-mounted vision system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant