CN110414367B - Time sequence behavior detection method based on GAN and SSN - Google Patents

Time sequence behavior detection method based on GAN and SSN

Info

Publication number
CN110414367B
Authority
CN
China
Prior art keywords
network
behavior
proposal
sub
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910599488.7A
Other languages
Chinese (zh)
Other versions
CN110414367A (en)
Inventor
李致远
桑农
张士伟
高常鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201910599488.7A
Publication of CN110414367A
Application granted
Publication of CN110414367B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/757 - Matching configurations of points or features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a time sequence behavior detection method based on a GAN and an SSN, belonging to the technical field of computer vision. The method comprises the following steps: performing frame extraction and optical-flow computation on the video data, and normalizing and augmenting each frame image or optical-flow image; selecting continuous temporal regions containing action segments in the video data as proposals, and using the frame images corresponding to the selected proposals as a training set and a test set; constructing a time sequence behavior detection model comprising a structured segment network and a generative adversarial network; inputting the training set and the test set into the model for training to obtain a trained time sequence behavior detection model; and inputting the video to be recognized into the trained model to obtain the behavior categories present in the video and the start and end positions corresponding to each behavior. The invention improves the network's ability to distinguish background from behaviors and achieves higher accuracy in detecting time sequence behaviors in video.

Description

Time sequence behavior detection method based on GAN and SSN
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a time sequence behavior detection method based on GAN and SSN.
Background
With the rapid spread of the Internet, a huge amount of video data is generated. As one of the largest information carriers in modern society, video is growing rapidly, and making full use of this massive data is an urgent problem. Accordingly, the demand for analyzing, classifying, and recognizing video data is also growing sharply, and time sequence behavior detection has attracted increasing attention from the research community owing to its many potential applications in surveillance, video analysis, and other fields. Time sequence behavior detection is a subtask of behavior detection that detects human action instances in untrimmed and potentially very long videos; compared with behavior recognition, its predictions output not only the action category but also precise start and end time points, making it the more challenging task.
In real-world applications, video data are typically arbitrarily long in time and large in space, containing many action instances along with much irrelevant background. Two mainstream approaches to action detection have been proposed: hand-crafted features and deep features. Before CNN-based algorithms became widespread in behavior recognition, hand-crafted features achieved the best performance in the THUMOS 2014 and 2015 challenges; commonly used features include Improved Dense Trajectories (iDT) and Fisher Vectors (FV). Hand-crafted features can also be combined with deep learning to achieve highly accurate results. More recently, some studies have performed automatic feature extraction with single-frame deep neural networks, relying on 2D convolutional neural networks (CNNs) without considering motion information. However, motion information is important for modeling actions and determining temporal boundaries. To model the temporal evolution of actions, many methods generate candidate temporal segments by sliding windows or binary classification and then classify and recognize them. A disadvantage of these mainstream sliding-window frameworks is the large amount of redundant detections, which not only reduces detection accuracy but also hinders practical application.
Meanwhile, many behavior detection methods for different scenes have been proposed and have achieved high detection performance. However, most of them assume that the video is well cropped, with the action of interest lasting almost the whole duration, so they need not consider localizing action instances. Moreover, because the network itself cannot handle hard samples well during training, its ability to distinguish behaviors from background is poor.
Generally speaking, existing time sequence behavior detection methods cannot capture the subtle differences between behaviors and background, and therefore cannot effectively distinguish the two.
Disclosure of Invention
Aiming at the above defects of the prior art, the invention provides a time sequence behavior detection method based on a GAN and an SSN, so as to solve the problem that existing time sequence behavior detection distinguishes behaviors from background poorly.
In order to achieve the above object, the present invention provides a time sequence behavior detection method based on GAN and SSN, including:
(1) dividing video data into a training set and a test set, and performing frame extraction and optical flow calculation on the training set and the test set;
(2) selecting some region segments as proposals for each video, and carrying out normalization and data enhancement processing on frame images or optical flow images contained in the proposals;
(3) constructing a time sequence behavior detection model;
the time sequence behavior detection model comprises a structured segment network and a generative adversarial network;
the structured segment network is used for extracting features from the images contained in a proposal, dividing the extracted features into start-stage, behavior-stage, and end-stage features according to a set proportion, and performing classification, boundary regression, and integrity scoring based on these stage features;
the generative adversarial network is used for generating hard-example features that have the same dimension and size as the features extracted by the structured segment network and follow the same distribution as the training set, and for judging whether features are real or fake based on the generated hard-example features and the features extracted by the structured segment network;
(4) inputting the training set and the test set into the time sequence behavior detection model for training and testing to obtain a finally trained time sequence behavior detection model;
(5) and inputting the video to be recognized into the trained time sequence behavior detection model to obtain the behavior categories existing in the video and the starting position and the ending position corresponding to various behaviors.
Further, the selection of region segments as proposals for each video in step (2) specifically includes:
(2.1) randomly generating a series of proposals for each video;
(2.2) scoring the randomly generated proposal using a BNinception-based binary network;
(2.3) generating the proposal needed by the time-sequence behavior detection according to the proposal score by adopting a TAG algorithm.
Further, the step (2.3) specifically includes:
(2.3.1) inverting the proposal scores along the horizontal axis and regarding proposals whose scores fall below a set threshold as proposal basins;
(2.3.2) starting from the current proposal basin, merging subsequent basins until the proportion of the basin duration to the total duration drops to a set threshold, the total duration running from the start of the first merged basin to the end of the last merged basin;
(2.3.3) merging the basins and the separating regions between them into a single proposal;
(2.3.4) performing steps (2.3.2)-(2.3.3) for each proposal, resulting in a plurality of proposals;
(2.3.5) applying non-maximum suppression with an overlap threshold of 0.95 to obtain the proposals required for time sequence behavior detection.
Further, the structured segment network comprises a proposal segmentation sub-network, a feature extraction sub-network, a boundary regression sub-network, a classification sub-network, and an integrity judgment sub-network;
the proposal segmentation sub-network is used for expanding each selected proposal, dividing it into a plurality of sections, and randomly extracting one frame image or optical-flow image from each section; the feature extraction sub-network is used for extracting features from the extracted frame images or optical-flow images and dividing the extracted features into start-stage, behavior-stage, and end-stage features according to a set proportion; the boundary regression sub-network is used for regressing the behavior boundary locations from the start-stage, behavior-stage, and end-stage features; the classification sub-network is used for judging the behavior class from the behavior-stage features; and the integrity judgment sub-network is used for scoring behavior integrity from the start-stage, behavior-stage, and end-stage features.
Further, the feature extraction sub-network divides the extracted features into start-stage, behavior-stage, and end-stage features in a ratio of 2:5:2.
Further, the loss function shared by the classification sub-network and the integrity judgment sub-network is:

L_cls(c_i, b_i; p_i) = -log P(c_i | p_i) - 1(c_i ≥ 1) · log P(b_i | c_i, p_i)

where p_i is a proposal, c_i is its class label, and b_i indicates whether p_i is complete; the integrity term P(b_i | c_i, p_i) is used only when p_i is not considered part of the background.

The loss function of the boundary regression sub-network is:

L_reg(μ_i, φ_i; p_i) = 1(c_i ≥ 1 ∧ b_i = 1) · [SmoothL1(μ_i) + SmoothL1(φ_i)]

which is computed if and only if c_i ≥ 1 and b_i = 1, where μ_i is the relative change between the center of proposal p_i and the center of the nearest real behavior instance, and φ_i is the logarithmic-scale span of p_i relative to the nearest real behavior instance.
Further, the generative adversarial network includes a generator and a discriminator;
the generator is used for generating hard-example features that have the same dimension and size as the output of the feature extraction sub-network in the structured segment network and follow the same distribution as the training set; the discriminator is used for judging whether a feature is real or fake based on the hard-example features produced by the generator and the features extracted by the feature extraction sub-network of the structured segment network, and meanwhile judging the behavior class of real features.
Further, the generator comprises two fully-connected layers connected in sequence; the input of the generator is a randomly drawn normally distributed vector.
Furthermore, the number of neurons in each of the two fully-connected layers is 4096, and the length of the input vector is 100.
Further, the feature matching loss of the generator is:

L_G = || E_{(x_s, y) ~ P_action}[ψ(φ(x_s))] - E_{z ~ noise}[ψ(G(z))] ||²

where φ(·) denotes the feature extraction sub-network, ψ(·) denotes the classification sub-network, G(·) denotes the generator, P_action = {(x_s, y)} is the training set of behavior windows, x_s is a behavior window, and y is its ground-truth label;

the loss function of the discriminator is:

L_D = L_real + L_fake

where L_real is the classification loss on real samples and L_fake is the loss on generated fake samples:

L_real = E_{(x_s, y) ~ P_action}[-log P_D(y | x_s)] + E_{x_ns}[-log P_D(K+1 | x_ns)]

L_fake = E_{z ~ noise}[-log P_D(K+2 | G(z))]

the first term of L_real being the expectation of discriminating a behavior window as its behavior class and the second term the expectation of discriminating a background window as background; {o_1, ..., o_{K+2}} is the prediction vector, x_ns is a background window, E_{z ~ noise}[·] denotes the expectation over the noise distribution, and class K+2 represents the generated hard-example features.
Through the above technical scheme, compared with the prior art, the invention achieves the following beneficial effects:
(1) the GAN generates hard-example features that have the same dimension and size as the features extracted by the structured segment network and follow the same distribution as the training set; this improves the model's ability to recognize hard samples, enables it to capture the subtle differences between behaviors and background, strengthens its ability to distinguish the two, and thereby improves the localization accuracy of time sequence behaviors;
(2) the structured segment network processes each proposal in stages, giving the model contextual awareness of the behavior actions in the video and ensuring its ability to recognize them.
Drawings
FIG. 1 is a flow chart of the GAN- and SSN-based time sequence behavior detection method of the present invention;
FIG. 2 is a diagram of the time sequence behavior detection model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, an embodiment of the present invention provides a time sequence behavior detection method based on GAN and SSN, including:
(1) dividing video data into a training set and a test set, and performing frame extraction and optical flow calculation on the training set and the test set;
(2) selecting some region segments as proposals for each video, and carrying out normalization and data enhancement processing on frame images or optical flow images contained in the proposals;
Specifically, selecting region segments as proposals for each video includes:
(2.1) randomly generating a series of proposals for the video data;
(2.2) scoring the randomly generated proposal using a BNinception-based binary network;
Specifically, 12 proposals are taken for each video, with a foreground-to-background ratio of fg:bg = 3:9 (a proposal with overlap degree > 0.7 is regarded as foreground; one with overlap degree < 0.7 as background), and the network parameters are set as: batch size = 3, learning rate = 0.0001. The basic idea is to find contiguous temporal regions containing the most highly active segments as proposals and then use the TAG algorithm to generate the proposals needed for time sequence behavior detection; a combined sketch of the grouping and sampling steps follows step (2.3.5) below.
(2.3) generating the proposal needed by the time-sequence behavior detection according to the proposal score by adopting a TAG algorithm.
Specifically, step (2.3) includes:
(2.3.1) inverting the proposal scores along the horizontal axis and regarding proposals whose scores fall below a set threshold as proposal basins;
(2.3.2) starting from the current proposal basin, merging subsequent basins until the proportion of the basin duration to the total duration drops to a set threshold, the total duration running from the start of the first merged basin to the end of the last merged basin;
(2.3.3) merging the basins and the separating regions between them into a single proposal;
(2.3.4) performing steps (2.3.2)-(2.3.3) for each proposal, resulting in a plurality of proposals;
(2.3.5) applying non-maximum suppression with an overlap threshold of 0.95 to obtain the proposals required for time sequence behavior detection.
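As a minimal Python sketch of steps (2.3.1)-(2.3.5) together with the 3:9 foreground/background sampling described above (the function names, the per-snippet actionness input, and the tau/gamma parameter names are illustrative assumptions, not the patent's implementation):

import random

def temporal_iou(a, b):
    # Temporal intersection-over-union between two (start, end) intervals.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def tag_group(actionness, tau=0.5, gamma=0.9):
    # Steps (2.3.1)-(2.3.4): after the score curve is inverted, high-actionness
    # runs become basins, so a basin here is a maximal run of snippets whose
    # actionness is at least tau.
    basins, start = [], None
    for t, s in enumerate(actionness):
        if s >= tau and start is None:
            start = t
        elif s < tau and start is not None:
            basins.append((start, t))
            start = None
    if start is not None:
        basins.append((start, len(actionness)))

    proposals = set()
    for i in range(len(basins)):
        covered = 0
        for j in range(i, len(basins)):
            covered += basins[j][1] - basins[j][0]
            total = basins[j][1] - basins[i][0]  # first basin start to last basin end
            if covered / total < gamma:
                break
            # basins i..j plus the gaps between them merge into one proposal
            proposals.add((basins[i][0], basins[j][1]))
    return sorted(proposals)

def nms(proposals, scores, thr=0.95):
    # Step (2.3.5): drop any proposal overlapping a higher-scoring one above thr.
    order = sorted(range(len(proposals)), key=lambda k: -scores[k])
    keep = []
    for k in order:
        if all(temporal_iou(proposals[k], proposals[j]) <= thr for j in keep):
            keep.append(k)
    return [proposals[k] for k in keep]

def sample_fg_bg(proposals, gt, n_fg=3, n_bg=9, iou_thr=0.7):
    # The 12-proposal, fg:bg = 3:9 sampling described above (IoU 0.7 split).
    fg = [p for p in proposals if any(temporal_iou(p, g) > iou_thr for g in gt)]
    bg = [p for p in proposals if all(temporal_iou(p, g) < iou_thr for g in gt)]
    return random.sample(fg, min(n_fg, len(fg))) + random.sample(bg, min(n_bg, len(bg)))

Note that each basin on its own always qualifies as a proposal (its basin fraction is 1); longer candidates are produced by absorbing later basins for as long as the basin fraction stays at or above gamma, matching steps (2.3.2)-(2.3.3).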
Each frame image or optical-flow image contained in a chosen proposal is normalized to 224 × 224 pixels and randomly horizontally flipped with a probability of 0.5.
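A minimal preprocessing sketch consistent with this step; using torchvision and the ImageNet normalization statistics is an assumption, since the patent fixes only the 224 × 224 size and the 0.5 flip probability:

from torchvision import transforms

# Normalize each sampled frame (or optical-flow image) to 224 x 224 and apply
# a random horizontal flip with probability 0.5, as described above.  The
# mean/std values are the common ImageNet statistics, assumed here.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])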
(3) Constructing a time sequence behavior detection model;
Specifically, the time sequence behavior detection model of the present invention includes a structured segment network, SSN (Structured Segment Network), and a generative adversarial network, GAN (Generative Adversarial Network);
Specifically, as shown in fig. 2, the structured segment network includes a proposal segmentation sub-network, a feature extraction sub-network, a boundary regression sub-network, a classification sub-network, and an integrity judgment sub-network;
the proposal segmentation sub-network is used for expanding each selected proposal, dividing it into a plurality of sections, and randomly extracting one frame image from each section; the feature extraction sub-network is used for extracting features from each extracted frame image and dividing the extracted features into start-stage, behavior-stage, and end-stage features in a ratio of 2:5:2, as sketched below; the boundary regression sub-network is used for regressing the behavior boundary locations from the start-stage, behavior-stage, and end-stage features; the classification sub-network is used for judging the behavior class from the behavior-stage features; the integrity judgment sub-network is used for scoring behavior integrity from the start-stage, behavior-stage, and end-stage features.
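The 2:5:2 staged division might be sketched as follows; average-pooling each stage into a single vector is a simplifying assumption standing in for the structured pooling used by the SSN:

import torch

def split_stages(snippet_feats: torch.Tensor):
    # snippet_feats: (T, D) features for the T snippets covering the expanded
    # proposal (assumes T >= 3).  The 2:5:2 ratio assigns the first 2/9 of the
    # span to the start stage, the middle 5/9 to the behavior (course) stage,
    # and the last 2/9 to the end stage.
    T = snippet_feats.size(0)
    s = max(1, round(T * 2 / 9))  # snippets in the start stage
    e = max(1, round(T * 2 / 9))  # snippets in the end stage
    start = snippet_feats[:s].mean(dim=0)
    course = snippet_feats[s:T - e].mean(dim=0)
    end = snippet_feats[T - e:].mean(dim=0)
    return start, course, end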
The generative adversarial network includes a generator and a discriminator; the generator is used for generating hard-example features that have the same dimension and size as the output of the feature extraction sub-network in the structured segment network and follow the same distribution as the training set; the discriminator is used for judging whether a feature is real or fake based on the hard-example features produced by the generator and the features extracted by the feature extraction sub-network, and meanwhile judging the behavior class of real features;
As shown in fig. 2, the generator of the present invention includes two fully-connected layers FC1 and FC2 connected in sequence; the number of neurons in each fully-connected layer is 4096, and a randomly drawn normally distributed vector of length 100 is used as the generator input, from which the hard-example features are output.
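A PyTorch sketch of this generator; the ReLU between FC1 and FC2 and the 4096-dimensional output feature are assumptions consistent with, but not fixed by, the description above:

import torch
import torch.nn as nn

class HardExampleGenerator(nn.Module):
    # Two fully-connected layers mapping a length-100 noise vector to a feature
    # with the same dimension and size as the feature extraction sub-network's
    # output, so the discriminator sees real and generated features of one shape.
    def __init__(self, noise_dim: int = 100, hidden: int = 4096, feat_dim: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(noise_dim, hidden)
        self.fc2 = nn.Linear(hidden, feat_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(z)))

# usage: draw a batch of random normally distributed vectors as the input
z = torch.randn(8, 100)
fake_features = HardExampleGenerator()(z)  # (8, 4096) hard-example features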
(4) Inputting the training set and the test set into the time sequence behavior detection model for training to obtain a trained time sequence behavior detection model;
Specifically, in the structured segment network part, the loss function is divided into a classification loss, a behavior integrity loss, and a boundary regression loss; the behavior classification sub-network and the integrity judgment sub-network jointly define a unified classification loss:

L_cls(c_i, b_i; p_i) = -log P(c_i | p_i) - 1(c_i ≥ 1) · log P(b_i | c_i, p_i)

where p_i is a proposal, c_i is its class label, and b_i indicates whether p_i is complete; the integrity term P(b_i | c_i, p_i) is used only when p_i is not considered part of the background.

The loss function of the boundary regression sub-network is:

L_reg(μ_i, φ_i; p_i) = 1(c_i ≥ 1 ∧ b_i = 1) · [SmoothL1(μ_i) + SmoothL1(φ_i)]

which is computed if and only if c_i ≥ 1 and b_i = 1, i.e., the proposal belongs to a behavior class and is complete, where μ_i is the relative change between the center of proposal p_i and the center of the nearest real behavior instance, and φ_i is the logarithmic-scale span of p_i relative to the nearest real behavior instance.
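A hedged PyTorch sketch of these losses; the tensor shapes and the per-class completeness head are assumptions modeled on the SSN formulation, not the patent's code:

import torch
import torch.nn.functional as F

def ssn_losses(cls_logits, comp_logits, reg_pred, labels, complete, reg_targets):
    # cls_logits: (N, K+1) class scores with index 0 = background;
    # comp_logits: (N, K) per-class completeness scores; reg_pred/reg_targets:
    # (N, 2) predicted and target (mu, phi); labels: (N,) class labels;
    # complete: (N,) 0/1 completeness flags.
    # classification term: -log P(c_i | p_i), over every proposal
    l_cls = F.cross_entropy(cls_logits, labels)

    # integrity term, used only when the proposal is not background (c_i >= 1)
    fg = labels >= 1
    l_comp = torch.tensor(0.0)
    if fg.any():
        comp_score = comp_logits[fg, labels[fg] - 1]  # score of the labeled class
        l_comp = F.binary_cross_entropy_with_logits(comp_score, complete[fg].float())

    # boundary regression, computed only when c_i >= 1 and b_i = 1
    reg_mask = fg & (complete == 1)
    l_reg = torch.tensor(0.0)
    if reg_mask.any():
        l_reg = F.smooth_l1_loss(reg_pred[reg_mask], reg_targets[reg_mask])
    return l_cls, l_comp, l_reg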
In the generative adversarial network part, the loss function is divided into a feature similarity loss and a classification loss. The feature matching loss of the generator is defined as:

L_G = || E_{(x_s, y) ~ P_action}[ψ(φ(x_s))] - E_{z ~ noise}[ψ(G(z))] ||²

where φ(·) denotes the feature extraction sub-network, ψ(·) denotes the classification sub-network, G(·) denotes the generator, P_action = {(x_s, y)} is the training set of behavior windows, x_s is a behavior window, and y is its ground-truth label.

The loss with which the discriminator judges whether a feature was produced by the generator is defined as:

L_D = L_real + L_fake

where L_real is the classification loss on real samples and L_fake is the loss on generated fake samples:

L_real = E_{(x_s, y) ~ P_action}[-log P_D(y | x_s)] + E_{x_ns}[-log P_D(K+1 | x_ns)]

L_fake = E_{z ~ noise}[-log P_D(K+2 | G(z))]

The first term of L_real is the expectation of discriminating a behavior window as its behavior class, and the second term is the expectation of discriminating a background window as background; {o_1, ..., o_{K+2}} is the prediction vector, x_ns is a background window, E_{z ~ noise}[·] denotes the expectation over the noise distribution, and class K+2 represents the generated hard-example features.
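A PyTorch sketch of this adversarial objective as reconstructed above; treating the (K+2)-way classifier logits as the matched responses in the feature-matching term is an assumption consistent with the defined symbols:

import torch
import torch.nn.functional as F

def discriminator_loss(disc, real_action_feats, action_labels, bg_feats, fake_feats, K):
    # disc maps a feature to K+2 logits: behavior classes at indices 0..K-1
    # (action_labels are 0-indexed), background at index K, and the generated
    # hard-example class at index K+1.
    l_real = (
        F.cross_entropy(disc(real_action_feats), action_labels)  # behavior windows
        + F.cross_entropy(disc(bg_feats),
                          torch.full((bg_feats.size(0),), K, dtype=torch.long))
    )
    l_fake = F.cross_entropy(disc(fake_feats.detach()),
                             torch.full((fake_feats.size(0),), K + 1, dtype=torch.long))
    return l_real + l_fake

def generator_loss(disc, real_action_feats, fake_feats):
    # Feature matching: align the mean classifier response on real behavior
    # features with the mean response on generated features (squared L2 norm).
    real_resp = disc(real_action_feats).mean(dim=0)
    fake_resp = disc(fake_feats).mean(dim=0)
    return (real_resp - fake_resp).pow(2).sum()

In training, discriminator and generator updates would alternate, with only the real behavior-class features contributing to the classification branch of the structured segment network.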
(5) Inputting the video to be recognized into the trained time sequence behavior detection model to obtain the behavior categories present in the video and the start and end positions corresponding to each behavior.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A time sequence behavior detection method based on GAN and SSN is characterized by comprising the following steps:
(1) dividing video data into a training set and a test set, and performing frame extraction and optical flow calculation on the training set and the test set;
(2) selecting some region segments as proposals for each video, and carrying out normalization and data enhancement processing on frame images or optical flow images contained in the proposals;
(3) constructing a time sequence behavior detection model;
the time sequence behavior detection model comprises a structured segment network and a generative adversarial network;
the structured segment network is used for extracting features from the images contained in a proposal, dividing the extracted features into start-stage, behavior-stage, and end-stage features according to a set proportion, and performing classification, boundary regression, and integrity scoring based on these stage features; the structured segment network comprises a proposal segmentation sub-network, a feature extraction sub-network, a boundary regression sub-network, a classification sub-network, and an integrity judgment sub-network;
the proposal segmentation sub-network is used for expanding each selected proposal, dividing it into a plurality of sections, and randomly extracting one frame image or optical-flow image from each section; the feature extraction sub-network is used for extracting features from the extracted frame images or optical-flow images and dividing the extracted features into start-stage, behavior-stage, and end-stage features according to a set proportion; the boundary regression sub-network is used for regressing the behavior boundary locations from the start-stage, behavior-stage, and end-stage features; the classification sub-network is used for judging the video behavior class from the behavior-stage features; the integrity judgment sub-network is used for scoring behavior integrity from the start-stage, behavior-stage, and end-stage features;
the generative adversarial network is used for generating hard-example features that have the same dimension and size as the features extracted by the structured segment network and follow the same distribution as the training set, and for judging whether features are real or fake based on the generated hard-example features and the features extracted by the structured segment network;
(4) inputting the training set and the test set into the time sequence behavior detection model for training and testing to obtain a finally trained time sequence behavior detection model;
(5) and inputting the video to be recognized into the trained time sequence behavior detection model to obtain the behavior categories existing in the video and the starting position and the ending position corresponding to various behaviors.
2. The method according to claim 1, wherein the step (2) of selecting some region segments for each video as proposals specifically comprises:
(2.1) randomly generating a series of proposals for each video;
(2.2) scoring the randomly generated proposal using a BNinception-based binary network;
(2.3) generating the proposal needed by the time-sequence behavior detection according to the proposal score by adopting a TAG algorithm.
3. The method according to claim 2, wherein the step (2.3) specifically comprises:
(2.3.1) inverting the proposal scores along the horizontal axis and regarding proposals whose scores fall below a set threshold as proposal basins;
(2.3.2) starting from the current proposal basin, merging subsequent basins until the proportion of the basin duration to the total duration drops to a set threshold, the total duration running from the start of the first merged basin to the end of the last merged basin;
(2.3.3) merging the basins and the separating regions between them into a single proposal;
(2.3.4) performing steps (2.3.2)-(2.3.3) for each proposal, resulting in a plurality of proposals;
(2.3.5) applying non-maximum suppression with an overlap threshold of 0.95 to obtain the proposals required for time sequence behavior detection.
4. The method of claim 1, wherein the feature extraction sub-network divides the extracted features into start-stage, behavior-stage, and end-stage features in a ratio of 2:5:2.
5. The method of claim 1, wherein the loss function shared by the classification sub-network and the integrity judgment sub-network is:

L_cls(c_i, b_i; p_i) = -log P(c_i | p_i) - 1(c_i ≥ 1) · log P(b_i | c_i, p_i)

where p_i is a proposal, c_i is its class label, and b_i indicates whether p_i is complete; the integrity term P(b_i | c_i, p_i) is used only when p_i is not considered part of the background;

and wherein the loss function of the boundary regression sub-network is:

L_reg(μ_i, φ_i; p_i) = 1(c_i ≥ 1 ∧ b_i = 1) · [SmoothL1(μ_i) + SmoothL1(φ_i)]

which is computed if and only if c_i ≥ 1 and b_i = 1, where μ_i is the relative change between the center of proposal p_i and the center of the nearest real behavior instance, and φ_i is the logarithmic-scale span of p_i relative to the nearest real behavior instance.
6. The GAN and SSN-based time-series behavior detection method of any one of claims 1-5, wherein the generative adversarial network comprises a generator and a discriminator;
the generator is used for generating hard-example features that have the same dimension and size as the output of the feature extraction sub-network in the structured segment network and follow the same distribution as the training set; the discriminator is used for judging whether a feature is real or fake based on the hard-example features produced by the generator and the features extracted by the feature extraction sub-network, and meanwhile judging the behavior class of real features.
7. The GAN and SSN-based time-series behavior detection method of claim 6, wherein the generator comprises two fully-connected layers connected in sequence, and the input of the generator is a randomly drawn normally distributed vector.
8. The method according to claim 7, wherein the number of neurons in each of the two fully-connected layers is 4096, and the length of the vector is 100.
9. The method of claim 7, wherein the feature matching loss of the generator is:

L_G = || E_{(x_s, y) ~ P_action}[ψ(φ(x_s))] - E_{z ~ noise}[ψ(G(z))] ||²

where φ(·) denotes the feature extraction sub-network, ψ(·) denotes the classification sub-network, G(·) denotes the generator, P_action = {(x_s, y)} is the training set of behavior windows, x_s is a behavior window, and y is its ground-truth label;

and wherein the loss function of the discriminator is:

L_D = L_real + L_fake

where L_real is the classification loss on real samples and L_fake is the loss on generated fake samples:

L_real = E_{(x_s, y) ~ P_action}[-log P_D(y | x_s)] + E_{x_ns}[-log P_D(K+1 | x_ns)]

L_fake = E_{z ~ noise}[-log P_D(K+2 | G(z))]

the first term of L_real being the expectation of discriminating a behavior window as its behavior class and the second term the expectation of discriminating a background window as background; {o_1, ..., o_{K+2}} is the prediction vector, x_ns is a background window, E_{z ~ noise}[·] denotes the expectation over the noise distribution, and class K+2 represents the generated hard-example features.
CN201910599488.7A 2019-07-04 2019-07-04 Time sequence behavior detection method based on GAN and SSN Expired - Fee Related CN110414367B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910599488.7A CN110414367B (en) 2019-07-04 2019-07-04 Time sequence behavior detection method based on GAN and SSN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910599488.7A CN110414367B (en) 2019-07-04 2019-07-04 Time sequence behavior detection method based on GAN and SSN

Publications (2)

Publication Number Publication Date
CN110414367A CN110414367A (en) 2019-11-05
CN110414367B true CN110414367B (en) 2022-03-29

Family

ID=68360334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910599488.7A Expired - Fee Related CN110414367B (en) 2019-07-04 2019-07-04 Time sequence behavior detection method based on GAN and SSN

Country Status (1)

Country Link
CN (1) CN110414367B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325097B (en) * 2020-01-22 2023-04-07 陕西师范大学 Enhanced single-stage decoupled time sequence action positioning method
CN111368786A (en) * 2020-03-16 2020-07-03 平安科技(深圳)有限公司 Action region extraction method, device, equipment and computer readable storage medium
CN111832516B (en) * 2020-07-22 2023-08-18 西安电子科技大学 Video behavior recognition method based on unsupervised video representation learning
CN111931713B (en) * 2020-09-21 2021-01-29 成都睿沿科技有限公司 Abnormal behavior detection method and device, electronic equipment and storage medium
CN112749625B (en) * 2020-12-10 2023-12-15 深圳市优必选科技股份有限公司 Time sequence behavior detection method, time sequence behavior detection device and terminal equipment
CN113420598B (en) * 2021-05-25 2024-05-14 江苏大学 Time sequence action detection method based on decoupling of context information and proposal classification
CN114064471A (en) * 2021-11-11 2022-02-18 中国民用航空总局第二研究所 Ethernet/IP protocol fuzzy test method based on generation of countermeasure network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292249A * 2017-06-08 2017-10-24 深圳市唯特视科技有限公司 Temporal action detection method based on a structured segment network
CN107862331A * 2017-10-31 2018-03-30 华中科技大学 Unsafe behavior recognition method and system based on time series and CNN
CN108830252A * 2018-06-26 2018-11-16 哈尔滨工业大学 Convolutional neural network human action recognition method fusing global spatio-temporal features
CN109190524A * 2018-08-17 2019-01-11 南通大学 Human action recognition method based on a generative adversarial network
EP3499429A1 (en) * 2017-12-12 2019-06-19 Institute for Imformation Industry Behavior inference model building apparatus and method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292249A * 2017-06-08 2017-10-24 深圳市唯特视科技有限公司 Temporal action detection method based on a structured segment network
CN107862331A * 2017-10-31 2018-03-30 华中科技大学 Unsafe behavior recognition method and system based on time series and CNN
EP3499429A1 * 2017-12-12 2019-06-19 Institute for Imformation Industry Behavior inference model building apparatus and method
CN108830252A * 2018-06-26 2018-11-16 哈尔滨工业大学 Convolutional neural network human action recognition method fusing global spatio-temporal features
CN109190524A * 2018-08-17 2019-01-11 南通大学 Human action recognition method based on a generative adversarial network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jia-Xing Zhong et al., "Step-by-step Erasion, One-by-one Collection: A Weakly Supervised Temporal Action Detector," MM '18: Proceedings of the 26th ACM International Conference on Multimedia, October 2018, pp. 35-44. *
Yue Zhao et al., "Temporal Action Detection with Structured Segment Networks," Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2914-2923. *
Yang Hongyu et al., "Network abnormal behavior detection model based on adversarially learned inference," Journal of Computer Applications, vol. 39, no. 7, March 2019, pp. 1967-1972. *

Also Published As

Publication number Publication date
CN110414367A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN110414367B (en) Time sequence behavior detection method based on GAN and SSN
CN109146921B (en) Pedestrian target tracking method based on deep learning
Gupta et al. Sequential modeling of deep features for breast cancer histopathological image classification
CN108875624B (en) Face detection method based on multi-scale cascade dense connection neural network
CN108764085B (en) Crowd counting method based on generation of confrontation network
US11640714B2 (en) Video panoptic segmentation
CN109410184B (en) Live broadcast pornographic image detection method based on dense confrontation network semi-supervised learning
KR102132407B1 (en) Method and apparatus for estimating human emotion based on adaptive image recognition using incremental deep learning
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN111027377B (en) Double-flow neural network time sequence action positioning method
EP3596655B1 (en) Method and apparatus for analysing an image
CN111325237B (en) Image recognition method based on attention interaction mechanism
CN116030396B (en) Accurate segmentation method for video structured extraction
CN113591674B (en) Edge environment behavior recognition system for real-time video stream
Roy et al. Foreground segmentation using adaptive 3 phase background model
Tamou et al. Transfer learning with deep convolutional neural network for underwater live fish recognition
Hirzi et al. Literature study of face recognition using the viola-jones algorithm
Fernandez Garcia et al. AcousticIA, a deep neural network for multi-species fish detection using multiple models of acoustic cameras
CN108154172A (en) Image-recognizing method based on three decisions
CN117671597B (en) Method for constructing mouse detection model and mouse detection method and device
Krithika et al. MAFONN-EP: A minimal angular feature oriented neural network based emotion prediction system in image processing
CN109002808B (en) Human behavior recognition method and system
EP3627391A1 (en) Deep neural net for localising objects in images, methods for preparing such a neural net and for localising objects in images, corresponding computer program product, and corresponding computer-readable medium
Teršek et al. Re-evaluation of the CNN-based state-of-the-art crowd-counting methods with enhancements
Nishath et al. An Adaptive Classifier Based Approach for Crowd Anomaly Detection.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220329