CN110889375A - Hidden and double-flow cooperative learning network and method for behavior recognition - Google Patents


Info

Publication number
CN110889375A
CN110889375A (application CN201911189752.6A)
Authority
CN
China
Prior art keywords
flow
optical flow
video
hidden
static
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911189752.6A
Other languages
Chinese (zh)
Other versions
CN110889375B (en)
Inventor
周书仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN201911189752.6A priority Critical patent/CN110889375B/en
Publication of CN110889375A publication Critical patent/CN110889375A/en
Application granted granted Critical
Publication of CN110889375B publication Critical patent/CN110889375B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/192Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194References adjustable by an adaptive method, e.g. learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a hidden double-flow collaborative learning network and method for behavior recognition, wherein the hidden double-flow collaborative learning network comprises a hidden double-flow model and a collaborative learning model; the hidden double-flow model comprises a spatial stream for extracting discriminative static features and a hidden temporal stream for obtaining motion features; the collaborative learning model is used for optimizing the static and motion features, adaptively learning the fusion weight of each video category, and finally obtaining the prediction result. The invention can obtain the category of a video directly from its frame sequence and, by capturing the interaction of static and motion information through collaborative learning, mutually enhances the temporal and spatial features, thereby saving the storage space that existing methods require for storing optical flow images in advance.

Description

Hidden and double-flow cooperative learning network and method for behavior recognition
Technical Field
The invention relates to the technical field of video processing, in particular to a hidden double-flow cooperative learning network and a method for behavior recognition.
Background
Behavior recognition is the recognition of different actions from video clips (sequences of 2D frames). It can appear to be a simple extension of the image classification task to multiple frames, followed by aggregation of the per-frame predictions. Yet despite the great success of image classification, progress in video behavior recognition has been slow.
As video accounts for a growing proportion of internet traffic, and most videos center on people, understanding and analyzing video content is urgently needed. Video behavior recognition is a very important task with wide applications, such as video search, intelligent surveillance, human-computer interaction and elderly care. Video action recognition is a central problem in computer vision, and human action recognition in video has advanced significantly in recent years. Traditional hand-crafted features, such as improved dense trajectories (IDT), were the best, most stable and most reliable approach before deep learning was applied to the field, but they are slow to compute. Convolutional neural networks (CNNs) are typically orders of magnitude faster than IDT-based methods.
With the development of deep convolutional neural networks (CNNs), they have achieved state-of-the-art performance in image recognition tasks, and many studies have designed effective deep convolutional neural networks for action recognition.
The existing deep learning methods are mainly divided into two types:
The first is the dual-stream framework. The video is first cut into a frame sequence, and optical flow images are computed from the frames. Two convolutional neural networks are designed: a spatial stream convolutional network (spatial stream ConvNet) and a temporal stream convolutional network (temporal stream ConvNet). The spatial stream ConvNet convolves the video frame images to extract their features (spatial features, i.e. static appearance information), while the temporal stream ConvNet convolves the video optical flow images to extract their features (temporal features, i.e. motion information). The two streams are trained separately and finally fused in a simple manner to output the prediction result. The method of the invention is based on this dual-stream framework.
The second is the single-stream framework. The video is cut into a frame sequence, a 3D convolutional neural network is designed, and the frame sequence is fed directly into the 3D network to extract the spatio-temporal features of the video, which are then used to classify the behavior in the video.
The existing method adopts a two-stream convolutional network (Two-Stream Convolutional Network) for behavior recognition, with the following idea: a video consists of two parts, a spatial part and a temporal part. The spatial part is carried by individual frames and contains appearance information such as the scene and objects of the video. The temporal part is the motion across frames, which conveys the movement of the camera (the observer) and of objects in the video. The method therefore designs a framework for video behavior recognition with two streams, a spatial stream (Spatial ConvNet) and a temporal stream (Temporal ConvNet). Each stream is a deep convolutional network ending in a softmax layer, and the two softmax outputs are fused. Two fusion methods are considered: one averages the scores; the other trains a multi-class linear SVM on L2-normalized softmax scores.
The method comprises the following specific steps:
Step one: the video is cut into frames and the optical flow is extracted.
Concept of optical flow: optical flow uses the temporal changes of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby computes the motion information of objects between adjacent frames. Optical flow arises from the movement of foreground objects in the scene, the movement of the camera, or both.
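As a concrete illustration of this pre-extraction step, the sketch below computes a dense optical flow field between two consecutive frames with OpenCV's Farnebäck estimator; it merely stands in for TV-L1 style algorithms used by conventional methods, and the file names are placeholders.

import cv2

# Two consecutive frames (placeholder paths), converted to grayscale.
prev = cv2.cvtColor(cv2.imread("frame_0001.jpg"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_0002.jpg"), cv2.COLOR_BGR2GRAY)

# Dense optical flow: one displacement vector (dx, dy) per pixel.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)  # (height, width, 2): horizontal and vertical components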
Step two:
The input of the spatial stream is a frame picked at random from the given video; it then passes through convolutional layers and fully connected layers into a softmax layer, which outputs the class probability distribution.
The input of the temporal stream is obtained by selecting a time point in the video and stacking the optical flow of the N frames following that point into an optical flow stack for training; the stack passes through the network layers into a softmax layer, which outputs the class probability distribution.
The input of the temporal stream is formed by stacking the optical flow displacement fields between several successive frames. The stacking principle is as follows:
the dense optical flow can be seen as a set of displacement vector fields d between successive frames t and t +1 in pairst,dt(u, v) point (u, v) indicating the t-th frame is moved toThe displacement vector of the corresponding point in the next frame t + 1. The displacement vector is divided into two directions, the horizontal direction
Figure BDA0002293269280000021
In the vertical direction
Figure BDA0002293269280000022
Can be combined with
Figure BDA0002293269280000023
And
Figure BDA0002293269280000024
viewed as two channels of an image, the invention stacks the flow path of L successive frames in order to represent motion across a series of frames
Figure BDA0002293269280000031
To form a total of 2L input channels. Let w and h be the width and height of the video frame; then adding IτThe size is (w × h × 2L) time-stream convolutional neural network. The superimposed optical flow calculation formula is as follows:
Figure BDA0002293269280000032
Figure BDA0002293269280000033
for arbitrary points (u, v), Iτ(u,v,c);c=[1,2L]The motion of the point over the sequence of L frames is encoded. All point aggregation is the flow of the stack.
Step three: fusion
The scores of the two streams are fused to form the final classification result.
Two fusion methods are used: 1. a weighted average of the scores; 2. a support vector machine (SVM). Experimental results show that SVM fusion performs better; the SVM used is a linear method.
However, the existing methods have the following disadvantages: 1. they depend on optical flow extracted from the video in advance and must relearn features from that optical flow for action recognition, which reduces the efficiency of the whole network; 2. the final fusion is a simple weighted fusion that cannot capture the interaction between spatial and temporal features, and there are cases where one stream fails while the other succeeds, which degrades the overall recognition accuracy.
Disclosure of Invention
The invention mainly aims to provide a hidden double-flow collaborative learning network and method for behavior recognition, in order to solve the problems of low overall network efficiency and degraded recognition accuracy in existing behavior recognition methods.
In order to achieve the above object, the present invention provides a hidden double-flow collaborative learning network for behavior recognition, where the hidden double-flow collaborative learning network includes a hidden double-flow model and a collaborative learning model; the hidden double-flow model comprises a space flow for extracting discriminant static characteristics and a hidden time flow for obtaining motion characteristics; the collaborative learning model is used for optimizing static and motion characteristics, adaptively learning the fusion weight of each video category and finally obtaining a prediction result.
Preferably, the spatial stream is used to input a static frame of video into a convolutional neural network, capturing a static feature of a picture.
Preferably, the hidden temporal stream is divided into an optical flow estimation portion and a feature extraction portion for extracting motion features in the optical flow estimated by the optical flow estimation portion, wherein a network of the optical flow estimation portion calculates a plurality of penalties on a plurality of scales, the penalty on each scale being a weighted sum of a standard pixel reconstruction penalty, a smoothness penalty, and a region-based structural similarity penalty.
Preferably, the function of the standard pixel reconstruction loss is:

L_p = (1/(h·w)) Σ_{a=1..h} Σ_{b=1..w} ρ( I_1(a, b) − I_2(a + V^x_{a,b}, b + V^y_{a,b}) )

wherein V^x_{a,b} and V^y_{a,b} are the estimated optical flow in the horizontal and vertical directions at the pixel point (a, b), and h and w represent the height and width of the images I_1 and I_2;
the smoothness loss function is:

L_sm = ρ(∇_x V^x) + ρ(∇_y V^x) + ρ(∇_x V^y) + ρ(∇_y V^y)

wherein ∇_x V^x and ∇_y V^x are the gradients of the estimated optical flow field V^x in each direction, and ∇_x V^y and ∇_y V^y are the gradients of the optical flow field V^y;
the structural similarity loss function is:

L_ss = (1/N) Σ_{n=1..N} ( 1 − SSIM(I_1^{p_n}, I_2^{p_n}) ),
SSIM(p_1, p_2) = ( (2 μ_{p1} μ_{p2} + c_1)(2 σ_{p1p2} + c_2) ) / ( (μ_{p1}² + μ_{p2}² + c_1)(σ_{p1}² + σ_{p2}² + c_2) )

wherein I_1^{p_n} and I_2^{p_n} are local patches of the images I_1 and I_2, μ_{p1} and μ_{p2} are the means of the two image patches, σ_{p1} and σ_{p2} are the variances of the two image patches, σ_{p1p2} is their covariance, and c_1 and c_2 are two constants.
Preferably, the step of optimizing static and motion features in the collaborative learning model comprises:
(1) initializing the optimization coefficients on the frame features as z_s, with all N elements of z_s set to 1/N;
(2) repeating steps (3) to (6):
(3) merging the frame features V_s into a single vector O_s = V_s z_s as the video feature;
(4) using O_s to optimize the optical flow features and obtain the optimization coefficients z_m on the optical flow features;
(5) merging the optical flow features V_m into a single vector O_m = V_m z_m as the video feature;
(6) using O_m to optimize the frame features and obtain the optimization coefficients z_s on the frame features;
(7) until the loss function converges;
(8) returning the optimized frame features Ô_s and the optimized optical flow features Ô_m;
wherein the frame features are the static features and the optical flow features are the motion features; the optical flow features are V_m = [v_m^1, v_m^2, …, v_m^N], 1 is a vector with all elements equal to 1, O_s is the video feature obtained by merging the frame features at time t−1, z_m is the learned optimization coefficient on the optical flow features, O_m is the video feature merged from the optical flow features, W_m, Ŵ_m and w_m are weight parameters, and the frame features are V_s = [v_s^1, v_s^2, …, v_s^N].
Preferably, the step of adaptively learning the fusion weight of each video category in the collaborative learning model to obtain the prediction result includes:
different fusion weights for different classes of static and motion streams are learned adaptively, with the final classification result determined by the highest fusion score.
In order to achieve the above object, the present invention provides a hidden-double-flow cooperative learning method for behavior recognition, which includes the following steps:
the input video is firstly decomposed into a frame sequence, then the frame sequence is respectively sent into the spatial stream extraction discriminant static feature and the implicit time stream extraction discriminant static feature of the implicit double-stream cooperative learning network to obtain the motion feature, the cooperative learning network is carried out to optimize the static and motion features after the feature is obtained, the fusion weight of each video category is adaptively learned, and finally the prediction result is obtained.
The invention provides a novel network architecture that hides the optical flow extraction step inside the network structure, which greatly increases the speed of the network; on top of the dual streams, the dual-stream collaboration module captures the interaction between temporal and spatial features, thereby improving the recognition accuracy of the whole network.
Drawings
FIG. 1 is a flow chart of a hidden-double-flow cooperative learning method for behavior recognition according to the present invention;
FIG. 2 is a block diagram of a hidden-dual-flow collaborative learning network for behavior recognition according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1 and fig. 2, the hidden dual-stream collaborative learning method involves two models, the hidden dual-stream model and the collaborative learning model, and the specific steps are as follows: the input video is decomposed into a frame sequence, the frame sequence is fed into the spatial stream (Spatial stream CNN) to extract discriminative static features and into the hidden temporal stream (Hidden temporal stream CNN) to obtain motion features, the collaborative learning network then optimizes the static and motion features, the fusion weight of each video category is learned adaptively, and the prediction result is finally obtained. The hidden dual-stream model and the collaborative learning model are described in detail below.
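Before the two models are detailed, the overall flow can be summarized by the sketch below; the callables spatial_stream, hidden_temporal_stream and collaborative_learning are placeholders for the components described in the following sections, and class_weights stands for the per-class fusion weights learned later.

import numpy as np

def recognize(frames, spatial_stream, hidden_temporal_stream,
              collaborative_learning, class_weights):
    static_feats = spatial_stream(frames)          # static appearance features
    motion_feats = hidden_temporal_stream(frames)  # flow estimated inside the network
    # Collaborative learning: the two feature sets refine each other and
    # yield per-class scores for each stream.
    static_scores, motion_scores = collaborative_learning(static_feats, motion_feats)
    # Class-wise weighted fusion of the two streams' scores.
    fused = class_weights[:, 0] * static_scores + class_weights[:, 1] * motion_scores
    return int(np.argmax(fused))  # predicted behavior category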
1. Hidden dual-stream module
The invention aims to learn from the video frame sequence not only static appearance features but also features containing motion information, to serve as the basis for judging the action category of the video. Feeding static frames of the video into a convolutional neural network can effectively perform action recognition on still images, so the spatial stream of the invention has the same function as in the dual-stream network and is used to capture the static appearance information of the pictures. FlowNet demonstrated that optical flow can be estimated with a CNN, and the invention intends to use a CNN architecture to learn the optical flow information of the frame sequence and make it contribute to the human behavior recognition task. The details are as follows:
space flow of A
Static appearance features (color, lighting, texture, contours, etc.) are a useful cue in themselves, since some actions are closely tied to particular objects and scenes. The input of the spatial stream ConvNet of the invention is a still frame of the video, and action recognition from still images can be performed effectively. In fact, behavior classification from still frames (the spatial recognition stream) is quite competitive on its own. Since the spatial convolutional network is essentially an image classification architecture, it can be pre-trained on a large image classification dataset (e.g., the ImageNet challenge dataset), building on the latest advances in large-scale image recognition. Table 1 gives the architecture of the spatial-stream network in the hidden dual-stream model (M is set to the number of classes of the corresponding dataset).
TABLE 1 (provided as an image in the original publication)
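A hedged sketch of such a spatial stream, assuming a torchvision backbone pre-trained on ImageNet with its classifier replaced by an M-way layer; the ResNet-50 choice is only an illustration, not the architecture listed in Table 1.

import torch.nn as nn
from torchvision import models

def build_spatial_stream(M):
    # ImageNet-pretrained backbone; only the final classifier is replaced
    # so that it outputs M class scores for the target dataset.
    net = models.resnet50(pretrained=True)
    net.fc = nn.Linear(net.fc.in_features, M)
    return net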
B. Hidden temporal stream
Although many actions can be discriminated from a single frame image, some actions depend on temporal information, so the temporal stream of the original two-stream network takes optical flow images as input. In conventional methods, the optical flow images are obtained by running algorithms such as TV-L1 on the video. These optical flow images contain information that helps the behavior recognition task, but the original methods must extract the optical flow in advance, which is slow, and the optical flow images require additional storage space.
The invention treats optical flow estimation as an image reconstruction problem and uses the temporal-stream CNN to learn, from the frame sequence, optical flow information that is helpful to the recognition task. The invention seeks to generate effective optical flow for adjacent frames with a CNN: a pair of adjacent frames I_1 and I_2 is taken as input, and if the current frame can be reconstructed from the estimated optical flow and the next frame, the network is shown to have learned the motion information.
The hidden temporal flow is divided into an optical flow estimation section and a feature extraction section. The details of the optical flow estimation part network are shown in table 2, and the network structure of the feature extraction part is the same as that of the spatial flow network.
TABLE 2 (provided as an image in the original publication)
The invention computes multiple losses at multiple scales in the network of the optical flow estimation part. Specifically, three loss functions are used to help produce better optical flow, as follows:
The standard pixel reconstruction loss function is:

L_p = (1/(h·w)) Σ_{a=1..h} Σ_{b=1..w} ρ( I_1(a, b) − I_2(a + V^x_{a,b}, b + V^y_{a,b}) )    (1)

where V^x_{a,b} and V^y_{a,b} are the estimated optical flow in the horizontal and vertical directions at the pixel point (a, b), and h and w represent the height and width of the images I_1 and I_2. To reduce the influence of outliers, the invention adopts the Charbonnier penalty function ρ(x) = (x² + ε²)^α (an L1-loss variant first used as a loss function in LapSRN).
The smoothness loss function, which addresses the aperture problem that causes blur when motion is estimated in non-textured regions, is:

L_sm = ρ(∇_x V^x) + ρ(∇_y V^x) + ρ(∇_x V^y) + ρ(∇_y V^y)    (2)

where ∇_x V^x and ∇_y V^x are the gradients of the estimated optical flow field V^x in each direction, and likewise ∇_x V^y and ∇_y V^y are the gradients of the optical flow field V^y; ρ is the same as in equation (1).
The structural similarity (SSIM) loss function helps the network learn the structure of the frames. For two image patches p_1 and p_2:

SSIM(p_1, p_2) = ( (2 μ_{p1} μ_{p2} + c_1)(2 σ_{p1p2} + c_2) ) / ( (μ_{p1}² + μ_{p2}² + c_1)(σ_{p1}² + σ_{p2}² + c_2) )    (3)

where I_1^{p_n} and I_2^{p_n} are local patches of the images I_1 and I_2, with the patch size set to 8×8; μ_{p1} and μ_{p2} are the means of the two image patches, σ_{p1} and σ_{p2} are the variances of the two image patches, σ_{p1p2} is their covariance, and c_1 and c_2 are two constants that stabilize the division, set to 0.0001 and 0.001 respectively in the experiments. The SSIM loss of the invention, which compares the similarity of the two images I_1 and I_2, is defined as:

L_ss = (1/N) Σ_{n=1..N} ( 1 − SSIM(I_1^{p_n}, I_2^{p_n}) )    (4)

where N is the number of local patches that can be extracted from the image and n is the index of a local patch.

L_s = λ_1 L_p + λ_2 L_ss + λ_3 L_sm    (5)

The loss on each scale s is therefore a weighted sum of the pixel reconstruction loss, the smoothness loss and the region-based SSIM loss, and the total loss is:

L = Σ_s δ_s L_s    (6)

where the weights δ_s are set to balance the losses on the different scales and are equal in value.
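A minimal PyTorch-style sketch of equations (1)–(6) on a single scale, under the stated assumptions (Charbonnier penalty, 8×8 SSIM patches); the pixel and SSIM terms compare I1 with I2 warped back by the estimated flow, the usual reconstruction-based formulation, and lam corresponds to the weights λ1, λ2, λ3.

import torch
import torch.nn.functional as F

def charbonnier(x, eps=1e-3, alpha=0.45):
    # rho(x) = (x^2 + eps^2)^alpha, the robust penalty of equation (1).
    return (x * x + eps * eps) ** alpha

def warp(img, flow):
    # Backward-warp img (B, C, H, W) with the estimated flow (B, 2, H, W) in pixels.
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    gx = (xs.to(flow) + flow[:, 0]) / (W - 1) * 2 - 1
    gy = (ys.to(flow) + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack((gx, gy), dim=-1)               # (B, H, W, 2), range [-1, 1]
    return F.grid_sample(img, grid, align_corners=True)

def smoothness_loss(flow):
    # Equation (2): Charbonnier penalty on the flow gradients in x and y.
    dx = flow[..., :, 1:] - flow[..., :, :-1]
    dy = flow[..., 1:, :] - flow[..., :-1, :]
    return charbonnier(dx).mean() + charbonnier(dy).mean()

def ssim_loss(a, b, c1=1e-4, c2=1e-3, patch=8):
    # Equations (3)-(4): 1 - SSIM averaged over non-overlapping 8x8 patches.
    mu_a, mu_b = F.avg_pool2d(a, patch), F.avg_pool2d(b, patch)
    var_a = F.avg_pool2d(a * a, patch) - mu_a ** 2
    var_b = F.avg_pool2d(b * b, patch) - mu_b ** 2
    cov = F.avg_pool2d(a * b, patch) - mu_a * mu_b
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    return (1 - ssim).mean()

def scale_loss(I1, I2, flow, lam=(1.0, 1.0, 1.0)):
    # Equation (5); the total loss of equation (6) sums scale_loss over scales.
    I2_warped = warp(I2, flow)
    return (lam[0] * charbonnier(I1 - I2_warped).mean()
            + lam[1] * ssim_loss(I1, I2_warped)
            + lam[2] * smoothness_loss(flow))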
The feature extraction part has a CNN structure similar to that of the spatial stream. The optical flow estimated by the optical flow estimation part needs to be normalized before being fed into the CNN that extracts the features: flow values with magnitude larger than 20 pixels are first clipped to 20 pixels, and the flow is then rescaled to the range 0 to 255. This normalization is important for good temporal-stream performance. The temporal stream finally extracts features containing the optical flow information.
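A sketch of that normalization, assuming the estimated flow is a floating-point array in pixel units and using one common linear mapping:

import numpy as np

def normalize_flow(flow, bound=20.0):
    # Clip displacements to [-20, 20] pixels, then rescale linearly to [0, 255].
    flow = np.clip(flow, -bound, bound)
    return (flow + bound) * (255.0 / (2.0 * bound))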
2. Collaborative learning module
For the two streams and their derivatives, a careful inspection of their models reveals that in most misclassification cases one stream fails while the other is still correct, which affects the overall recognition accuracy. It is therefore not sufficient to simply average the outputs of the classifier layers. Instead, the invention seeks to make the spatial and temporal cues promote each other. To capture the interaction of spatial (static) information and temporal (motion) information, the invention lets the static features and the motion features interact: the collaborative learning module has a symmetric structure in which the static and motion information mutually guide the optimization of the static and motion features.
At time t, the optical flow features are optimized using the frame features, with the following formulas:

H_m = tanh( W_m V_m + ( Ŵ_m O_s ) 1^T )    (7)
z_m = softmax( w_m^T H_m )    (8)
O_m = V_m z_m    (9)

wherein the optical flow features are V_m = [v_m^1, v_m^2, …, v_m^N], 1 is a vector with all elements equal to 1, O_s is the video feature obtained by merging the frame features at time t−1, z_m is the learned optimization coefficient on the optical flow features, O_m is the video feature merged from the optical flow features, and W_m, Ŵ_m and w_m are weight parameters.
At time t+1, the frame features are optimized using the optical flow features in the symmetric manner, yielding the optimized frame features Ô_s.
In the collaborative learning module, the input is the frame features and optical flow features extracted by the hidden dual-stream model, and the output is the optimized frame features Ô_s and the optimized optical flow features Ô_m.
The specific algorithm steps are as follows:
1. Initialize the optimization coefficients on the frame features as z_s, with all N elements of z_s set to 1/N;
2. Repeat steps 3 to 6:
3. Merge the frame features V_s into a single vector O_s = V_s z_s as the video feature;
4. Using O_s, optimize the optical flow features through equations (7) and (8) and obtain the optimization coefficients z_m on the optical flow features;
5. Merge the optical flow features V_m into a single vector O_m as the video feature through equation (9);
6. Using O_m, optimize the frame features and obtain the optimization coefficients z_s on the frame features;
7. Until the loss function converges;
8. Return the optimized frame features Ô_s and the optimized optical flow features Ô_m.
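A heavily hedged sketch of this alternating refinement, assuming the attention-style reading of equations (7)–(9) above (soft weights over the N per-frame feature columns, guided by the other stream's merged feature); the parameter names W, W_hat and w are placeholders, and the convergence test is replaced by a fixed iteration count.

import torch
import torch.nn.functional as F

def guided_coefficients(V, O_other, W, W_hat, w):
    # Equations (7)-(8): attention coefficients over the N columns of V (d x N),
    # guided by the merged feature O_other (d,) of the other stream.
    H = torch.tanh(W @ V + (W_hat @ O_other).unsqueeze(1))   # (k, N)
    return F.softmax(w @ H, dim=-1)                          # (N,)

def collaborative_refine(Vs, Vm, params_s, params_m, iters=5):
    # Steps 1-8: alternately guide the flow features with the frame features
    # and vice versa, then return the merged (optimized) features.
    N = Vs.shape[1]
    zs = torch.full((N,), 1.0 / N)                  # step 1: uniform coefficients
    zm = torch.full((N,), 1.0 / N)
    for _ in range(iters):                          # step 2: repeat
        Os = Vs @ zs                                # step 3: merge frame features
        zm = guided_coefficients(Vm, Os, *params_m) # step 4: eqs. (7)-(8)
        Om = Vm @ zm                                # step 5: eq. (9)
        zs = guided_coefficients(Vs, Om, *params_s) # step 6
    return Vs @ zs, Vm @ zm                         # step 8: optimized features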
Since the prediction scores of each stream (static and motion) have been obtained, these scores could simply be summed to obtain the final result. However, static and motion information contribute differently to different video categories. Some categories, such as "archery" and "smoking", can be recognized mainly from static frames (static information), while other categories contain significant motion that is important for distinguishing them, such as "walking" and "crawling". Therefore, the invention adaptively learns different fusion weights of the static and motion streams for different classes.
The prediction scores of the i-th training sample on the j-th class are collected as the vector s_i^j = [s_i^{j,1}, s_i^{j,2}, …, s_i^{j,M}]^T, where c represents the number of categories and s_i^{j,m} represents the score of the m-th stream for the i-th training sample on the j-th category.
The invention represents the fusion weight of the j-th class as w_j, subject to the constraints that all elements of w_j are non-negative and sum to 1; the fusion weight of each category is learned separately to obtain the weight w_j.
The objective function maximizes, for each class j, the fused score on its positive training samples while penalizing the fused score assigned to the negative samples (λ is set to 5×10^-3). P_j is defined as follows:

P_j = (1/N_j) Σ_{i: y_i = e_j} w_j^T s_i^j

where N_j denotes the number of training samples of the j-th class and y_i is the label vector whose j-th element is 1 and whose other elements are 0. Maximizing P_j amounts to maximizing the product of w_j with the j-th class score vectors s_i^j of the positive samples, which also means minimizing the product of w_j with the score vectors s_i^k (k ≠ j) of the negative samples. The positive and negative terms thus respectively capture the relationship of the positive and negative training data to w_j, and λ balances the weights of the positive and negative samples. The objective can then be transformed into a linear program, and the fusion weights are learned by linear programming.
For the test data, the softmax score of each stream is first computed and stacked, and the prediction is made by:

ŷ = argmax_j ( w_j^T s^j )

The final classification result is determined by the highest fusion score.
The following beneficial effects can be obtained through the scheme:
1. The hidden temporal stream can estimate the optical flow information of a video from its frame sequence, and the category of the video behavior is then obtained directly in combination with the spatial stream, without the computational cost of extracting optical flow images in advance.
2. Through the collaborative learning module, the invention captures the interaction of static information (spatial stream) and motion information (hidden temporal stream), mutually enhancing the temporal and spatial features and improving the accuracy of video behavior recognition.
3. The invention can save the storage space required by the prior method for storing the optical flow image in advance.
In the description herein, references to the description of the term "one embodiment," "another embodiment," or "first through xth embodiments," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, method steps, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. The hidden double-flow cooperative learning network for behavior recognition is characterized by comprising a hidden double-flow model and a cooperative learning model; the hidden double-flow model comprises a space flow for extracting discriminant static characteristics and a hidden time flow for obtaining motion characteristics; the collaborative learning model is used for optimizing static and motion characteristics, adaptively learning the fusion weight of each video category and finally obtaining a prediction result.
2. The implicit dual-stream collaborative learning network for behavior recognition according to claim 1, wherein the spatial stream is used to input static frames of video into a convolutional neural network, capturing static features of pictures.
3. The hidden-dual-flow collaborative learning network for behavior recognition according to claim 1, wherein the hidden temporal flow is divided into an optical flow estimation portion and a feature extraction portion for extracting motion features in the optical flow estimated by the optical flow estimation portion, wherein the network of optical flow estimation portions computes a plurality of penalties on a plurality of scales, the penalty on each scale being a weighted sum of a standard pixel reconstruction penalty, a smoothness penalty, and a region-based structural similarity penalty.
4. The implicit dual-flow collaborative learning network for behavior recognition according to claim 3, wherein the function of the standard pixel reconstruction loss is:

L_p = (1/(h·w)) Σ_{a=1..h} Σ_{b=1..w} ρ( I_1(a, b) − I_2(a + V^x_{a,b}, b + V^y_{a,b}) )

wherein V^x_{a,b} and V^y_{a,b} are the estimated optical flow in the horizontal and vertical directions at the pixel point (a, b), and h and w represent the height and width of the images I_1 and I_2;
the smoothness loss function is:

L_sm = ρ(∇_x V^x) + ρ(∇_y V^x) + ρ(∇_x V^y) + ρ(∇_y V^y)

wherein ∇_x V^x and ∇_y V^x are the gradients of the estimated optical flow field V^x in each direction, and ∇_x V^y and ∇_y V^y are the gradients of the optical flow field V^y;
the structural similarity loss function is:

L_ss = (1/N) Σ_{n=1..N} ( 1 − SSIM(I_1^{p_n}, I_2^{p_n}) ),
SSIM(p_1, p_2) = ( (2 μ_{p1} μ_{p2} + c_1)(2 σ_{p1p2} + c_2) ) / ( (μ_{p1}² + μ_{p2}² + c_1)(σ_{p1}² + σ_{p2}² + c_2) )

wherein I_1^{p_n} and I_2^{p_n} are local patches of the images I_1 and I_2, μ_{p1} and μ_{p2} are the means of the two image patches, σ_{p1} and σ_{p2} are the variances of the two image patches, σ_{p1p2} is their covariance, and c_1 and c_2 are two constants.
5. The implicit dual-flow collaborative learning network for behavior recognition according to any of claims 1-4, wherein the step of optimizing static and motion features in the collaborative learning model comprises:
(1) initializing the optimization coefficients on the frame features as z_s, with all N elements of z_s set to 1/N;
(2) repeating steps (3) to (6):
(3) merging the frame features V_s into a single vector O_s = V_s z_s as the video feature;
(4) using O_s to optimize the optical flow features and obtain the optimization coefficients z_m on the optical flow features;
(5) merging the optical flow features V_m into a single vector O_m = V_m z_m as the video feature;
(6) using O_m to optimize the frame features and obtain the optimization coefficients z_s on the frame features;
(7) until the loss function converges;
(8) returning the optimized frame features Ô_s and the optimized optical flow features Ô_m;
wherein the frame features are the static features and the optical flow features are the motion features; the optical flow features are V_m = [v_m^1, v_m^2, …, v_m^N], 1 is a vector with all elements equal to 1, O_s is the video feature obtained by merging the frame features at time t−1, z_m is the learned optimization coefficient on the optical flow features, O_m is the video feature merged from the optical flow features, W_m, Ŵ_m and w_m are weight parameters, and the frame features are V_s = [v_s^1, v_s^2, …, v_s^N].
6. The implicit dual-flow collaborative learning network for behavior recognition according to any one of claims 1 to 4, wherein the step of adaptively learning the fusion weight of each video category in the collaborative learning model to obtain the prediction result comprises:
different fusion weights for different classes of static and motion streams are learned adaptively, with the final classification result determined by the highest fusion score.
7. A hidden double-flow cooperative learning method for behavior recognition is characterized by comprising the following steps of:
an input video is first decomposed into a frame sequence; the frame sequence is then fed into the spatial stream of the hidden double-flow collaborative learning network according to any one of claims 1 to 6 to extract discriminative static features and into its hidden temporal stream to obtain discriminative motion features; after the features are obtained, the collaborative learning network optimizes the static and motion features, adaptively learns the fusion weight of each video category, and finally obtains the prediction result.
CN201911189752.6A 2019-11-28 2019-11-28 Hidden-double-flow cooperative learning network and method for behavior recognition Active CN110889375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911189752.6A CN110889375B (en) 2019-11-28 2019-11-28 Hidden-double-flow cooperative learning network and method for behavior recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911189752.6A CN110889375B (en) 2019-11-28 2019-11-28 Hidden-double-flow cooperative learning network and method for behavior recognition

Publications (2)

Publication Number Publication Date
CN110889375A true CN110889375A (en) 2020-03-17
CN110889375B CN110889375B (en) 2022-12-20

Family

ID=69749221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911189752.6A Active CN110889375B (en) 2019-11-28 2019-11-28 Hidden-double-flow cooperative learning network and method for behavior recognition

Country Status (1)

Country Link
CN (1) CN110889375B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582230A (en) * 2020-05-21 2020-08-25 电子科技大学 Video behavior classification method based on space-time characteristics
CN111639548A (en) * 2020-05-11 2020-09-08 华南理工大学 Door-based video context multi-modal perceptual feature optimization method
CN111931603A (en) * 2020-07-22 2020-11-13 北方工业大学 Human body action recognition system and method based on double-current convolution network of competitive combination network
CN112025692A (en) * 2020-09-01 2020-12-04 广东工业大学 Control method and device for self-learning robot and electronic equipment
CN112115788A (en) * 2020-08-14 2020-12-22 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN112347996A (en) * 2020-11-30 2021-02-09 上海眼控科技股份有限公司 Scene state judgment method, device, equipment and storage medium
CN112767645A (en) * 2021-02-02 2021-05-07 南京恩博科技有限公司 Smoke identification method and device and electronic equipment
CN114821760A (en) * 2021-01-27 2022-07-29 四川大学 Human body abnormal behavior detection method based on double-flow space-time automatic coding machine

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150365696A1 (en) * 2014-06-13 2015-12-17 Texas Instruments Incorporated Optical flow determination using pyramidal block matching
CN105678216A (en) * 2015-12-21 2016-06-15 中国石油大学(华东) Spatio-temporal data stream video behavior recognition method based on deep learning
CN107220616A (en) * 2017-05-25 2017-09-29 北京大学 A kind of video classification methods of the two-way Cooperative Study based on adaptive weighting
CN109325430A (en) * 2018-09-11 2019-02-12 北京飞搜科技有限公司 Real-time Activity recognition method and system
US20190205629A1 (en) * 2018-01-04 2019-07-04 Beijing Kuangshi Technology Co., Ltd. Behavior predicton method, behavior predicton system, and non-transitory recording medium
CN110188637A (en) * 2019-05-17 2019-08-30 西安电子科技大学 A kind of Activity recognition technical method based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150365696A1 (en) * 2014-06-13 2015-12-17 Texas Instruments Incorporated Optical flow determination using pyramidal block matching
CN105678216A (en) * 2015-12-21 2016-06-15 中国石油大学(华东) Spatio-temporal data stream video behavior recognition method based on deep learning
CN107220616A (en) * 2017-05-25 2017-09-29 北京大学 A kind of video classification methods of the two-way Cooperative Study based on adaptive weighting
US20190205629A1 (en) * 2018-01-04 2019-07-04 Beijing Kuangshi Technology Co., Ltd. Behavior predicton method, behavior predicton system, and non-transitory recording medium
CN109325430A (en) * 2018-09-11 2019-02-12 北京飞搜科技有限公司 Real-time Activity recognition method and system
CN110188637A (en) * 2019-05-17 2019-08-30 西安电子科技大学 A kind of Activity recognition technical method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU, TIANLIANG ET AL.: "Human Action Recognition Fusing Spatial-Temporal Dual-Network Streams and Visual Attention", JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY *
YANG, MIAO: "Research on Fine-Tuning Algorithms of Two-Stream Convolutional Neural Networks for Action Recognition", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639548A (en) * 2020-05-11 2020-09-08 华南理工大学 Door-based video context multi-modal perceptual feature optimization method
CN111582230A (en) * 2020-05-21 2020-08-25 电子科技大学 Video behavior classification method based on space-time characteristics
CN111931603A (en) * 2020-07-22 2020-11-13 北方工业大学 Human body action recognition system and method based on double-current convolution network of competitive combination network
CN111931603B (en) * 2020-07-22 2024-01-12 北方工业大学 Human body action recognition system and method of double-flow convolution network based on competitive network
CN112115788A (en) * 2020-08-14 2020-12-22 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN112025692A (en) * 2020-09-01 2020-12-04 广东工业大学 Control method and device for self-learning robot and electronic equipment
CN112347996A (en) * 2020-11-30 2021-02-09 上海眼控科技股份有限公司 Scene state judgment method, device, equipment and storage medium
CN114821760A (en) * 2021-01-27 2022-07-29 四川大学 Human body abnormal behavior detection method based on double-flow space-time automatic coding machine
CN114821760B (en) * 2021-01-27 2023-10-27 四川大学 Human body abnormal behavior detection method based on double-flow space-time automatic encoder
CN112767645A (en) * 2021-02-02 2021-05-07 南京恩博科技有限公司 Smoke identification method and device and electronic equipment

Also Published As

Publication number Publication date
CN110889375B (en) 2022-12-20

Similar Documents

Publication Publication Date Title
CN110889375B (en) Hidden-double-flow cooperative learning network and method for behavior recognition
CN109711316B (en) Pedestrian re-identification method, device, equipment and storage medium
Sun et al. Lattice long short-term memory for human action recognition
CN109919122A (en) A kind of timing behavioral value method based on 3D human body key point
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN110334589B (en) High-time-sequence 3D neural network action identification method based on hole convolution
CN110378208B (en) Behavior identification method based on deep residual error network
CN112926396A (en) Action identification method based on double-current convolution attention
CN112070044B (en) Video object classification method and device
CN111260738A (en) Multi-scale target tracking method based on relevant filtering and self-adaptive feature fusion
CN111931603B (en) Human body action recognition system and method of double-flow convolution network based on competitive network
CN106650617A (en) Pedestrian abnormity identification method based on probabilistic latent semantic analysis
CN112801068B (en) Video multi-target tracking and segmenting system and method
CN113963032A (en) Twin network structure target tracking method fusing target re-identification
Yi et al. Human action recognition based on action relevance weighted encoding
CN114708649A (en) Behavior identification method based on integrated learning method and time attention diagram convolution
CN111382602A (en) Cross-domain face recognition algorithm, storage medium and processor
CN113807176A (en) Small sample video behavior identification method based on multi-knowledge fusion
CN116168329A (en) Video motion detection method, equipment and medium based on key frame screening pixel block
Li et al. Robust foreground segmentation based on two effective background models
CN112528077B (en) Video face retrieval method and system based on video embedding
CN109993151A (en) A kind of 3 D video visual attention detection method based on the full convolutional network of multimode
CN113239866A (en) Face recognition method and system based on space-time feature fusion and sample attention enhancement
CN115690917B (en) Pedestrian action identification method based on intelligent attention of appearance and motion
Huang et al. Temporally-aggregating multiple-discontinuous-image saliency prediction with transformer-based attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant