CN110889375A - Hidden and double-flow cooperative learning network and method for behavior recognition - Google Patents


Info

Publication number
CN110889375A
CN110889375A (application CN201911189752.6A)
Authority
CN
China
Prior art keywords
flow
optical flow
video
hidden
static
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911189752.6A
Other languages
Chinese (zh)
Other versions
CN110889375B (en)
Inventor
周书仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN201911189752.6A priority Critical patent/CN110889375B/en
Publication of CN110889375A publication Critical patent/CN110889375A/en
Application granted granted Critical
Publication of CN110889375B publication Critical patent/CN110889375B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/192Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194References adjustable by an adaptive method, e.g. learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a hidden double-flow collaborative learning network and method for behavior recognition, wherein the hidden double-flow collaborative learning network comprises a hidden double-flow model and a collaborative learning model; the hidden double-flow model comprises a spatial stream for extracting discriminative static features and a hidden temporal stream for obtaining motion features; the collaborative learning model is used for optimizing the static and motion features, adaptively learning the fusion weight of each video category, and finally obtaining the prediction result. The invention can obtain the category of a video directly from its frame sequence and, by capturing the interaction of static and motion information through collaborative learning, mutually enhances the temporal and spatial features, thereby saving the storage space that existing methods require for storing optical flow images in advance.

Description

Hidden and double-flow cooperative learning network and method for behavior recognition
Technical Field
The invention relates to the technical field of video processing, in particular to a hidden double-flow cooperative learning network and a method for behavior recognition.
Background
Behavior recognition is the recognition of different actions from video clips (sequences of 2D frames). It can appear to be a simple extension of the image classification task to multiple frames, followed by aggregation of the per-frame predictions. Yet despite the great success of image classification, progress in video behavior recognition has been slow.
As video accounts for a growing proportion of internet traffic, and most videos center on people, understanding and analyzing video content is urgently needed. Video behavior recognition is a very important task with wide applications, such as video search, intelligent surveillance, human-computer interaction and elderly care. Video action recognition is a central problem in computer vision, and human action recognition in video has advanced significantly in recent years. Traditional hand-crafted features, such as improved dense trajectories (IDT), were the best, most stable and most reliable approach before deep learning was applied to the field, but they are slow to compute. Convolutional neural networks (CNNs) are typically orders of magnitude faster than IDT-based methods.
With the development of deep convolutional neural networks (CNNs), they have achieved state-of-the-art performance in image recognition tasks, and many studies have designed effective deep convolutional neural networks for action recognition.
The existing deep learning methods are mainly divided into two types:
The first is the dual-stream framework. The video is first cut into a frame sequence, and optical flow images are computed from the frames. Two convolutional neural networks are designed: a spatial stream convolutional network (spatial stream ConvNet) and a temporal stream convolutional network (temporal stream ConvNet). The spatial stream ConvNet convolves the video frame images to extract their features (spatial features, i.e. static appearance information), while the temporal stream ConvNet convolves the video optical flow images to extract their features (temporal features, i.e. motion information). The two streams are trained separately and finally fused in a simple manner to output the prediction result. The method of the invention is based on this dual-stream framework.
The second is the single-stream framework. The video is cut into a frame sequence, a 3D convolutional neural network is designed, and the frame sequence is fed directly into the 3D network to extract the spatio-temporal features of the video, which are then used to classify the behavior in the video.
The existing method adopts a two-stream convolutional network (Two-Stream Convolutional Network) for behavior recognition, with the following idea: a video consists of two parts, a spatial part and a temporal part. The spatial part is carried by individual frames and contains appearance information such as the scene and objects of the video. The temporal part is the motion across frames, which conveys the movement of the camera (the observer) and of objects in the video. The method therefore designs a framework for video behavior recognition with two streams, a spatial stream (Spatial ConvNet) and a temporal stream (Temporal ConvNet). Each stream is a deep convolutional network ending in a softmax layer, and the two softmax outputs are fused. Two fusion methods are considered: one averages the scores; the other trains a multi-class linear SVM on L2-normalized softmax scores.
The method comprises the following specific steps:
Step one: the video is cut into frames and the optical flow is extracted.
Concept of optical flow: optical flow uses the temporal changes of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby computes the motion information of objects between adjacent frames. Optical flow arises from the movement of foreground objects in the scene, the movement of the camera, or both.
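As a concrete illustration of this pre-extraction step, the sketch below computes a dense optical flow field between two consecutive frames with OpenCV's Farnebäck estimator; it merely stands in for TV-L1 style algorithms used by conventional methods, and the file names are placeholders.

import cv2

# Two consecutive frames (placeholder paths), converted to grayscale.
prev = cv2.cvtColor(cv2.imread("frame_0001.jpg"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_0002.jpg"), cv2.COLOR_BGR2GRAY)

# Dense optical flow: one displacement vector (dx, dy) per pixel.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)  # (height, width, 2): horizontal and vertical components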
Step two:
The input of the spatial stream is a frame picked at random from the given video; it then passes through convolutional layers and fully connected layers into a softmax layer, which outputs the class probability distribution.
The input of the temporal stream is obtained by selecting a time point in the video and stacking the optical flow of the N frames following that point into an optical flow stack for training; the stack passes through the network layers into a softmax layer, which outputs the class probability distribution.
The input of the temporal stream is formed by stacking the optical flow displacement fields between several successive frames. The stacking principle is as follows:
the dense optical flow can be seen as a set of displacement vector fields d between successive frames t and t +1 in pairst,dt(u, v) point (u, v) indicating the t-th frame is moved toThe displacement vector of the corresponding point in the next frame t + 1. The displacement vector is divided into two directions, the horizontal direction
Figure BDA0002293269280000021
In the vertical direction
Figure BDA0002293269280000022
Can be combined with
Figure BDA0002293269280000023
And
Figure BDA0002293269280000024
viewed as two channels of an image, the invention stacks the flow path of L successive frames in order to represent motion across a series of frames
Figure BDA0002293269280000031
To form a total of 2L input channels. Let w and h be the width and height of the video frame; then adding IτThe size is (w × h × 2L) time-stream convolutional neural network. The superimposed optical flow calculation formula is as follows:
Figure BDA0002293269280000032
Figure BDA0002293269280000033
for arbitrary points (u, v), Iτ(u,v,c);c=[1,2L]The motion of the point over the sequence of L frames is encoded. All point aggregation is the flow of the stack.
Step three: fusion
The scores of the two streams are fused to form the final classification result.
Two fusion methods are used: 1. a weighted average of the scores; 2. a support vector machine (SVM). Experimental results show that SVM fusion performs better; the SVM used is a linear method.
However, the existing methods have the following disadvantages: 1. they depend on optical flow extracted from the video in advance and must relearn features from that optical flow for action recognition, which reduces the efficiency of the whole network; 2. the final fusion is a simple weighted fusion that cannot capture the interaction between spatial and temporal features, and there are cases where one stream fails while the other succeeds, which degrades the overall recognition accuracy.
Disclosure of Invention
The invention mainly aims to provide a hidden double-flow collaborative learning network and method for behavior recognition, in order to solve the problems of low overall network efficiency and degraded recognition accuracy in existing behavior recognition methods.
In order to achieve the above object, the present invention provides a hidden double-flow collaborative learning network for behavior recognition, where the hidden double-flow collaborative learning network includes a hidden double-flow model and a collaborative learning model; the hidden double-flow model comprises a space flow for extracting discriminant static characteristics and a hidden time flow for obtaining motion characteristics; the collaborative learning model is used for optimizing static and motion characteristics, adaptively learning the fusion weight of each video category and finally obtaining a prediction result.
Preferably, the spatial stream is used to input a static frame of video into a convolutional neural network, capturing a static feature of a picture.
Preferably, the hidden temporal stream is divided into an optical flow estimation portion and a feature extraction portion for extracting motion features in the optical flow estimated by the optical flow estimation portion, wherein a network of the optical flow estimation portion calculates a plurality of penalties on a plurality of scales, the penalty on each scale being a weighted sum of a standard pixel reconstruction penalty, a smoothness penalty, and a region-based structural similarity penalty.
Preferably, the function of the standard pixel reconstruction loss is:

L_p = (1/(h·w)) Σ_{a=1..h} Σ_{b=1..w} ρ( I_1(a, b) − I_2(a + V^x_{a,b}, b + V^y_{a,b}) )

wherein V^x_{a,b} and V^y_{a,b} are the estimated optical flow in the horizontal and vertical directions at the pixel point (a, b), and h and w represent the height and width of the images I_1 and I_2;
the smoothness loss function is:

L_sm = ρ(∇_x V^x) + ρ(∇_y V^x) + ρ(∇_x V^y) + ρ(∇_y V^y)

wherein ∇_x V^x and ∇_y V^x are the gradients of the estimated optical flow field V^x in each direction, and ∇_x V^y and ∇_y V^y are the gradients of the optical flow field V^y;
the structural similarity loss function is:

L_ss = (1/N) Σ_{n=1..N} ( 1 − SSIM(I_1^{p_n}, I_2^{p_n}) ),
SSIM(p_1, p_2) = ( (2 μ_{p1} μ_{p2} + c_1)(2 σ_{p1p2} + c_2) ) / ( (μ_{p1}² + μ_{p2}² + c_1)(σ_{p1}² + σ_{p2}² + c_2) )

wherein I_1^{p_n} and I_2^{p_n} are local patches of the images I_1 and I_2, μ_{p1} and μ_{p2} are the means of the two image patches, σ_{p1} and σ_{p2} are the variances of the two image patches, σ_{p1p2} is their covariance, and c_1 and c_2 are two constants.
Preferably, the step of optimizing static and motion features in the collaborative learning model comprises:
(1) initializing the optimization coefficients on the frame features as z_s, with all N elements of z_s set to 1/N;
(2) repeating steps (3) to (6):
(3) merging the frame features V_s into a single vector O_s = V_s z_s as the video feature;
(4) using O_s to optimize the optical flow features and obtain the optimization coefficients z_m on the optical flow features;
(5) merging the optical flow features V_m into a single vector O_m = V_m z_m as the video feature;
(6) using O_m to optimize the frame features and obtain the optimization coefficients z_s on the frame features;
(7) until the loss function converges;
(8) returning the optimized frame features Ô_s and the optimized optical flow features Ô_m;
wherein the frame features are the static features and the optical flow features are the motion features; the optical flow features are V_m = [v_m^1, v_m^2, …, v_m^N], 1 is a vector with all elements equal to 1, O_s is the video feature obtained by merging the frame features at time t−1, z_m is the learned optimization coefficient on the optical flow features, O_m is the video feature merged from the optical flow features, W_m, Ŵ_m and w_m are weight parameters, and the frame features are V_s = [v_s^1, v_s^2, …, v_s^N].
Preferably, the step of adaptively learning the fusion weight of each video category in the collaborative learning model to obtain the prediction result includes:
different fusion weights for different classes of static and motion streams are learned adaptively, with the final classification result determined by the highest fusion score.
In order to achieve the above object, the present invention provides a hidden-double-flow cooperative learning method for behavior recognition, which includes the following steps:
the input video is firstly decomposed into a frame sequence, then the frame sequence is respectively sent into the spatial stream extraction discriminant static feature and the implicit time stream extraction discriminant static feature of the implicit double-stream cooperative learning network to obtain the motion feature, the cooperative learning network is carried out to optimize the static and motion features after the feature is obtained, the fusion weight of each video category is adaptively learned, and finally the prediction result is obtained.
The invention provides a novel network architecture that hides the optical flow extraction step inside the network structure, which greatly increases the speed of the network; on top of the dual streams, the dual-stream collaboration module captures the interaction between temporal and spatial features, thereby improving the recognition accuracy of the whole network.
Drawings
FIG. 1 is a flow chart of a hidden-double-flow cooperative learning method for behavior recognition according to the present invention;
FIG. 2 is a block diagram of a hidden-dual-flow collaborative learning network for behavior recognition according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1 and fig. 2, the hidden dual-stream collaborative learning method involves two models, the hidden dual-stream model and the collaborative learning model, and the specific steps are as follows: the input video is decomposed into a frame sequence, the frame sequence is fed into the spatial stream (Spatial stream CNN) to extract discriminative static features and into the hidden temporal stream (Hidden temporal stream CNN) to obtain motion features, the collaborative learning network then optimizes the static and motion features, the fusion weight of each video category is learned adaptively, and the prediction result is finally obtained. The hidden dual-stream model and the collaborative learning model are described in detail below.
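Before the two models are detailed, the overall flow can be summarized by the sketch below; the callables spatial_stream, hidden_temporal_stream and collaborative_learning are placeholders for the components described in the following sections, and class_weights stands for the per-class fusion weights learned later.

import numpy as np

def recognize(frames, spatial_stream, hidden_temporal_stream,
              collaborative_learning, class_weights):
    static_feats = spatial_stream(frames)          # static appearance features
    motion_feats = hidden_temporal_stream(frames)  # flow estimated inside the network
    # Collaborative learning: the two feature sets refine each other and
    # yield per-class scores for each stream.
    static_scores, motion_scores = collaborative_learning(static_feats, motion_feats)
    # Class-wise weighted fusion of the two streams' scores.
    fused = class_weights[:, 0] * static_scores + class_weights[:, 1] * motion_scores
    return int(np.argmax(fused))  # predicted behavior category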
1. Hidden dual-stream module
The invention aims to learn from the video frame sequence not only static appearance features but also features containing motion information, to serve as the basis for judging the action category of the video. Feeding static frames of the video into a convolutional neural network can effectively perform action recognition on still images, so the spatial stream of the invention has the same function as in the dual-stream network and is used to capture the static appearance information of the pictures. FlowNet demonstrated that optical flow can be estimated with a CNN, and the invention intends to use a CNN architecture to learn the optical flow information of the frame sequence and make it contribute to the human behavior recognition task. The details are as follows:
space flow of A
Static appearance features (color, lighting, texture, contours, etc.) are a useful cue in themselves, since some actions are closely tied to particular objects and scenes. The input of the spatial stream ConvNet of the invention is a still frame of the video, and action recognition from still images can be performed effectively. In fact, behavior classification from still frames (the spatial recognition stream) is quite competitive on its own. Since the spatial convolutional network is essentially an image classification architecture, it can be pre-trained on a large image classification dataset (e.g., the ImageNet challenge dataset), building on the latest advances in large-scale image recognition. Table 1 gives the architecture of the spatial-stream network in the hidden dual-stream model (M is set to the number of classes of the corresponding dataset).
TABLE 1 (provided as an image in the original publication)
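A hedged sketch of such a spatial stream, assuming a torchvision backbone pre-trained on ImageNet with its classifier replaced by an M-way layer; the ResNet-50 choice is only an illustration, not the architecture listed in Table 1.

import torch.nn as nn
from torchvision import models

def build_spatial_stream(M):
    # ImageNet-pretrained backbone; only the final classifier is replaced
    # so that it outputs M class scores for the target dataset.
    net = models.resnet50(pretrained=True)
    net.fc = nn.Linear(net.fc.in_features, M)
    return net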
B. Hidden temporal stream
Although many actions can be discriminated from a single frame image, some actions depend on temporal information, so the temporal stream of the original two-stream network takes optical flow images as input. In conventional methods, the optical flow images are obtained by running algorithms such as TV-L1 on the video. These optical flow images contain information that helps the behavior recognition task, but the original methods must extract the optical flow in advance, which is slow, and the optical flow images require additional storage space.
The invention treats optical flow estimation as an image reconstruction problem and uses the temporal-stream CNN to learn, from the frame sequence, optical flow information that is helpful to the recognition task. The invention seeks to generate effective optical flow for adjacent frames with a CNN: a pair of adjacent frames I_1 and I_2 is taken as input, and if the current frame can be reconstructed from the estimated optical flow and the next frame, the network is shown to have learned the motion information.
The hidden temporal flow is divided into an optical flow estimation section and a feature extraction section. The details of the optical flow estimation part network are shown in table 2, and the network structure of the feature extraction part is the same as that of the spatial flow network.
TABLE 2 (provided as an image in the original publication)
The invention computes multiple losses at multiple scales in the network of the optical flow estimation part. Specifically, three loss functions are used to help produce better optical flow, as follows:
The standard pixel reconstruction loss function is:

L_p = (1/(h·w)) Σ_{a=1..h} Σ_{b=1..w} ρ( I_1(a, b) − I_2(a + V^x_{a,b}, b + V^y_{a,b}) )    (1)

where V^x_{a,b} and V^y_{a,b} are the estimated optical flow in the horizontal and vertical directions at the pixel point (a, b), and h and w represent the height and width of the images I_1 and I_2. To reduce the influence of outliers, the invention adopts the Charbonnier penalty function ρ(x) = (x² + ε²)^α (an L1-loss variant first used as a loss function in LapSRN).
The smoothness loss function, which addresses the aperture problem that causes blur when motion is estimated in non-textured regions, is:

L_sm = ρ(∇_x V^x) + ρ(∇_y V^x) + ρ(∇_x V^y) + ρ(∇_y V^y)    (2)

where ∇_x V^x and ∇_y V^x are the gradients of the estimated optical flow field V^x in each direction, and likewise ∇_x V^y and ∇_y V^y are the gradients of the optical flow field V^y; ρ is the same as in equation (1).
The structural similarity (SSIM) loss function helps the network learn the structure of the frames. For two image patches p_1 and p_2:

SSIM(p_1, p_2) = ( (2 μ_{p1} μ_{p2} + c_1)(2 σ_{p1p2} + c_2) ) / ( (μ_{p1}² + μ_{p2}² + c_1)(σ_{p1}² + σ_{p2}² + c_2) )    (3)

where I_1^{p_n} and I_2^{p_n} are local patches of the images I_1 and I_2, with the patch size set to 8×8; μ_{p1} and μ_{p2} are the means of the two image patches, σ_{p1} and σ_{p2} are the variances of the two image patches, σ_{p1p2} is their covariance, and c_1 and c_2 are two constants that stabilize the division, set to 0.0001 and 0.001 respectively in the experiments. The SSIM loss of the invention, which compares the similarity of the two images I_1 and I_2, is defined as:

L_ss = (1/N) Σ_{n=1..N} ( 1 − SSIM(I_1^{p_n}, I_2^{p_n}) )    (4)

where N is the number of local patches that can be extracted from the image and n is the index of a local patch.

L_s = λ_1 L_p + λ_2 L_ss + λ_3 L_sm    (5)

The loss on each scale s is therefore a weighted sum of the pixel reconstruction loss, the smoothness loss and the region-based SSIM loss, and the total loss is:

L = Σ_s δ_s L_s    (6)

where the weights δ_s are set to balance the losses on the different scales and are equal in value.
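A minimal PyTorch-style sketch of equations (1)–(6) on a single scale, under the stated assumptions (Charbonnier penalty, 8×8 SSIM patches); the pixel and SSIM terms compare I1 with I2 warped back by the estimated flow, the usual reconstruction-based formulation, and lam corresponds to the weights λ1, λ2, λ3.

import torch
import torch.nn.functional as F

def charbonnier(x, eps=1e-3, alpha=0.45):
    # rho(x) = (x^2 + eps^2)^alpha, the robust penalty of equation (1).
    return (x * x + eps * eps) ** alpha

def warp(img, flow):
    # Backward-warp img (B, C, H, W) with the estimated flow (B, 2, H, W) in pixels.
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    gx = (xs.to(flow) + flow[:, 0]) / (W - 1) * 2 - 1
    gy = (ys.to(flow) + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack((gx, gy), dim=-1)               # (B, H, W, 2), range [-1, 1]
    return F.grid_sample(img, grid, align_corners=True)

def smoothness_loss(flow):
    # Equation (2): Charbonnier penalty on the flow gradients in x and y.
    dx = flow[..., :, 1:] - flow[..., :, :-1]
    dy = flow[..., 1:, :] - flow[..., :-1, :]
    return charbonnier(dx).mean() + charbonnier(dy).mean()

def ssim_loss(a, b, c1=1e-4, c2=1e-3, patch=8):
    # Equations (3)-(4): 1 - SSIM averaged over non-overlapping 8x8 patches.
    mu_a, mu_b = F.avg_pool2d(a, patch), F.avg_pool2d(b, patch)
    var_a = F.avg_pool2d(a * a, patch) - mu_a ** 2
    var_b = F.avg_pool2d(b * b, patch) - mu_b ** 2
    cov = F.avg_pool2d(a * b, patch) - mu_a * mu_b
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    return (1 - ssim).mean()

def scale_loss(I1, I2, flow, lam=(1.0, 1.0, 1.0)):
    # Equation (5); the total loss of equation (6) sums scale_loss over scales.
    I2_warped = warp(I2, flow)
    return (lam[0] * charbonnier(I1 - I2_warped).mean()
            + lam[1] * ssim_loss(I1, I2_warped)
            + lam[2] * smoothness_loss(flow))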
The feature extraction part has a CNN structure similar to that of the spatial stream. The optical flow estimated by the optical flow estimation part needs to be normalized before being fed into the CNN that extracts the features: flow values with magnitude larger than 20 pixels are first clipped to 20 pixels, and the flow is then rescaled to the range 0 to 255. This normalization is important for good temporal-stream performance. The temporal stream finally extracts features containing the optical flow information.
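A sketch of that normalization, assuming the estimated flow is a floating-point array in pixel units and using one common linear mapping:

import numpy as np

def normalize_flow(flow, bound=20.0):
    # Clip displacements to [-20, 20] pixels, then rescale linearly to [0, 255].
    flow = np.clip(flow, -bound, bound)
    return (flow + bound) * (255.0 / (2.0 * bound))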
2. Collaborative learning module
For the two streams and their derivatives, a careful inspection of their models reveals that in most misclassification cases one stream fails while the other is still correct, which affects the overall recognition accuracy. It is therefore not sufficient to simply average the outputs of the classifier layers. Instead, the invention seeks to make the spatial and temporal cues promote each other. To capture the interaction of spatial (static) information and temporal (motion) information, the invention lets the static features and the motion features interact: the collaborative learning module has a symmetric structure in which the static and motion information mutually guide the optimization of the static and motion features.
At time t, the optical flow features are optimized using the frame features, with the following formulas:

H_m = tanh( W_m V_m + ( Ŵ_m O_s ) 1^T )    (7)
z_m = softmax( w_m^T H_m )    (8)
O_m = V_m z_m    (9)

wherein the optical flow features are V_m = [v_m^1, v_m^2, …, v_m^N], 1 is a vector with all elements equal to 1, O_s is the video feature obtained by merging the frame features at time t−1, z_m is the learned optimization coefficient on the optical flow features, O_m is the video feature merged from the optical flow features, and W_m, Ŵ_m and w_m are weight parameters.
At time t+1, the frame features are optimized using the optical flow features in the symmetric manner, yielding the optimized frame features Ô_s.
In the collaborative learning module, the input is the frame features and optical flow features extracted by the hidden dual-stream model, and the output is the optimized frame features Ô_s and the optimized optical flow features Ô_m.
The specific algorithm steps are as follows:
1. Initialize the optimization coefficients on the frame features as z_s, with all N elements of z_s set to 1/N;
2. Repeat steps 3 to 6:
3. Merge the frame features V_s into a single vector O_s = V_s z_s as the video feature;
4. Using O_s, optimize the optical flow features through equations (7) and (8) and obtain the optimization coefficients z_m on the optical flow features;
5. Merge the optical flow features V_m into a single vector O_m as the video feature through equation (9);
6. Using O_m, optimize the frame features and obtain the optimization coefficients z_s on the frame features;
7. Until the loss function converges;
8. Return the optimized frame features Ô_s and the optimized optical flow features Ô_m.
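A heavily hedged sketch of this alternating refinement, assuming the attention-style reading of equations (7)–(9) above (soft weights over the N per-frame feature columns, guided by the other stream's merged feature); the parameter names W, W_hat and w are placeholders, and the convergence test is replaced by a fixed iteration count.

import torch
import torch.nn.functional as F

def guided_coefficients(V, O_other, W, W_hat, w):
    # Equations (7)-(8): attention coefficients over the N columns of V (d x N),
    # guided by the merged feature O_other (d,) of the other stream.
    H = torch.tanh(W @ V + (W_hat @ O_other).unsqueeze(1))   # (k, N)
    return F.softmax(w @ H, dim=-1)                          # (N,)

def collaborative_refine(Vs, Vm, params_s, params_m, iters=5):
    # Steps 1-8: alternately guide the flow features with the frame features
    # and vice versa, then return the merged (optimized) features.
    N = Vs.shape[1]
    zs = torch.full((N,), 1.0 / N)                  # step 1: uniform coefficients
    zm = torch.full((N,), 1.0 / N)
    for _ in range(iters):                          # step 2: repeat
        Os = Vs @ zs                                # step 3: merge frame features
        zm = guided_coefficients(Vm, Os, *params_m) # step 4: eqs. (7)-(8)
        Om = Vm @ zm                                # step 5: eq. (9)
        zs = guided_coefficients(Vs, Om, *params_s) # step 6
    return Vs @ zs, Vm @ zm                         # step 8: optimized features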
Since the prediction scores of each stream (static and motion) have been obtained, these scores could simply be summed to obtain the final result. However, static and motion information contribute differently to different video categories. Some categories, such as "archery" and "smoking", can be recognized mainly from static frames (static information), while other categories contain significant motion that is important for distinguishing them, such as "walking" and "crawling". Therefore, the invention adaptively learns different fusion weights of the static and motion streams for different classes.
The prediction scores of the i-th training sample on the j-th class are collected as the vector s_i^j = [s_i^{j,1}, s_i^{j,2}, …, s_i^{j,M}]^T, where c represents the number of categories and s_i^{j,m} represents the score of the m-th stream for the i-th training sample on the j-th category.
The invention represents the fusion weight of the j-th class as w_j, subject to the constraints that all elements of w_j are non-negative and sum to 1; the fusion weight of each category is learned separately to obtain the weight w_j.
The objective function maximizes, for each class j, the fused score on its positive training samples while penalizing the fused score assigned to the negative samples (λ is set to 5×10^-3). P_j is defined as follows:

P_j = (1/N_j) Σ_{i: y_i = e_j} w_j^T s_i^j

where N_j denotes the number of training samples of the j-th class and y_i is the label vector whose j-th element is 1 and whose other elements are 0. Maximizing P_j amounts to maximizing the product of w_j with the j-th class score vectors s_i^j of the positive samples, which also means minimizing the product of w_j with the score vectors s_i^k (k ≠ j) of the negative samples. The positive and negative terms thus respectively capture the relationship of the positive and negative training data to w_j, and λ balances the weights of the positive and negative samples. The objective can then be transformed into a linear program, and the fusion weights are learned by linear programming.
For the test data, the softmax score of each stream is first computed and stacked, and the prediction is made by:

ŷ = argmax_j ( w_j^T s^j )

The final classification result is determined by the highest fusion score.
The following beneficial effects can be obtained through the scheme:
1. The hidden temporal stream can estimate the optical flow information of a video from its frame sequence, and the category of the video behavior is then obtained directly in combination with the spatial stream, without the computational cost of extracting optical flow images in advance.
2. Through the collaborative learning module, the invention captures the interaction of static information (spatial stream) and motion information (hidden temporal stream), mutually enhancing the temporal and spatial features and improving the accuracy of video behavior recognition.
3. The invention can save the storage space required by the prior method for storing the optical flow image in advance.
In the description herein, references to the description of the term "one embodiment," "another embodiment," or "first through xth embodiments," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, method steps, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. The hidden double-flow cooperative learning network for behavior recognition is characterized by comprising a hidden double-flow model and a cooperative learning model; the hidden double-flow model comprises a space flow for extracting discriminant static characteristics and a hidden time flow for obtaining motion characteristics; the collaborative learning model is used for optimizing static and motion characteristics, adaptively learning the fusion weight of each video category and finally obtaining a prediction result.
2. The implicit dual-stream collaborative learning network for behavior recognition according to claim 1, wherein the spatial stream is used to input static frames of video into a convolutional neural network, capturing static features of pictures.
3. The hidden-dual-flow collaborative learning network for behavior recognition according to claim 1, wherein the hidden temporal flow is divided into an optical flow estimation portion and a feature extraction portion for extracting motion features in the optical flow estimated by the optical flow estimation portion, wherein the network of optical flow estimation portions computes a plurality of penalties on a plurality of scales, the penalty on each scale being a weighted sum of a standard pixel reconstruction penalty, a smoothness penalty, and a region-based structural similarity penalty.
4. The implicit dual-flow collaborative learning network for behavior recognition according to claim 3, wherein the function of the standard pixel reconstruction loss is:

L_p = (1/(h·w)) Σ_{a=1..h} Σ_{b=1..w} ρ( I_1(a, b) − I_2(a + V^x_{a,b}, b + V^y_{a,b}) )

wherein V^x_{a,b} and V^y_{a,b} are the estimated optical flow in the horizontal and vertical directions at the pixel point (a, b), and h and w represent the height and width of the images I_1 and I_2;
the smoothness loss function is:

L_sm = ρ(∇_x V^x) + ρ(∇_y V^x) + ρ(∇_x V^y) + ρ(∇_y V^y)

wherein ∇_x V^x and ∇_y V^x are the gradients of the estimated optical flow field V^x in each direction, and ∇_x V^y and ∇_y V^y are the gradients of the optical flow field V^y;
the structural similarity loss function is:

L_ss = (1/N) Σ_{n=1..N} ( 1 − SSIM(I_1^{p_n}, I_2^{p_n}) ),
SSIM(p_1, p_2) = ( (2 μ_{p1} μ_{p2} + c_1)(2 σ_{p1p2} + c_2) ) / ( (μ_{p1}² + μ_{p2}² + c_1)(σ_{p1}² + σ_{p2}² + c_2) )

wherein I_1^{p_n} and I_2^{p_n} are local patches of the images I_1 and I_2, μ_{p1} and μ_{p2} are the means of the two image patches, σ_{p1} and σ_{p2} are the variances of the two image patches, σ_{p1p2} is their covariance, and c_1 and c_2 are two constants.
5. The implicit dual-flow collaborative learning network for behavior recognition according to any of claims 1-4, wherein the step of optimizing static and motion features in the collaborative learning model comprises:
(1) initializing the optimization coefficients on the frame features as z_s, with all N elements of z_s set to 1/N;
(2) repeating steps (3) to (6):
(3) merging the frame features V_s into a single vector O_s = V_s z_s as the video feature;
(4) using O_s to optimize the optical flow features and obtain the optimization coefficients z_m on the optical flow features;
(5) merging the optical flow features V_m into a single vector O_m = V_m z_m as the video feature;
(6) using O_m to optimize the frame features and obtain the optimization coefficients z_s on the frame features;
(7) until the loss function converges;
(8) returning the optimized frame features Ô_s and the optimized optical flow features Ô_m;
wherein the frame features are the static features and the optical flow features are the motion features; the optical flow features are V_m = [v_m^1, v_m^2, …, v_m^N], 1 is a vector with all elements equal to 1, O_s is the video feature obtained by merging the frame features at time t−1, z_m is the learned optimization coefficient on the optical flow features, O_m is the video feature merged from the optical flow features, W_m, Ŵ_m and w_m are weight parameters, and the frame features are V_s = [v_s^1, v_s^2, …, v_s^N].
6. The implicit dual-flow collaborative learning network for behavior recognition according to any one of claims 1 to 4, wherein the step of adaptively learning the fusion weight of each video category in the collaborative learning model to obtain the prediction result comprises:
different fusion weights for different classes of static and motion streams are learned adaptively, with the final classification result determined by the highest fusion score.
7. A hidden double-flow cooperative learning method for behavior recognition is characterized by comprising the following steps of:
an input video is first decomposed into a frame sequence; the frame sequence is then fed into the spatial stream of the hidden double-flow collaborative learning network according to any one of claims 1 to 6 to extract discriminative static features and into its hidden temporal stream to obtain discriminative motion features; after the features are obtained, the collaborative learning network optimizes the static and motion features, adaptively learns the fusion weight of each video category, and finally obtains the prediction result.
CN201911189752.6A 2019-11-28 2019-11-28 Hidden-double-flow cooperative learning network and method for behavior recognition Active CN110889375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911189752.6A CN110889375B (en) 2019-11-28 2019-11-28 Hidden-double-flow cooperative learning network and method for behavior recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911189752.6A CN110889375B (en) 2019-11-28 2019-11-28 Hidden-double-flow cooperative learning network and method for behavior recognition

Publications (2)

Publication Number Publication Date
CN110889375A true CN110889375A (en) 2020-03-17
CN110889375B CN110889375B (en) 2022-12-20

Family

ID=69749221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911189752.6A Active CN110889375B (en) 2019-11-28 2019-11-28 Hidden-double-flow cooperative learning network and method for behavior recognition

Country Status (1)

Country Link
CN (1) CN110889375B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582230A (en) * 2020-05-21 2020-08-25 电子科技大学 Video behavior classification method based on space-time characteristics
CN111639548A (en) * 2020-05-11 2020-09-08 华南理工大学 Door-based video context multi-modal perceptual feature optimization method
CN111931603A (en) * 2020-07-22 2020-11-13 北方工业大学 Human body action recognition system and method based on double-current convolution network of competitive combination network
CN112025692A (en) * 2020-09-01 2020-12-04 广东工业大学 Control method and device for self-learning robot and electronic equipment
CN112115788A (en) * 2020-08-14 2020-12-22 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN112347996A (en) * 2020-11-30 2021-02-09 上海眼控科技股份有限公司 Scene state judgment method, device, equipment and storage medium
CN112767645A (en) * 2021-02-02 2021-05-07 南京恩博科技有限公司 Smoke identification method and device and electronic equipment
CN114821760A (en) * 2021-01-27 2022-07-29 四川大学 Human body abnormal behavior detection method based on double-flow space-time automatic coding machine

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150365696A1 (en) * 2014-06-13 2015-12-17 Texas Instruments Incorporated Optical flow determination using pyramidal block matching
CN105678216A (en) * 2015-12-21 2016-06-15 中国石油大学(华东) Spatio-temporal data stream video behavior recognition method based on deep learning
CN107220616A (en) * 2017-05-25 2017-09-29 北京大学 A kind of video classification methods of the two-way Cooperative Study based on adaptive weighting
CN109325430A (en) * 2018-09-11 2019-02-12 北京飞搜科技有限公司 Real-time Activity recognition method and system
US20190205629A1 (en) * 2018-01-04 2019-07-04 Beijing Kuangshi Technology Co., Ltd. Behavior predicton method, behavior predicton system, and non-transitory recording medium
CN110188637A (en) * 2019-05-17 2019-08-30 西安电子科技大学 A kind of Activity recognition technical method based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150365696A1 (en) * 2014-06-13 2015-12-17 Texas Instruments Incorporated Optical flow determination using pyramidal block matching
CN105678216A (en) * 2015-12-21 2016-06-15 中国石油大学(华东) Spatio-temporal data stream video behavior recognition method based on deep learning
CN107220616A (en) * 2017-05-25 2017-09-29 北京大学 A kind of video classification methods of the two-way Cooperative Study based on adaptive weighting
US20190205629A1 (en) * 2018-01-04 2019-07-04 Beijing Kuangshi Technology Co., Ltd. Behavior predicton method, behavior predicton system, and non-transitory recording medium
CN109325430A (en) * 2018-09-11 2019-02-12 北京飞搜科技有限公司 Real-time Activity recognition method and system
CN110188637A (en) * 2019-05-17 2019-08-30 西安电子科技大学 A kind of Activity recognition technical method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU, TIANLIANG ET AL.: "Human Action Recognition Fusing Spatial-Temporal Dual-Network Streams and Visual Attention", JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY *
YANG, MIAO: "Research on Fine-Tuning Algorithms of Two-Stream Convolutional Neural Networks for Action Recognition", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639548A (en) * 2020-05-11 2020-09-08 华南理工大学 Door-based video context multi-modal perceptual feature optimization method
CN111582230A (en) * 2020-05-21 2020-08-25 电子科技大学 Video behavior classification method based on space-time characteristics
CN111931603A (en) * 2020-07-22 2020-11-13 北方工业大学 Human body action recognition system and method based on double-current convolution network of competitive combination network
CN111931603B (en) * 2020-07-22 2024-01-12 北方工业大学 Human body action recognition system and method of double-flow convolution network based on competitive network
CN112115788A (en) * 2020-08-14 2020-12-22 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN112025692A (en) * 2020-09-01 2020-12-04 广东工业大学 Control method and device for self-learning robot and electronic equipment
CN112347996A (en) * 2020-11-30 2021-02-09 上海眼控科技股份有限公司 Scene state judgment method, device, equipment and storage medium
CN114821760A (en) * 2021-01-27 2022-07-29 四川大学 Human body abnormal behavior detection method based on double-flow space-time automatic coding machine
CN114821760B (en) * 2021-01-27 2023-10-27 四川大学 Human body abnormal behavior detection method based on double-flow space-time automatic encoder
CN112767645A (en) * 2021-02-02 2021-05-07 南京恩博科技有限公司 Smoke identification method and device and electronic equipment

Also Published As

Publication number Publication date
CN110889375B (en) 2022-12-20

Similar Documents

Publication Publication Date Title
CN110889375B (en) Hidden-double-flow cooperative learning network and method for behavior recognition
CN109711316B (en) Pedestrian re-identification method, device, equipment and storage medium
Sun et al. Lattice long short-term memory for human action recognition
CN109919122A (en) A kind of timing behavioral value method based on 3D human body key point
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN110334589B (en) High-time-sequence 3D neural network action identification method based on hole convolution
CN110378208B (en) Behavior identification method based on deep residual error network
CN112926396A (en) Action identification method based on double-current convolution attention
CN112070044B (en) Video object classification method and device
CN111260738A (en) Multi-scale target tracking method based on relevant filtering and self-adaptive feature fusion
CN111931603B (en) Human body action recognition system and method of double-flow convolution network based on competitive network
CN106650617A (en) Pedestrian abnormity identification method based on probabilistic latent semantic analysis
CN112801068B (en) Video multi-target tracking and segmenting system and method
CN113963032A (en) Twin network structure target tracking method fusing target re-identification
Yi et al. Human action recognition based on action relevance weighted encoding
CN114708649A (en) Behavior identification method based on integrated learning method and time attention diagram convolution
CN111382602A (en) Cross-domain face recognition algorithm, storage medium and processor
CN113807176A (en) Small sample video behavior identification method based on multi-knowledge fusion
CN116168329A (en) Video motion detection method, equipment and medium based on key frame screening pixel block
Li et al. Robust foreground segmentation based on two effective background models
CN112528077B (en) Video face retrieval method and system based on video embedding
CN109993151A (en) A kind of 3 D video visual attention detection method based on the full convolutional network of multimode
CN113239866A (en) Face recognition method and system based on space-time feature fusion and sample attention enhancement
CN115690917B (en) Pedestrian action identification method based on intelligent attention of appearance and motion
Huang et al. Temporally-aggregating multiple-discontinuous-image saliency prediction with transformer-based attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant