CN110889375A - Hidden and double-flow cooperative learning network and method for behavior recognition - Google Patents
- Publication number
- CN110889375A CN110889375A CN201911189752.6A CN201911189752A CN110889375A CN 110889375 A CN110889375 A CN 110889375A CN 201911189752 A CN201911189752 A CN 201911189752A CN 110889375 A CN110889375 A CN 110889375A
- Authority
- CN
- China
- Prior art keywords
- flow
- optical flow
- video
- static
- hidden
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/192—Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
- G06V30/194—References adjustable by an adaptive method, e.g. learning
Abstract
The invention discloses a hidden double-flow cooperative learning network and method for behavior recognition. The hidden double-flow cooperative learning network comprises a hidden double-flow model and a cooperative learning model. The hidden double-flow model comprises a spatial stream for extracting discriminative static features and a hidden temporal stream for obtaining motion features. The cooperative learning model is used to optimize the static and motion features, adaptively learn the fusion weight of each video category, and finally obtain the prediction result. The invention can obtain the category of a video directly from its frame sequence; through cooperative learning it captures the interaction of static and motion information so that the temporal and spatial features mutually enhance each other, and it saves the storage space that prior methods require for storing optical flow images in advance.
Description
Technical Field
The invention relates to the technical field of video processing, in particular to a hidden double-flow cooperative learning network and a method for behavior recognition.
Background
Behavior recognition means recognizing different actions from clips (2D frame sequences) of a video. It appears to be a simple extension of the image classification task to multiple frames: classify each frame, then aggregate the per-frame predictions. Yet despite the great success of image classification, progress in video behavior recognition has been slow.
As internet video accounts for an ever higher proportion of internet traffic, and most videos feature human subjects, comprehension and analysis of video is urgently needed. Video behavior recognition is a very important task with wide applications such as video search, intelligent monitoring, human-computer interaction and elderly care, and is a central problem in computer vision. In recent years, human motion recognition in video has advanced significantly. Traditional hand-crafted methods such as improved dense trajectories (IDT) were the best, most stable and most reliable before deep learning entered the field, but they are slow; convolutional neural networks (CNNs) are typically orders of magnitude faster than IDT.
With the development of deep Convolutional Neural Networks (CNN), it achieved the most advanced performance in the task of image recognition. Many studies have designed effective deep convolutional neural networks for motion recognition.
The existing deep learning methods are mainly divided into two types:
The first is the two-stream method: a video is first cut into a frame sequence, from which optical flow images are computed. Two convolutional neural networks are designed: a spatial-stream ConvNet and a temporal-stream ConvNet. The spatial-stream ConvNet convolves the video frame images to extract spatial features (static appearance information), while the temporal-stream ConvNet convolves the optical flow images to extract temporal features (motion information). The two streams are trained separately and finally fused in a simple way to output a prediction result. The method of the present invention is based on this two-stream framework.
The second is the single-stream method: a video is cut into a frame sequence, which is fed directly into a 3D convolutional neural network to extract the spatio-temporal features of the video; these features are then used to classify the behavior in the video.
The existing method performs behavior recognition with a Two-Stream Convolutional Network, with the following idea: video consists of a spatial part and a temporal part. The spatial part, carried by individual frames, contains appearance information of the video such as scenes and objects. The temporal part, carried by motion across frames, contains the movement of the camera (the observer) and of objects in the video. The method therefore divides the recognition framework into two streams: a spatial stream (Spatial ConvNet) and a temporal stream (Temporal ConvNet). Both are deep convolutional networks whose softmax outputs are finally merged. Two fusion methods are considered: one is averaging; the other is training a multi-class linear SVM on L2-normalized softmax scores.
The method comprises the following specific steps:
the method comprises the following steps: the video is cut into frames and the optical flow is extracted.
Concept of optical flow: optical flow uses the temporal variation of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby compute the motion of objects between adjacent frames. Optical flow arises from the movement of foreground objects in the scene, the movement of the camera, or both.
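The frame-correspondence idea can be illustrated with a toy brute-force block-matching estimator (a minimal numpy sketch of the concept only, not the TV-L1 or CNN-based estimators discussed in this patent; all names here are ours): for a patch of the previous frame, search a small neighborhood of the current frame for the best-matching position, and take the offset as that patch's displacement.

```python
import numpy as np

def block_match(prev, curr, y, x, size=8, radius=4):
    """Estimate the displacement of the size x size patch of `prev` at (y, x)
    by exhaustively searching a (2*radius+1)^2 neighborhood in `curr`."""
    patch = prev[y:y + size, x:x + size]
    best, best_dv = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            cand = curr[y + dy:y + dy + size, x + dx:x + dx + size]
            err = np.sum((patch - cand) ** 2)  # sum of squared differences
            if err < best:
                best, best_dv = err, (dx, dy)
    return best_dv  # (horizontal, vertical) displacement

# Synthetic pair: frame 2 is frame 1 shifted 2 px right and 1 px down.
rng = np.random.default_rng(0)
frame1 = rng.random((32, 32))
frame2 = np.roll(np.roll(frame1, 1, axis=0), 2, axis=1)
print(block_match(frame1, frame2, y=12, x=12))  # (2, 1)
```

Real estimators replace this exhaustive search with variational optimization or, as in this patent, a CNN, but the output has the same form: a displacement vector per point.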
Step two:
The input of the spatial stream is a randomly chosen frame of the given video; the frame passes through the convolutional layers and fully connected layers into a softmax, which produces a probability distribution over the classes.
The input of the temporal stream is obtained by choosing a time point in the video and stacking the optical flow of the N frames after the chosen time into an optical flow stack for training; the stack passes through the network layers into a softmax, which produces a probability distribution.
The input of the temporal stream is formed by stacking the optical flow displacement fields between several successive frames. The stacking principle is as follows:
Dense optical flow can be seen as a set of displacement vector fields d_t between pairs of consecutive frames t and t+1; d_t(u, v) denotes the displacement vector that moves the point (u, v) of the t-th frame to the corresponding point in the next frame t+1. The displacement vector has two components, a horizontal component d_t^x and a vertical component d_t^y, which can be viewed as two channels of an image. To represent motion across a series of frames, the invention stacks the flow channels d^x and d^y of L consecutive frames to form a total of 2L input channels. Let w and h be the width and height of a video frame; the input volume I_τ of size w × h × 2L is then fed into the temporal-stream convolutional neural network. The stacked optical flow is computed as:

I_τ(u, v, 2k−1) = d_{τ+k−1}^x(u, v)
I_τ(u, v, 2k) = d_{τ+k−1}^y(u, v),   u ∈ [1; w], v ∈ [1; h], k ∈ [1; L]

For an arbitrary point (u, v), the channels I_τ(u, v, c), c = [1; 2L], encode the motion of that point over the sequence of L frames. The aggregation of all points forms the stacked flow.
Step three: fusion
The final classification result is formed by fusing the score values of the two streams.
The fusion methods are: 1. a weighted average of the scores; 2. a support vector machine (SVM) trained on the stacked scores. Experimental results show that SVM fusion performs better; the SVM used here is a linear one.
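The weighted-average variant of this late fusion can be sketched as follows (a minimal numpy example with made-up toy scores; the SVM variant would instead train a classifier on the stacked L2-normalized softmax scores):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse(spatial_logits, temporal_logits, w_spatial=0.5):
    """Late fusion: weighted average of the two streams' softmax scores."""
    p_s = softmax(spatial_logits)
    p_t = softmax(temporal_logits)
    fused = w_spatial * p_s + (1.0 - w_spatial) * p_t
    return int(np.argmax(fused)), fused

# Toy 3-class scores: spatial mildly prefers class 0,
# temporal strongly prefers class 2; the fused score picks class 2.
label, fused = fuse(np.array([2.0, 1.0, 1.5]), np.array([0.0, 0.5, 4.0]))
print(label)  # 2
```

Note that a single global weight is shared by all classes here; the cooperative learning model described later in this patent replaces it with per-class fusion weights.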
However, the existing method has the following disadvantages: 1. It depends on optical flow extracted from the video in advance, and then re-learns features from the optical flow for action recognition, which reduces the efficiency of the whole network. 2. The final simple weighted fusion cannot capture the interaction between spatial and temporal features; in many misclassified cases one stream fails while the other succeeds, which hurts overall recognition accuracy.
Disclosure of Invention
The main purpose of the invention is to provide a hidden double-flow cooperative learning network and method for behavior recognition, aiming to solve the low overall network efficiency and limited recognition accuracy of existing behavior recognition methods.
In order to achieve the above object, the present invention provides a hidden double-flow collaborative learning network for behavior recognition, where the hidden double-flow collaborative learning network includes a hidden double-flow model and a collaborative learning model; the hidden double-flow model comprises a space flow for extracting discriminant static characteristics and a hidden time flow for obtaining motion characteristics; the collaborative learning model is used for optimizing static and motion characteristics, adaptively learning the fusion weight of each video category and finally obtaining a prediction result.
Preferably, the spatial stream is used to input a static frame of video into a convolutional neural network, capturing a static feature of a picture.
Preferably, the hidden temporal stream is divided into an optical flow estimation part and a feature extraction part that extracts motion features from the optical flow estimated by the optical flow estimation part; the network of the optical flow estimation part computes losses on multiple scales, the loss on each scale being a weighted sum of a standard pixel reconstruction loss, a smoothness loss, and a region-based structural similarity loss.
Preferably, the function of the standard pixel reconstruction loss is:

L_p = (1/(h·w)) Σ_{a=1..h} Σ_{b=1..w} ρ( I_1(a, b) − I_2(a + V_x(a, b), b + V_y(a, b)) )

wherein V_x(a, b) and V_y(a, b) are the estimated optical flow in the horizontal and vertical directions at pixel point (a, b), and h and w represent the height and width of the images I_1 and I_2;

the smoothness loss function is:

L_sm = ρ(∇_x V_x) + ρ(∇_y V_x) + ρ(∇_x V_y) + ρ(∇_y V_y)

wherein ∇_x V_x and ∇_y V_x are the gradients of the estimated optical flow field V_x in each direction, and ∇_x V_y and ∇_y V_y are the gradients of the optical flow field V_y;

the structural similarity loss function is based on:

SSIM(p_1, p_2) = ((2 μ_p1 μ_p2 + c_1)(2 σ_p1p2 + c_2)) / ((μ_p1² + μ_p2² + c_1)(σ_p1² + σ_p2² + c_2))

wherein p_1 and p_2 are respectively local patches of the images I_1 and I_2, μ_p1 and μ_p2 are the mean values of the image patches, σ_p1 and σ_p2 are the variances of the two image patches, σ_p1p2 is their covariance, and c_1 and c_2 are two constants.
Preferably, the step of optimizing static and motion features in the collaborative learning model comprises:

(1) initializing the optimization coefficient z_s on the frame features, all N elements of z_s being set to 1/N;
(2) repeating steps (3) to (6):
(3) merging the frame features into a single vector O_s as a video feature;
(4) using O_s to optimize the optical flow features and obtain the optimization coefficient z_m on the optical flow features;
(5) merging the optical flow features into a single vector O_m as a video feature;
(6) using O_m to optimize the frame features and obtain the optimization coefficient z_s on the frame features;
(7) until the loss function converges;

wherein the frame features are the static features and the optical flow features are the motion features; 1 is a vector with all elements 1; O_s is the video feature merged from the frame features at time t−1; z_m is the learned optimization coefficient on the optical flow features; O_m is the video feature merged from the optical flow features; and W_m and the remaining projection matrices are weight parameters.
Preferably, the step of adaptively learning the fusion weight of each video category in the collaborative learning model to obtain the prediction result includes:
different fusion weights for different classes of static and motion streams are learned adaptively, with the final classification result determined by the highest fusion score.
In order to achieve the above object, the present invention provides a hidden-double-flow cooperative learning method for behavior recognition, which includes the following steps:
An input video is first decomposed into a frame sequence. The frame sequence is fed into the spatial stream of the hidden double-flow cooperative learning network to extract discriminative static features, and into the hidden temporal stream to obtain motion features. After the features are obtained, the cooperative learning model optimizes the static and motion features, adaptively learns the fusion weight of each video category, and finally obtains the prediction result.
The invention provides a novel network architecture that hides the optical flow extraction step inside the network structure, which greatly accelerates the network; on top of the two streams, the cooperative module captures the interaction between temporal and spatial features, improving the recognition accuracy of the whole network.
Drawings
FIG. 1 is a flow chart of a hidden-double-flow cooperative learning method for behavior recognition according to the present invention;
FIG. 2 is a block diagram of a hidden-dual-flow collaborative learning network for behavior recognition according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1 and fig. 2, the hidden double-flow cooperative learning method comprises two models: a hidden double-flow model and a cooperative learning model. The specific steps are as follows: an input video is decomposed into a frame sequence, which is sent into a spatial stream (Spatial stream CNN) to extract discriminative static features and into a hidden temporal stream (Hidden temporal stream CNN) to obtain motion features; after the features are obtained, the cooperative learning model optimizes the static and motion features, adaptively learns the fusion weight of each video category, and finally obtains the prediction result. The hidden double-flow model and the cooperative learning model are described in detail below.
1. Hidden double-flow module
The invention aims to learn from a video frame sequence not only static appearance features but also features containing motion information, as the basis for judging the action class of the video. Inputting static frames of a video into a convolutional neural network already realizes effective action recognition from still images, so the spatial stream of the invention plays the same role as in the two-stream network: it captures the static appearance information of the pictures. FlowNet demonstrated that optical flow can be estimated with a CNN; the invention intends to learn the optical flow information of a frame sequence with a CNN architecture and make it serve the human behavior recognition task. The details are as follows:
space flow of A
Static appearance features (color, lighting, texture, contours, etc.) are themselves a useful cue, since some actions are closely related to particular objects and scenes. The input of the spatial stream ConvNet of the invention is a still frame of the video, which effectively realizes action recognition from still images. In fact, behavior classification from still frames (the spatial recognition stream) is quite competitive by itself. Since the spatial convolutional network is essentially an image classification architecture, it can be pre-trained on large image classification datasets (e.g., the ImageNet challenge dataset), building on the latest advances in large-scale image recognition. Table 1 gives the architecture of the spatial stream network in the hidden double-flow model (M is set to the number of classes of the corresponding dataset).
TABLE 1
B. Hidden temporal stream
Although many actions can be discriminated from a single frame image, some actions depend on temporal information, so the temporal stream of the original two-stream network takes optical flow images as input. Conventionally, the optical flow images are obtained by running algorithms such as TV-L1 on the video. The optical flow images contain information helpful for the behavior recognition task, but this approach requires the optical flow to be extracted in advance; extraction is slow, and the optical flow images need additional storage space.
The invention regards optical flow estimation as an image reconstruction problem and uses a temporal-stream CNN to learn the optical flow information of a frame sequence that is helpful for the recognition task. The invention seeks to generate effective optical flow from adjacent frames with a CNN: a pair of adjacent frames I_1 and I_2 is taken as input, and if the current frame can be reconstructed from the estimated optical flow and the next frame, the network has demonstrably learned the motion information.
The hidden temporal flow is divided into an optical flow estimation section and a feature extraction section. The details of the optical flow estimation part network are shown in table 2, and the network structure of the feature extraction part is the same as that of the spatial flow network.
TABLE 2
The invention computes multiple losses at multiple scales in the network of the optical flow estimation part. Specifically, three loss functions are used to help produce better optical flow; the loss functions are as follows:
The standard pixel reconstruction loss function is:

L_p = (1/(h·w)) Σ_{a=1..h} Σ_{b=1..w} ρ( I_1(a, b) − I_2(a + V_x(a, b), b + V_y(a, b)) )   (1)

wherein V_x(a, b) and V_y(a, b) are the estimated optical flow in the horizontal and vertical directions at pixel point (a, b), and h and w represent the height and width of the images I_1 and I_2. In order to reduce the influence of outliers, the invention adopts the generalized Charbonnier penalty function ρ(x) = (x² + ε²)^α (an L1 loss variant first used as a loss function in LapSRN).
L_sm is the smoothness loss function, which addresses the aperture problem that causes blurred estimates when motion is estimated in non-textured regions:

L_sm = ρ(∇_x V_x) + ρ(∇_y V_x) + ρ(∇_x V_y) + ρ(∇_y V_y)   (2)

wherein ∇_x V_x and ∇_y V_x are the gradients of the estimated optical flow field V_x in each direction; likewise, ∇_x V_y and ∇_y V_y are the gradients of the optical flow field V_y. ρ is the same penalty function as in equation (1).
The structural similarity (SSIM) loss function helps the network learn the structure of the frames:

SSIM(p_1, p_2) = ((2 μ_p1 μ_p2 + c_1)(2 σ_p1p2 + c_2)) / ((μ_p1² + μ_p2² + c_1)(σ_p1² + σ_p2² + c_2))   (3)

wherein p_1 and p_2 are respectively local patches of the images I_1 and I_2, with the patch size set to 8 × 8; μ_p1 and μ_p2 are the mean values of the patches, σ_p1 and σ_p2 are their variances, and σ_p1p2 is their covariance; c_1 and c_2 are two constants that stabilize the division, set to 0.0001 and 0.001 respectively in the experiments.

To compare the similarity between the two images I_1 and I_2, the SSIM loss function of the invention is defined as:

L_ss = (1/N) Σ_{n=1..N} ( 1 − SSIM(p_1^n, p_2^n) )   (4)

where N is the number of local patches that can be extracted from an image and n is the index of a local patch.
The loss on each scale s is therefore a weighted sum of the pixel reconstruction loss, the smoothness loss and the region-based SSIM loss:

L_s = λ_1 L_p + λ_2 L_ss + λ_3 L_sm   (5)

and the total loss is the weighted sum over all scales:

L = Σ_s δ_s L_s   (6)

where the weights δ_s are set to balance the losses on the different scales so that they are of comparable magnitude.
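Two of the three losses and the per-scale weighted sum of equation (5) can be sketched in numpy as follows (a minimal illustration: the ε and α values are assumptions on our part, the warped image I_2 is assumed to be precomputed from the estimated flow, and the SSIM term is omitted for brevity):

```python
import numpy as np

def charbonnier(x, eps=1e-3, alpha=0.45):
    # Generalized Charbonnier penalty rho(x) = (x^2 + eps^2)^alpha
    return (x ** 2 + eps ** 2) ** alpha

def reconstruction_loss(I1, I2_warped):
    # Eq. (1): penalize the per-pixel difference between I1 and the warped I2
    h, w = I1.shape
    return charbonnier(I1 - I2_warped).sum() / (h * w)

def smoothness_loss(Vx, Vy):
    # Eq. (2): penalize the spatial gradients of both flow components
    gy_x, gx_x = np.gradient(Vx)   # np.gradient returns (d/dy, d/dx)
    gy_y, gx_y = np.gradient(Vy)
    return (charbonnier(gx_x).mean() + charbonnier(gy_x).mean()
            + charbonnier(gx_y).mean() + charbonnier(gy_y).mean())

def scale_loss(Lp, Lss, Lsm, lambdas=(1.0, 1.0, 1.0)):
    # Eq. (5): weighted sum of the three losses at one scale
    l1, l2, l3 = lambdas
    return l1 * Lp + l2 * Lss + l3 * Lsm

I1 = np.random.default_rng(1).random((16, 16))
Lp = reconstruction_loss(I1, I1)                              # perfect warp
Lsm = smoothness_loss(np.ones((16, 16)), np.zeros((16, 16)))  # constant flow
print(Lp, Lsm)  # both near the penalty floor (eps^2)^alpha
```

As the toy values show, a perfectly reconstructed frame and a perfectly smooth flow field both drive their losses to the Charbonnier floor rather than exactly zero, which is what makes the penalty differentiable everywhere.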
The feature extraction part has the same CNN structure as the spatial stream. The optical flow estimated by the optical flow estimation part must be normalized before being fed into the feature extraction CNN: flow values with magnitude greater than 20 pixels are first clipped to 20 pixels, and the result is then rescaled to the range 0 to 255. This normalization is important for good temporal stream performance. Finally, the temporal stream extracts features that contain the optical flow information.
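The clip-then-rescale normalization can be sketched as follows (a minimal numpy sketch; mapping the clipped range [−20, 20] linearly onto [0, 255] is our assumption about how the two stated steps combine):

```python
import numpy as np

def normalize_flow(flow, bound=20.0):
    """Clip flow values to [-bound, bound], then rescale linearly to [0, 255]."""
    clipped = np.clip(flow, -bound, bound)
    return (clipped + bound) * (255.0 / (2.0 * bound))

flow = np.array([-50.0, -20.0, 0.0, 20.0, 50.0])
print(normalize_flow(flow))  # [  0.    0.  127.5 255.  255. ]
```

Zero motion thus maps to the mid-gray value 127.5, so the normalized flow can be treated like an ordinary image by the feature-extraction CNN.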
2. Cooperative learning module
For the two-stream network and its derivatives, careful inspection of the models reveals that in most misclassification cases one stream fails while the other is still correct, which hurts overall recognition accuracy. Simply averaging the outputs of the classifier layers is therefore not sufficient. Instead, the invention seeks to let the spatial and temporal cues promote each other. To capture the interaction between spatial (static) and temporal (motion) information, the invention lets the static and motion features interact: the cooperative learning module has a symmetric structure in which the static and motion information mutually guide the optimization of each other's features.
At time t, the frame features are used to optimize the optical flow features. With the optical flow features V_m = [v_m^1, …, v_m^N] and the video feature O_s merged from the frame features at time t−1, the optimization takes the attention form:

e_m = w_m^T tanh( W_m V_m + (U_m O_s) 1^T )   (7)
z_m = softmax(e_m)   (8)
O_m = V_m z_m   (9)

wherein 1 is a vector with all elements 1, O_s is the video feature merged from the frame features at time t−1, z_m is the learned optimization coefficient on the optical flow features, O_m is the video feature merged from the optical flow features, and W_m, U_m and w_m are weight parameters.
At time t+1, the optical flow features are used to optimize the frame features in the same symmetric way. In the collaborative learning module, the input is the frame and optical flow features extracted by the hidden double-flow model, and the output is the optimized frame features and optical flow features.
The specific algorithm steps are as follows:

1. Initialize the optimization coefficient z_s on the frame features, all N elements of z_s being set to 1/N;
2. Repeat steps 3 to 6:
3. Merge the frame features into a single vector O_s as a video feature;
4. Using equations (7) and (8), use O_s to optimize the optical flow features and obtain the optimization coefficient z_m on the optical flow features;
5. Using equation (9), merge the optical flow features V_m into a single vector O_m as a video feature;
6. Use O_m to optimize the frame features and obtain the optimization coefficient z_s on the frame features;
7. Until the loss function converges.
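The alternating steps above can be sketched as follows. This is a minimal numpy sketch under explicit assumptions: the attention-style form of equations (7)-(9) is our reading of the text (the exact formulas appear only in the patent figures), a single set of weight parameters is reused for both directions for brevity where the patent names separate parameters per stream, and the repeat loop is truncated instead of checking loss convergence.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(V, O_other, W, U, w):
    """One optimization step: weight the N column features of V (d x N) by
    coefficients z computed against the other stream's merged video feature
    O_other (d x 1), then merge them into a single video feature."""
    ones = np.ones((1, V.shape[1]))                # the all-ones vector "1"
    e = w @ np.tanh(W @ V + (U @ O_other) @ ones)  # scores, shape (N,)
    z = softmax(e)                                 # optimization coefficients
    return V @ z, z                                # merged feature, weights

rng = np.random.default_rng(0)
d, N = 8, 5
Vs, Vm = rng.random((d, N)), rng.random((d, N))    # frame / flow features
W, U, w = rng.random((d, d)), rng.random((d, d)), rng.random(d)

z_s = np.full(N, 1.0 / N)                          # step 1: init z_s = 1/N
for _ in range(3):                                 # step 2: repeat (truncated)
    O_s = Vs @ z_s                                 # step 3: merge frame feats
    O_m, z_m = attend(Vm, O_s.reshape(-1, 1), W, U, w)  # steps 4-5
    _, z_s = attend(Vs, O_m.reshape(-1, 1), W, U, w)    # step 6
print(z_m.sum(), z_s.sum())  # each coefficient vector sums to 1
```

In training, the weight parameters would be learned jointly with the rest of the network, and the loop would run until the loss converges as stated in step 7.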
Since the prediction scores of each stream (static and motion) have been obtained, these scores could simply be summed to get the final result. However, static and motion information contribute differently to different video categories. Some categories, such as "archery" and "smoking", can be identified primarily from static frames (static information), while other categories contain significant motion that is important for distinguishing them, such as "walking" and "crawling". The invention therefore adaptively learns different fusion weights for the static and motion streams of each class.
The prediction score of the ith training sample in the jth class is denoted S_i^j = [s_{i,1}^j, s_{i,2}^j], where c represents the number of classes and s_{i,m}^j represents the score of the mth stream for the ith training sample in the jth class.
The invention denotes the fusion weight of the jth class as W_j = [w_{j,1}, w_{j,2}], with the constraint that the weights are non-negative. The fusion weight of each class is learned separately, yielding the weight W_j.
The objective function is as follows (λ is set to 5 × 10⁻³): the fusion weight W_j is learned by maximizing a positive term P_j while penalizing, with weight λ, the agreement with samples of other classes. P_j is defined over the training data of the jth class, where N_j denotes the number of training samples of the jth class and the target indicator vector has its jth element equal to 1 and all other elements 0. Maximizing P_j amounts to maximizing the inner product of W_j with the scores S_i^j of class-j samples, which also means minimizing the inner product of W_j with the scores S_i^k of samples of the other classes k ≠ j. The positive and negative terms thus consider the relationship of W_j to the positive and negative training data respectively, and λ balances their weights. Equation (10) can then be transformed into a linear program.
the fusion weights are learned by linear programming, and for the test data the softmax score for each stream is first calculated and superimposed, as:
by the formulaAnd (3) prediction:the final classification result is determined by the highest fusion score.
The following beneficial effects can be obtained through the scheme:
1. The hidden temporal stream can estimate the optical flow information of a video from its frame sequence; combined with the spatial stream, the category of the video behavior is then obtained directly, without the computational cost of extracting optical flow images in advance.
2. The invention captures the interaction of static information (spatial stream) and motion information (hidden temporal stream) through the collaborative learning module, so that the temporal and spatial features mutually enhance each other, improving the accuracy of video behavior recognition.
3. The invention can save the storage space required by the prior method for storing the optical flow image in advance.
In the description herein, references to the description of the term "one embodiment," "another embodiment," or "first through xth embodiments," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, method steps, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (7)
1. A hidden dual-stream collaborative learning network for behavior recognition, characterized by comprising a hidden dual-stream model and a collaborative learning model; the hidden dual-stream model comprises a spatial stream for extracting discriminative static features and a hidden temporal stream for obtaining motion features; the collaborative learning model is used for optimizing the static and motion features, adaptively learning the fusion weight of each video category, and finally obtaining a prediction result.
2. The hidden dual-stream collaborative learning network for behavior recognition according to claim 1, wherein the spatial stream inputs static frames of the video into a convolutional neural network to capture the static features of the pictures.
3. The hidden dual-stream collaborative learning network for behavior recognition according to claim 1, wherein the hidden temporal stream is divided into an optical-flow estimation part and a feature-extraction part that extracts motion features from the optical flow estimated by the former; the network of the optical-flow estimation part computes losses at multiple scales, the loss at each scale being a weighted sum of a standard pixel reconstruction loss, a smoothness loss, and a region-based structural similarity loss.
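The multi-scale weighted sum of claim 3 can be sketched in a few lines of numpy. The Charbonnier penalty, the per-term weights, and the constant stand-in for the structural similarity term are all assumptions for illustration, not values from the filing:

```python
import numpy as np

def charbonnier(x, eps=1e-3):
    """Robust penalty; a common choice for reconstruction/smoothness terms."""
    return float(np.sqrt(x * x + eps * eps).mean())

def loss_at_scale(i1, i2_warped, flow, w_pixel=1.0, w_smooth=0.1, w_ssim=0.5):
    # pixel reconstruction loss: the warped second image should match the first
    pixel = charbonnier(i1 - i2_warped)
    # smoothness loss: penalise gradients of both flow components
    smooth = sum(charbonnier(np.diff(flow[c], axis=ax))
                 for c in (0, 1) for ax in (0, 1))
    # structural similarity term: constant stand-in here; a real implementation
    # would compare local patch statistics of i1 and the warped i2
    ssim = 0.0
    return w_pixel * pixel + w_smooth * smooth + w_ssim * ssim

rng = np.random.default_rng(0)
total = 0.0
for size in (16, 8, 4):                      # losses computed on several scales
    i1 = rng.random((size, size))
    i2_warped = rng.random((size, size))
    flow = rng.random((2, size, size))       # (V_x, V_y)
    total += loss_at_scale(i1, i2_warped, flow)
```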
4. The hidden dual-stream collaborative learning network for behavior recognition according to claim 3, wherein the standard pixel reconstruction loss function is:
wherein (Vx(a,b), Vy(a,b)) is the estimated optical flow at the pixel (a, b) in the horizontal and vertical directions, and h and w represent the height and width of the images I1 and I2;
the smoothness loss function is:
wherein ∇xVx and ∇yVx are the gradients of the estimated optical-flow field Vx in each direction, and ∇xVy and ∇yVy are the gradients of the optical-flow field Vy;
the structural similarity loss function is:
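The loss formulas themselves did not survive extraction (they were rendered as images in the original filing). Under the assumption that they follow the standard forms used in unsupervised optical-flow learning, and matching the variable descriptions in claim 4, the three terms would read roughly:

```latex
% Hedged reconstruction -- standard forms, not the filing's exact formulas.
L_{\mathrm{pixel}} = \frac{1}{hw}\sum_{a=1}^{h}\sum_{b=1}^{w}
  \rho\!\left(I_1(a,b) - I_2\!\left(a + V_x(a,b),\, b + V_y(a,b)\right)\right)

L_{\mathrm{smooth}} = \rho(\nabla_x V_x) + \rho(\nabla_y V_x)
  + \rho(\nabla_x V_y) + \rho(\nabla_y V_y)

L_{\mathrm{ssim}} = \frac{1}{N}\sum_{n=1}^{N}
  \left(1 - \mathrm{SSIM}\!\left(I_1^{\,n}, \tilde{I}_1^{\,n}\right)\right)
```

Here ρ is a robust penalty (e.g. Charbonnier), and Ĩ₁ⁿ denotes the n-th local region of the first image reconstructed by warping I2 with the estimated flow.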
5. The hidden dual-stream collaborative learning network for behavior recognition according to any one of claims 1 to 4, wherein the step of optimizing the static and motion features in the collaborative learning model comprises:
(1) initializing the optimization coefficient on the frame features as zs, with all N elements of zs set to 1/N;
(2) repeating the following steps:
(3) merging the frame features into a single vector Os as the video feature;
(4) using Os to optimize the optical-flow features and obtain the optimization coefficient zm on the optical-flow features;
(5) merging the optical-flow features Vm into a single vector as the video feature Om;
(6) using Om to optimize the frame features and obtain the optimization coefficient zs on the frame features;
(7) until the loss function converges;
wherein the frame features are the static features and the optical-flow features are the motion features; 1 is a vector with all elements equal to 1; Os is the video feature merged from the frame features at time t-1; zm is the learned optimization coefficient on the optical-flow features; Om is the video feature merged from the optical flow; and Wm and the associated matrices are weight parameters.
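The formula images accompanying these steps were lost in extraction. A minimal numpy sketch of the alternating scheme the steps describe (normalised coefficients merge per-frame features into a video vector, and each stream's video vector then refines the other stream's coefficients) is below; the plain dot-product scoring stands in for the learned weight parameters Wm and is an assumption, not the claimed formula:

```python
import numpy as np

def attend(features, z):
    """Merge per-item features into one video vector with softmax weights z."""
    w = np.exp(z - z.max())
    w /= w.sum()
    return (features * w[:, None]).sum(axis=0)

rng = np.random.default_rng(1)
N, d = 8, 16
frame_feats = rng.random((N, d))    # static (frame) features
flow_feats = rng.random((N, d))     # motion (optical-flow) features

z_s = np.full(N, 1.0 / N)           # (1) initialise z_s, all N elements 1/N
for _ in range(5):                  # (2)/(7) repeat until the loss converges
    O_s = attend(frame_feats, z_s)  # (3) merge frame features into O_s
    z_m = flow_feats @ O_s          # (4) use O_s to refine flow coefficients
    O_m = attend(flow_feats, z_m)   # (5) merge flow features into O_m
    z_s = frame_feats @ O_m         # (6) use O_m to refine frame coefficients
```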
6. The hidden dual-stream collaborative learning network for behavior recognition according to any one of claims 1 to 4, wherein the step, in the collaborative learning model, of adaptively learning the fusion weight of each video category to obtain the prediction result comprises:
adaptively learning different fusion weights of the static and motion streams for different categories, with the final classification result determined by the highest fusion score.
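Per-category fusion can be sketched as follows; the four categories, the stream scores, and the fusion weights are hypothetical values chosen for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Per-stream class scores for one video (4 hypothetical categories).
static_scores = softmax(np.array([2.0, 0.5, 0.1, 0.2]))   # spatial stream
motion_scores = softmax(np.array([0.3, 1.8, 0.2, 0.1]))   # hidden temporal stream

# One learned fusion weight per category (hypothetical values): categories
# dominated by appearance lean on the static stream, motion-heavy ones on flow.
w = np.array([0.7, 0.3, 0.5, 0.5])
fused = w * static_scores + (1.0 - w) * motion_scores
prediction = int(np.argmax(fused))    # final class = highest fusion score
```

The point of per-category weights, as opposed to a single global ratio, is that appearance-defined actions and motion-defined actions benefit from different static/motion mixes.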
7. A hidden dual-stream collaborative learning method for behavior recognition, characterized by comprising the following steps:
an input video is first decomposed into a frame sequence; the frame sequence is then fed into the spatial stream and the hidden temporal stream of the hidden dual-stream collaborative learning network according to any one of claims 1 to 6, which extract discriminative static features and discriminative motion features, respectively; after the features are obtained, the collaborative learning model optimizes the static and motion features and adaptively learns the fusion weight of each video category, finally yielding the prediction result.
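The claimed method can be sketched end to end with trivial stand-ins for each sub-network: temporal differencing for the flow estimator, flattening for the two CNNs, and a random projection plus equal-weight averaging for the collaborative classifier. All of these stand-ins are illustrative assumptions, not the filing's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
NUM_CLASSES = 3

def spatial_stream(frames):
    # stand-in for the static-feature CNN: one feature vector per frame
    return frames.reshape(len(frames), -1)

def hidden_temporal_stream(frames):
    # flow is estimated inside the network from consecutive frames,
    # then motion "features" are extracted from it (here: flattening)
    flow = frames[1:] - frames[:-1]
    return flow.reshape(len(flow), -1)

def collaborative_fusion(static_f, motion_f):
    s, m = static_f.mean(axis=0), motion_f.mean(axis=0)
    W = rng.random((s.size, NUM_CLASSES))   # toy shared classifier
    return 0.5 * (s @ W) + 0.5 * (m @ W)

frames = rng.random((6, 4, 4))              # video decomposed into 6 frames
scores = collaborative_fusion(spatial_stream(frames),
                              hidden_temporal_stream(frames))
prediction = int(np.argmax(scores))
```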
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911189752.6A CN110889375B (en) | 2019-11-28 | 2019-11-28 | Hidden-double-flow cooperative learning network and method for behavior recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110889375A true CN110889375A (en) | 2020-03-17 |
CN110889375B CN110889375B (en) | 2022-12-20 |
Family
ID=69749221
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911189752.6A Active CN110889375B (en) | 2019-11-28 | 2019-11-28 | Hidden-double-flow cooperative learning network and method for behavior recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110889375B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150365696A1 (en) * | 2014-06-13 | 2015-12-17 | Texas Instruments Incorporated | Optical flow determination using pyramidal block matching |
CN105678216A (en) * | 2015-12-21 | 2016-06-15 | 中国石油大学(华东) | Spatio-temporal data stream video behavior recognition method based on deep learning |
CN107220616A (en) * | 2017-05-25 | 2017-09-29 | 北京大学 | A kind of video classification methods of the two-way Cooperative Study based on adaptive weighting |
CN109325430A (en) * | 2018-09-11 | 2019-02-12 | 北京飞搜科技有限公司 | Real-time Activity recognition method and system |
US20190205629A1 (en) * | 2018-01-04 | 2019-07-04 | Beijing Kuangshi Technology Co., Ltd. | Behavior prediction method, behavior prediction system, and non-transitory recording medium |
CN110188637A (en) * | 2019-05-17 | 2019-08-30 | 西安电子科技大学 | A kind of Activity recognition technical method based on deep learning |
Non-Patent Citations (2)
Title |
---|
LIU Tianliang et al.: "Human Action Recognition Fusing Spatial-Temporal Dual-Network Flow and Visual Attention", Journal of Electronics & Information Technology * |
YANG Miao: "Research on Fine-tuning Algorithms of Two-Stream Convolutional Neural Networks for Action Recognition", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111639548A (en) * | 2020-05-11 | 2020-09-08 | 华南理工大学 | Door-based video context multi-modal perceptual feature optimization method |
CN111582230A (en) * | 2020-05-21 | 2020-08-25 | 电子科技大学 | Video behavior classification method based on space-time characteristics |
CN111931603A (en) * | 2020-07-22 | 2020-11-13 | 北方工业大学 | Human body action recognition system and method based on double-current convolution network of competitive combination network |
CN111931603B (en) * | 2020-07-22 | 2024-01-12 | 北方工业大学 | Human body action recognition system and method of double-flow convolution network based on competitive network |
CN112115788A (en) * | 2020-08-14 | 2020-12-22 | 咪咕文化科技有限公司 | Video motion recognition method and device, electronic equipment and storage medium |
CN112025692A (en) * | 2020-09-01 | 2020-12-04 | 广东工业大学 | Control method and device for self-learning robot and electronic equipment |
CN112347996A (en) * | 2020-11-30 | 2021-02-09 | 上海眼控科技股份有限公司 | Scene state judgment method, device, equipment and storage medium |
CN114627397A (en) * | 2020-12-10 | 2022-06-14 | 顺丰科技有限公司 | Behavior recognition model construction method and behavior recognition method |
CN114821760A (en) * | 2021-01-27 | 2022-07-29 | 四川大学 | Human body abnormal behavior detection method based on double-flow space-time automatic coding machine |
CN114821760B (en) * | 2021-01-27 | 2023-10-27 | 四川大学 | Human body abnormal behavior detection method based on double-flow space-time automatic encoder |
CN112767645A (en) * | 2021-02-02 | 2021-05-07 | 南京恩博科技有限公司 | Smoke identification method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110889375B (en) | 2022-12-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110889375B (en) | Hidden-double-flow cooperative learning network and method for behavior recognition | |
CN109711316B (en) | Pedestrian re-identification method, device, equipment and storage medium | |
Sun et al. | Lattice long short-term memory for human action recognition | |
CN107679491B (en) | 3D convolutional neural network sign language recognition method fusing multimodal data | |
CN105095862B (en) | A kind of human motion recognition method based on depth convolution condition random field | |
CN112149459B (en) | Video saliency object detection model and system based on cross attention mechanism | |
CN109919122A (en) | A kind of timing behavioral value method based on 3D human body key point | |
CN110334589B (en) | High-time-sequence 3D neural network action identification method based on hole convolution | |
CN110120064B (en) | Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning | |
CN112926396A (en) | Action identification method based on double-current convolution attention | |
CN110378208B (en) | Behavior identification method based on deep residual error network | |
CN111260738A (en) | Multi-scale target tracking method based on relevant filtering and self-adaptive feature fusion | |
CN111931603B (en) | Human body action recognition system and method of double-flow convolution network based on competitive network | |
CN106650617A (en) | Pedestrian abnormity identification method based on probabilistic latent semantic analysis | |
CN114782977B (en) | Pedestrian re-recognition guiding method based on topology information and affinity information | |
CN113963032A (en) | Twin network structure target tracking method fusing target re-identification | |
CN114998601B (en) | On-line update target tracking method and system based on Transformer | |
CN112232134A (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
Yi et al. | Human action recognition based on action relevance weighted encoding | |
CN116168329A (en) | Video motion detection method, equipment and medium based on key frame screening pixel block | |
CN114708649A (en) | Behavior identification method based on integrated learning method and time attention diagram convolution | |
CN115410131A (en) | Method for intelligently classifying short videos | |
Li et al. | Robust foreground segmentation based on two effective background models | |
CN111027472A (en) | Video identification method based on fusion of video optical flow and image space feature weight | |
CN112528077B (en) | Video face retrieval method and system based on video embedding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||