CN108764019A - Video event detection method based on multi-source deep learning - Google Patents
Video event detection method based on multi-source deep learning Download PDF Info
- Publication number
- CN108764019A (application CN201810290777.4A)
- Authority
- CN
- China
- Prior art keywords
- video
- gate
- output
- indicate
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video event detection method based on multi-source deep learning, comprising the following steps: convert each frame of the video to grayscale and difference consecutive frames to obtain a three-dimensional array; input the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, and obtain image features with the C3D algorithm, yielding 4096-dimensional feature vectors. Take the RGB image of the first frame of the original video as input, extract picture features with a CNN network, and use the output of the fc7 layer as the feature to obtain a 2048-dimensional feature vector. Concatenate the feature vectors to obtain a 6124-dimensional vector; after all training videos have been processed and converted into vectors, apply dimensionality reduction to convert the 6124-dimensional data into 256-dimensional data. Once each video to be processed has been mapped to 256-dimensional data, the model training stage begins; unknown samples are then tested with the trained model. The present invention greatly improves the efficiency and accuracy of video surveillance information processing.
Description
Technical field
The present invention relates to the field of video event detection, and in particular to a video event detection method based on multi-source deep learning.
Background technology
Video surveillance is applied in an increasingly wide range of fields such as public safety and commerce. Because it is intuitive, accurate, timely, and rich in information, it is used in many settings. In recent years, with the rapid development of computers, networks, and image processing and transmission technology, video surveillance technology has also made significant progress. However, current video surveillance processing depends on manual labor; it is costly, and its processing speed and efficiency are low.
A variety of methods have been proposed to detect video events. First, markerless vision-based human motion analysis can provide a cheap, unobtrusive way to estimate human body posture, and is therefore widely used in motion analysis. Fujiyoshi et al. (reference [1]) proposed a "star" skeletonization procedure to analyze target motion. Second, action or collective activity recognition can reveal the existence of group events in video.
The prior art has also proposed a video action recognition method based on early shallow high-dimensional encoding of local spatio-temporal features. Sparse spatio-temporal interest points can be described with local spatio-temporal features, including histograms of oriented gradients (HOG) and histograms of optical flow (HOF) (reference [2]). These features are then encoded into bag-of-features (BoF) descriptors (reference [3]), and a support vector machine is used for the classification task. In addition, a large amount of related work over the last decade has been devoted to the study of group activity in video. The overwhelming majority of earlier work used hand-crafted features to describe individual spatio-temporal state (reference [4]). Lan et al. (reference [5]) proposed a model that represents hierarchical and interaction-level relationships from individual people up to higher-level groups, and that can adaptively learn the latent structure.
Recently, multi-task learning methods have been applied to human social-role recognition. Among them, Liu et al. (reference [6]) proposed hierarchical clustering multi-task learning to jointly group and recognize human actions. Video summarization is another approach used for visual understanding and presentation; several methods can generate a summary from a long video. One representative approach generates a synopsis from an object and its activity appearing in the video over different time periods. Pritch et al. (reference [7]) also proposed a new method that can form short, coherent video synopses from clusters of similar events or activities. Another approach generates text-based summaries. Chu et al. (reference [8]) proposed a multimedia analysis framework that processes video and text simultaneously, jointly building relationships between entities through a scene graph to understand events (reference [9]).
Most current methods need to handle several challenging visual analysis tasks. Lee (reference [10]) proposed an effective Gaussian mixture learning method for video background removal. Dai et al. (reference [11]) proposed a robust R-FCN object detection network. Although existing methods have shown effectiveness on some aspects of the problem, automatically understanding surveillance video still faces many challenges and limitations. The main challenges come from two aspects: the complexity of the data and the processing methods. On the data side, the main challenges are low resolution, large data volume, complex event sets and scenes, and occlusion in the data sources.
On the method side, the main limitations are:
1) Many methods depend on foreground-background segmentation, a technique that accumulates errors.
2) Many methods depend on detection and tracking; however, for different videos and moving objects, the robustness of detection and tracking is relatively low, and these shortcomings reduce the efficiency of temporal analysis.
3) As data volume increases, the amount of computation grows substantially.
4) Most real-life event detection is a multi-label problem: especially in surveillance video, multiple events can occur simultaneously. However, action recognition and group activity recognition are both single-label event detection methods, so both lose events that occur at the same time.
5) Most methods use only one feature extraction method; the features are relatively uniform, which easily leads to inaccurate characterization of the underlying data.
Invention content
The present invention provides a video event detection method based on multi-source deep learning. The invention processes video information based on multi-source information, and the original model can be optimized with new data. Because C3D features are fused with traditional CNN features, C3D features can be added while keeping the advantage that mature existing features are easy to extract, so the video is characterized more accurately and the efficiency and accuracy of video surveillance information processing are greatly improved, as described below:
A video event detection method based on multi-source deep learning, the method comprising the following steps:
Convert each frame of the video to grayscale and subtract adjacent video frames to obtain a three-dimensional array; input the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, and obtain image features with the C3D algorithm, yielding 4096-dimensional feature vectors.
Take the RGB image of the first frame of the original video as input, extract picture features with a CNN network, and use the output of the fc7 layer as the feature to obtain a 2048-dimensional feature vector.
Concatenate the feature vectors to obtain a 6124-dimensional vector; after all training videos have been processed and converted into vectors, apply dimensionality reduction to convert the 6124-dimensional data into 256-dimensional data.
Once each video to be processed has been mapped to 256-dimensional data, the model training stage begins; unknown samples are tested with the trained model.
Converting each frame of the video to grayscale and differencing frames to obtain a three-dimensional array is specifically:
For a given image sequence x = {x1, x2, ..., xn} with corresponding label set y = {y1, y2, ..., ym}, first convert each frame of the video to grayscale; then subtract the first frame from the second, the second from the third, and so on, thereby converting the original video data from one three-dimensional array into another three-dimensional array.
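The grayscale-and-difference preprocessing can be sketched as follows. This is an illustrative NumPy reconstruction, not the patent's actual code; the patent does not specify a grayscale formula, so the common ITU-R 601 luma weights are assumed here.

```python
import numpy as np

def frames_to_diff_volume(frames):
    """Convert an RGB video (n, H, W, 3) to a differenced grayscale volume.

    Each frame is reduced to grayscale, then consecutive frames are
    subtracted (frame 2 minus frame 1, frame 3 minus frame 2, ...),
    turning one three-dimensional array into another, as in the text.
    """
    # Assumed luma weights; the patent leaves the grayscale method open.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frames.astype(np.float32) @ weights        # (n, H, W)
    return gray[1:] - gray[:-1]                       # (n-1, H, W)

# A 10-frame toy video yields a 9-frame difference volume.
video = np.random.randint(0, 256, size=(10, 32, 32, 3))
volume = frames_to_diff_volume(video)
```

Note that a static scene produces an all-zero volume, so the differenced array carries only motion information, which is what the C3D stream consumes.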
The C3D network is specifically:
Divide all frames of the video image sequence x into groups of 8 frames; every 8 frames, the fc7 layer of the C3D network outputs one set of data as a feature extraction result, yielding k 4096-dimensional feature vectors. The fc7 layer data of the network structure is finally output as the feature, giving 4096-dimensional feature vectors.
The CNN network structure is specifically: a network in which convolutional layers, fully connected layers, and pooling layers are interconnected, as shown in Fig. 5.
The model training stage specifically includes:
including the labels of the actions to be recognized and the <BOS> tag in the candidate words;
learning by maximizing a categorical exponential function and predicting output weights with the maximum likelihood method;
computing the loss of the softmax layer and propagating it in the LSTM.
The method further includes:
In the decoding stage, building the probability distribution of a hidden state by the following formula:
where p(yt | hn+t-1, yt-1) is obtained with a softmax function over the whole vocabulary, ht is the hidden state at step t, and the output at step t depends not only on the hidden state but also on the output of the previous step t-1.
Testing unknown samples with the trained model is specifically:
Starting with the <BOS> tag, all hidden state values ht of the obtained video image sequence x are input into a second LSTM network, and the following variables are computed separately:
f1t = σ(W1xf zt + W1zf zt-1 + b1f)
i1t = σ(W1xi ht + W1zi zt-1 + b1i)
g1t = tanh(W1xg ht + W1zg zt-1 + b1g)
c1t = f1t ⊙ c1(t-1) + i1t ⊙ g1t
o1t = σ(W1of ht + W1zo zt-1 + b1o)
zt = o1t ⊙ tanh(c1t)
where σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ is element-wise multiplication; W1xf is the weight matrix between the input and the forget gate output, W1zf is the weight matrix between the output value and the forget gate output, W1xi is the weight matrix between the input and the input gate output, W1zi is the weight matrix between the output value and the input gate output, W1xg is the weight matrix between the input and the candidate value for the new c1t, W1zg is the weight matrix between the output value and the candidate value for the new c1t, W1of is the weight matrix between the input and the output gate output, and W1zo is the weight matrix between the output value and the output gate output; b1f, b1i, b1g, and b1o are the bias terms of the forget gate, the input gate, the candidate value, and the output gate respectively; f1t is the forget gate output, i1t the input gate output, o1t the output gate output, c1t the cell state value, zt the output value, and g1t the candidate value for the new c1t.
The advantageous effects of the technical solution provided by the invention are:
1. The method reports the multiple events produced by video surveillance, avoiding the object monitoring and tracking process;
2. The method handles the image information and action information of the video, i.e. the spatial and temporal information, separately in different deep networks, achieving multi-source data fusion;
3. The method designs a completely new two-stream network architecture for video surveillance processing, improving processing efficiency and performance.
Description of the drawings
Fig. 1 is the flow chart of the video event detection method based on multi-source deep learning;
Fig. 2 is the bilateral LSTM network structure;
Fig. 3 is the flow diagram of C3D convolution;
Fig. 4 is the schematic network structure of C3D;
Fig. 5 is the schematic network structure of the CNN network that extracts picture features.
Specific implementation mode
To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below.
Embodiment 1
An embodiment of the present invention provides a video event detection method based on multi-source deep learning. Referring to Fig. 1, the detection method includes the following steps:
101: Convert each frame of the video to grayscale and difference frames to obtain a three-dimensional array; input the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, and obtain image features with the C3D algorithm, yielding 4096-dimensional feature vectors;
102: Take the RGB image of the first frame of the original video as input, extract picture features with a CNN network, and use the output of the fc7 layer as the feature to obtain a 2048-dimensional feature vector;
103: Concatenate the feature vectors to obtain a 6124-dimensional vector; after all training videos have been processed and converted into vectors, apply dimensionality reduction to convert the 6124-dimensional data into 256-dimensional data;
104: Once each video to be processed has been mapped to 256-dimensional data, the model training stage begins; unknown samples are tested with the trained model.
In summary, the embodiment of the present invention processes video information based on multi-source information, can optimize the original model with new data, and greatly improves the efficiency and accuracy of video surveillance information processing.
Embodiment 2
The scheme of Embodiment 1 is further described below with reference to specific calculation formulas and examples, as detailed in the following description:
C3D is a work from Facebook that builds its network with 3D convolution and 3D pooling. Through 3D convolution, C3D can process video directly (that is, a volume of video frames). The greatest advantage of C3D is its speed: 314 fps, and that figure was measured on a graphics card from two years earlier; with an Nvidia 1080 it can exceed 600 fps. The efficiency of C3D is therefore significantly higher than that of other methods.
201: First, the C3D algorithm is used to extract video features;
Convolutional neural networks (CNNs) have been widely used in computer vision in recent years for tasks such as classification, detection, and segmentation. These tasks are usually carried out on images using two-dimensional convolution (i.e., the convolution kernel is two-dimensional). For video analysis problems, however, 2D convolution cannot capture temporal information well, so 3D convolution was proposed.
C3D is an improvement on the original convolutional neural network algorithm: where each convolutional layer of the original network used a two-dimensional kernel for convolution, the C3D algorithm uses a three-dimensional kernel, which makes it possible to process multi-frame data. The network structure used is shown in Fig. 4. The input video is divided into groups of 8 frames, which are then input to the C3D algorithm to obtain multiple 4096-dimensional vectors.
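The difference between 2D and 3D convolution can be made concrete with a naive sketch: the kernel below slides over time as well as height and width, which is what lets a 3D convolution respond to motion across frames. This is an illustrative implementation, not the actual C3D code, and (as is conventional in CNN literature) "convolution" here is cross-correlation.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution of a (T, H, W) volume with a (t, h, w) kernel."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # The kernel covers a spatio-temporal block, not just a spatial patch.
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

response = conv3d_valid(np.ones((8, 6, 6)), np.ones((3, 3, 3)))
```

A 2D kernel applied frame by frame could never mix information from adjacent frames; the temporal extent t of the kernel is exactly what C3D adds.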
202: For a given image sequence x = {x1, x2, ..., xn} with corresponding label set y = {y1, y2, ..., ym}, first convert each frame of the video to grayscale; then subtract the first frame from the second, the second from the third, and so on. In this way the original video data is converted from one three-dimensional array into another three-dimensional array;
203: Take the transformed three-dimensional array obtained in the above step as input, feed it into the C3D network shown in Fig. 4, and perform feature extraction with the Sport1M model pre-trained for 1.9 million iterations on other datasets, obtaining image features with the C3D algorithm;
Referring to Fig. 4, this step is specifically: for each input video image sequence x = {x1, x2, ..., xt, ..., xn}, where x1, x2, ..., xt, ..., xn correspond respectively to the 1st, 2nd, ..., t-th, ..., n-th frame images of the sequence, divide all frames of the sequence into groups of 8 frames. Every 8 frames, the fc7 layer of the C3D network outputs one set of data as a feature extraction result, yielding k 4096-dimensional feature vectors, where k is n ÷ 8 rounded down. The fc7 layer data of the network structure is finally output as the feature, giving 4096-dimensional feature vectors.
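The 8-frame grouping with k = ⌊n ÷ 8⌋ can be sketched as follows. This is illustrative; the patent does not say how a trailing partial group is handled, so the remainder frames are simply dropped here.

```python
import numpy as np

def clip_groups(volume, clip_len=8):
    """Split a (n, H, W) frame volume into k = n // clip_len clips,
    each of clip_len frames, to be fed to the C3D front end one by one.
    Frames beyond the last full group are discarded (an assumption)."""
    k = volume.shape[0] // clip_len
    return volume[:k * clip_len].reshape(k, clip_len, *volume.shape[1:])

clips = clip_groups(np.zeros((20, 16, 16)))   # n = 20 frames → k = 2 clips
```

Each of the k clips would then produce one 4096-dimensional fc7 vector.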
204: Take the RGB image of the first frame of the original video as input, extract picture features with the CNN network shown in Fig. 5, and use the output of the fc7 layer as the feature to obtain a 2048-dimensional feature vector;
Referring to Fig. 5, this network structure is specifically: a network in which convolutional layers, fully connected layers, and pooling layers are interconnected.
205: Concatenate the feature vectors extracted in the two steps above to obtain a 6124-dimensional vector. After all training videos have been processed in this way and fully converted into vectors, apply dimensionality reduction to convert the 6124-dimensional data into 256-dimensional data;
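Step 205 can be sketched as follows. The patent does not name the dimensionality-reduction technique, so a PCA via SVD is assumed here purely for illustration, and the feature widths are parameters rather than hard-coded values.

```python
import numpy as np

def fuse_and_reduce(c3d_feats, cnn_feats, out_dim=8):
    """Concatenate per-video C3D and CNN features, then project to out_dim
    dimensions with PCA (an assumption; the patent only says
    'dimensionality reduction'). Rows are videos, columns are features."""
    fused = np.concatenate([c3d_feats, cnn_feats], axis=1)
    centered = fused - fused.mean(axis=0)
    # Principal directions from the SVD of the centered data matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dim].T

rng = np.random.default_rng(0)
# Toy widths stand in for the patent's 4096 + 2048 → 256 pipeline.
reduced = fuse_and_reduce(rng.normal(size=(30, 40)), rng.normal(size=(30, 20)), out_dim=8)
```

Fitting the projection on all training videos at once, as the text requires, is what makes every video map into the same reduced space.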
206: Once each video to be processed has been mapped to 256-dimensional data by step 205, the model training stage begins;
In the decoding training stage, suppose there are n actions to be recognized; the labels take the form 1 to n, and these n labels are also called candidate words. During data preprocessing, the actions that occur in a video are connected into a sentence (e.g., "1, 3, 6" means that actions 1, 3, and 6 occur in the video in turn). In addition, the candidate words generally need a start marker, the so-called <BOS> tag, so <BOS> is also included among the candidate words (<BOS>, 1, 3, 6).
Then, by maximizing a categorical exponential function, the framework attempts to obtain through the maximum likelihood method the probability distribution that maximizes θmax in formula (5), and then predicts output weights from the hidden state obtained by the probability distribution in formula (2) and the output at the previous moment (at t = 0 the predicted output can be an arbitrary initial value).
Finally, the loss of the softmax layer is computed based on formula (4) and propagated in the LSTM. The LSTM is an upgraded variant of the RNN (recurrent neural network); it allows previous information to be connected with the current task, so that the computer can handle problems whose inputs and outputs are sequences of different lengths. For the input xt at time t, the following variables can be computed successively from ht-1 and xt by the formulas below: (1)
ft = σ(Wxf xt + Whf ht-1 + bf)
it = σ(Wxi xt + Whi ht-1 + bi)
gt = tanh(Wxg xt + Whg ht-1 + bg)
ct = ft ⊙ ct-1 + it ⊙ gt
ot = σ(Wof xt + Who ht-1 + bo)
ht = ot ⊙ tanh(ct)
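The six equations of formula (1) are the standard LSTM cell update, and transcribe directly into NumPy. The sketch below is illustrative: small random weights stand in for the trained parameters, and the dictionary layout is just one convenient way to hold the per-gate matrices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of formula (1): W holds input weights, U hidden-state
    weights, b biases, each keyed by gate ('f', 'i', 'g', 'o')."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate value
    c = f * c_prev + i * g                                 # new cell state
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    h = o * np.tanh(c)                                     # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
W = {k: rng.normal(size=(d_hid, d_in)) for k in 'figo'}
U = {k: rng.normal(size=(d_hid, d_hid)) for k in 'figo'}
b = {k: np.zeros(d_hid) for k in 'figo'}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), W, U, b)
```

Because o lies in (0, 1) and tanh is bounded, every component of the hidden state h stays strictly inside (-1, 1).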
where σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ is element-wise multiplication; Wxf is the weight matrix between the input and the forget gate output, Whf is the weight matrix between the hidden state value and the forget gate output, Wxi is the weight matrix between the input and the input gate output, Whi is the weight matrix between the hidden state value and the input gate output, Wxg is the weight matrix between the input and the candidate value for the new ct, Whg is the weight matrix between the hidden state value and the candidate value for the new ct, Wof is the weight matrix between the input and the output gate output, and Who is the weight matrix between the hidden state value and the output gate output; bf, bi, bg, and bo are the bias terms of the forget gate, the input gate, the candidate value, and the output gate respectively; ft is the forget gate output, it the input gate output, ot the output gate output, ct the cell state value, ht the hidden state value, and gt the candidate value for the new ct.
The LSTM is also a tool for the encoder-decoder framework. Suppose the input is x = {x1, x2, ..., xn} and the output is y = {y1, y2, ..., ym}. In the encoding stage, x is encoded and the variables above are computed by the LSTM. In the decoding stage, the probability distribution of a hidden state is built by a formula of the authors' own design:
where p(yt | hn+t-1, yt-1) is obtained with a softmax function over the whole vocabulary of words (labels), and ht is the hidden state at step t, computed by formula (1). The output at step t depends not only on the hidden state but also on the output of the previous step t-1.
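The decoding formula itself is not reproduced in the text, but the softmax it refers to is standard; a minimal sketch of a distribution over the label vocabulary:

```python
import numpy as np

def label_distribution(logits):
    """Softmax over the whole label vocabulary, as used for
    p(y_t | h_{n+t-1}, y_{t-1}); shifted by the max for numerical stability."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

probs = label_distribution(np.array([2.0, 1.0, 0.1]))
```

The logits would come from a linear map of the decoder state; how that map is parameterized is left open by the text.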
207: Unknown samples are tested with the model trained in step 206.
In the test phase, the input video is processed in the same way as in the training steps above to obtain a 256-dimensional feature, and decoding begins. Starting with the <BOS> tag, all hidden state values ht, t = 1, 2, ..., n, of the video image sequence x obtained in step 206 are input into the second LSTM network exclusive to this method, and the variables f1t, i1t, g1t, c1t, o1t, and zt are computed separately, as given above:
where σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ is element-wise multiplication; W1xf is the weight matrix between the input and the forget gate output, W1zf is the weight matrix between the output value and the forget gate output, W1xi is the weight matrix between the input and the input gate output, W1zi is the weight matrix between the output value and the input gate output, W1xg is the weight matrix between the input and the candidate value for the new c1t, W1zg is the weight matrix between the output value and the candidate value for the new c1t, W1of is the weight matrix between the input and the output gate output, and W1zo is the weight matrix between the output value and the output gate output; b1f, b1i, b1g, and b1o are the bias terms of the forget gate, the input gate, the candidate value, and the output gate respectively; f1t is the forget gate output, i1t the input gate output, o1t the output gate output, c1t the cell state value, zt the output value, and g1t the candidate value for the new c1t.
The output zt gives a score for each label in the vocabulary at each step. Mean pooling is then performed for each label, and this value can be used to indicate the likelihood that each event occurs in the segment. The sigmoid function in formula (6) produces a probability distribution giving the probability that each action occurs in the corresponding video; the result is then compared against a set threshold (generally 0.5) to decide whether the action occurs in the video (label 0 or 1). Finally, the labels of all potential actions are integrated to obtain the prediction result for the whole segment of video.
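The mean pooling, sigmoid, and 0.5 threshold described above can be sketched as follows; this is illustrative, with `scores` standing for the per-step label scores taken from zt.

```python
import numpy as np

def predict_labels(scores, threshold=0.5):
    """Multi-label prediction for one video: mean-pool the (T, L) per-step
    label scores over time, squash with a sigmoid, and threshold."""
    pooled = scores.mean(axis=0)                 # one score per label
    probs = 1.0 / (1.0 + np.exp(-pooled))        # occurrence probability
    return (probs >= threshold).astype(int)      # 1 = action occurs in video

# Two time steps; label 0 scores high, label 1 scores low.
pred = predict_labels(np.array([[4.0, -4.0], [6.0, -2.0]]))
```

Because each label is thresholded independently, several labels can be 1 at once, which is exactly the multi-label behavior single-label action recognition methods lack.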
In summary, the embodiment of the present invention processes video information based on multi-source information, can optimize the original model with new data, and greatly improves the efficiency and accuracy of video surveillance information processing.
Bibliography
[1]Gutchess D,Trajkovics M,Cohen-Solal E,et al.A background model
initialization algorithm for video surveillance[C]//Computer Vision,2001.ICCV
2001.Proceedings.Eighth IEEE International Conference on.IEEE,2001,1:733-740.
[2]Fan C,Crandall D J.Deepdiary:Automatically captioning lifelogging
image streams[C]//European Conference on Computer Vision.Springer
International Publishing,2016:459-473.
[3]Lazebnik S,Schmid C,Ponce J.Beyond bags of features:Spatial
pyramid matching for recognizing natural scene categories[C]//Computer vision
and pattern recognition,2006IEEE computer society conference on.IEEE,2006,2:
2169-2178.
[4]Ibrahim M S,Muralidharan S,Deng Z,et al.A hierarchical deep
temporal model for group activity recognition[C]//Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition.2016:1971-1980.
[5]Lan T,Wang Y,Yang W,et al.Discriminative latent models for
recognizing contextual group activities[J].IEEE Transactions on Pattern
Analysis and Machine Intelligence,2012,34(8):1549-1562.
[6]Liu A A,Su Y T,Nie W Z,et al.Hierarchical clustering multi-task
learning for joint human action grouping and recognition[J].IEEE transactions
on pattern analysis and machine intelligence,2017,39(1):102-114.
[7]Pritch Y,Ratovitch S,Hendel A,et al.Clustered synopsis of
surveillance video[C]//Advanced Video and Signal Based Surveillance,
2009.AVSS'09.Sixth IEEE International Conference on.IEEE,2009:195-200.
[8]Tu K,Meng M,Lee M W,et al.Joint video and text parsing for
understanding events and answering queries[J].IEEE MultiMedia,2014,21(2):42-
70.
[9]He X,Gao M,Kan M Y,et al.Birank:Towards ranking on bipartite
graphs[J].IEEE Transactions on Knowledge and Data Engineering,2017,29(1):57-
71.
[10]Hochreiter S,Schmidhuber J.LSTM can solve hard long time lag
problems[C]//Advances in neural information processing systems.1997:473-479.
[11]Venugopalan S,Rohrbach M,Donahue J,et al.Sequence to Sequence--
Video to Text[C]//IEEE International Conference on Computer Vision.IEEE,2015:
4534-4542.
Except where otherwise specified, the embodiments of the present invention do not limit the models of the devices used; any device that can perform the above functions may be used.
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
The foregoing is merely preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (7)
1. A video event detection method based on multi-source deep learning, characterized in that the method comprises the following steps:
converting each frame of the video to grayscale and differencing frames to obtain a three-dimensional array; inputting the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, and obtaining image features with the C3D algorithm, yielding 4096-dimensional feature vectors;
taking the RGB image of the first frame of the original video as input, extracting picture features with a CNN network, and using the output of the fc7 layer as the feature to obtain a 2048-dimensional feature vector;
concatenating the feature vectors to obtain a 6124-dimensional vector; after all training videos have been processed and converted into vectors, applying dimensionality reduction to convert the 6124-dimensional data into 256-dimensional data;
once each video to be processed has been mapped to 256-dimensional data, starting the model training stage, and testing unknown samples with the trained model.
2. The video event detection method based on multi-source deep learning according to claim 1, characterized in that converting each frame of the video to grayscale and differencing frames to obtain a three-dimensional array is specifically:
for a given image sequence x = {x1, x2, ..., xn} with corresponding label set y = {y1, y2, ..., ym}, first converting each frame of the video to grayscale, then subtracting the first frame from the second, the second from the third, and so on, thereby converting the original video data from one three-dimensional array into another three-dimensional array.
3. The video event detection method based on multi-source deep learning according to claim 1, characterized in that the C3D network is specifically:
dividing all frames of the video image sequence x into groups of 8 frames, where every 8 frames the fc7 layer of the C3D network outputs one set of data as a feature extraction result, yielding k 4096-dimensional feature vectors; the fc7 layer data of the network structure is finally output as the feature, giving 4096-dimensional feature vectors.
4. The video event detection method based on multi-source deep learning according to claim 1, characterized in that the CNN network structure is specifically: a network in which convolutional layers, fully connected layers, and pooling layers are interconnected.
5. The video event detection method based on multi-source deep learning according to claim 1, characterized in that the model training stage specifically includes:
including the labels of the actions to be recognized and the <BOS> tag in the candidate words;
learning by maximizing a categorical exponential function and predicting output weights with the maximum likelihood method;
computing the loss of the softmax layer and propagating it in the LSTM.
6. The video event detection method based on multi-source deep learning according to claim 1, wherein the method further comprises:
in the decoding stage, building the probability distribution over the output sequence by the following formula:
p(y1, ..., ym | h1, ..., hn) = ∏t=1..m p(yt | hn+t-1, yt-1)
wherein p(yt | hn+t-1, yt-1) is obtained from a softmax function over the whole vocabulary, and ht is the hidden state at step t; the output at step t depends not only on the hidden state but also on the output of the previous step t-1.
7. The video event detection method based on multi-source deep learning according to claim 1, wherein testing unknown samples using the trained model is specifically:
starting with the <BOS> tag, input all hidden-state values ht of the obtained video image sequence x into a second LSTM network, and compute the following variables:
f1t = σ(W1xf ht + W1zf zt-1 + b1f)
i1t = σ(W1xi ht + W1zi zt-1 + b1i)
g1t = tanh(W1xg ht + W1zg zt-1 + b1g)
c1t = f1t ⊙ c1(t-1) + i1t ⊙ g1t
o1t = σ(W1of ht + W1zo zt-1 + b1o)
zt = o1t ⊙ tanh(c1t)
wherein σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ is element-wise multiplication;
W1xf denotes the weight matrix between the input and the forget gate, W1zf the weight matrix between the output value and the forget gate, W1xi the weight matrix between the input and the input gate, W1zi the weight matrix between the output value and the input gate, W1xg the weight matrix between the input and the new candidate value for c1t, W1zg the weight matrix between the output value and the new candidate value for c1t, W1of the weight matrix between the input and the output gate, and W1zo the weight matrix between the output value and the output gate; b1f, b1i, b1g and b1o denote the bias terms of the forget gate, the input gate, the candidate-value computation and the output gate, respectively; f1t is the forget-gate output, i1t the input-gate output, o1t the output-gate output, c1t the cell-state value, zt the output value, and g1t the new candidate value for c1t.
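The gate equations above can be sketched as one NumPy step (random weights and a 3-dimensional state for illustration; the dictionary keys are shorthand for the claim's W1·· matrices and b1· biases):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_t, z_prev, c_prev, p):
    """One step of the second LSTM, following the gate equations in the claim."""
    f = sigmoid(p["Wxf"] @ h_t + p["Wzf"] @ z_prev + p["bf"])   # forget gate f1t
    i = sigmoid(p["Wxi"] @ h_t + p["Wzi"] @ z_prev + p["bi"])   # input gate i1t
    g = np.tanh(p["Wxg"] @ h_t + p["Wzg"] @ z_prev + p["bg"])   # candidate g1t
    c = f * c_prev + i * g                                       # cell state c1t
    o = sigmoid(p["Wof"] @ h_t + p["Wzo"] @ z_prev + p["bo"])   # output gate o1t
    z = o * np.tanh(c)                                           # output value zt
    return z, c

d = 3
rng = np.random.default_rng(2)
p = {k: rng.normal(size=(d, d)) for k in ("Wxf", "Wzf", "Wxi", "Wzi",
                                          "Wxg", "Wzg", "Wof", "Wzo")}
p.update({k: np.zeros(d) for k in ("bf", "bi", "bg", "bo")})
z, c = lstm_step(rng.normal(size=d), np.zeros(d), np.zeros(d), p)
print(z.shape, c.shape)  # (3,) (3,)
```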
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810290777.4A CN108764019A (en) | 2018-04-03 | 2018-04-03 | A kind of Video Events detection method based on multi-source deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108764019A true CN108764019A (en) | 2018-11-06 |
Family
ID=63981088
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810290777.4A Pending CN108764019A (en) | 2018-04-03 | 2018-04-03 | A kind of Video Events detection method based on multi-source deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108764019A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107066973A (en) * | 2017-04-17 | 2017-08-18 | 杭州电子科技大学 | A kind of video content description method of utilization spatio-temporal attention model |
CN107818307A (en) * | 2017-10-31 | 2018-03-20 | 天津大学 | A kind of multi-tag Video Events detection method based on LSTM networks |
CN107832708A (en) * | 2017-11-09 | 2018-03-23 | 云丁网络技术(北京)有限公司 | A kind of human motion recognition method and device |
Non-Patent Citations (3)
Title |
---|
AN-AN LIU ET AL.: "LSTM-based multi-label video event detection", Springer |
DU TRAN ET AL.: "Learning Spatiotemporal Features with 3D Convolutional Networks", 2015 IEEE International Conference on Computer Vision (ICCV) |
YINGWEI PAN ET AL.: "Jointly Modeling Embedding and Translation to Bridge Video and Language", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112868019A (en) * | 2018-11-14 | 2021-05-28 | 北京比特大陆科技有限公司 | Feature processing method and device, storage medium and program product |
CN109697852A (en) * | 2019-01-23 | 2019-04-30 | 吉林大学 | Urban road congestion degree prediction technique based on timing traffic events |
CN110443182A (en) * | 2019-07-30 | 2019-11-12 | 深圳市博铭维智能科技有限公司 | A kind of urban discharging pipeline video abnormality detection method based on more case-based learnings |
CN110738128A (en) * | 2019-09-19 | 2020-01-31 | 天津大学 | repeated video detection method based on deep learning |
CN112668366A (en) * | 2019-10-15 | 2021-04-16 | 华为技术有限公司 | Image recognition method, image recognition device, computer-readable storage medium and chip |
CN112668366B (en) * | 2019-10-15 | 2024-04-26 | 华为云计算技术有限公司 | Image recognition method, device, computer readable storage medium and chip |
CN110826702A (en) * | 2019-11-18 | 2020-02-21 | 方玉明 | Abnormal event detection method for multitask deep network |
CN111680660B (en) * | 2020-06-17 | 2023-03-24 | 郑州大学 | Human behavior detection method based on multi-source heterogeneous data stream |
CN111680660A (en) * | 2020-06-17 | 2020-09-18 | 郑州大学 | Human behavior detection method based on multi-source heterogeneous data stream |
CN111814644A (en) * | 2020-07-01 | 2020-10-23 | 重庆邮电大学 | Video abnormal event detection method based on disturbance visual interpretation |
CN111814644B (en) * | 2020-07-01 | 2022-05-03 | 重庆邮电大学 | Video abnormal event detection method based on disturbance visual interpretation |
WO2023206532A1 (en) * | 2022-04-29 | 2023-11-02 | Oppo广东移动通信有限公司 | Prediction method and apparatus, electronic device and computer-readable storage medium |
CN116778395A (en) * | 2023-08-21 | 2023-09-19 | 成都理工大学 | Mountain torrent flood video identification monitoring method based on deep learning |
CN116778395B (en) * | 2023-08-21 | 2023-10-24 | 成都理工大学 | Mountain torrent flood video identification monitoring method based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108764019A (en) | A kind of Video Events detection method based on multi-source deep learning | |
Wang et al. | Depth pooling based large-scale 3-d action recognition with convolutional neural networks | |
Mou et al. | Vehicle instance segmentation from aerial image and video using a multitask learning residual fully convolutional network | |
Wang et al. | Multiscale visual attention networks for object detection in VHR remote sensing images | |
CN111709311B (en) | Pedestrian re-identification method based on multi-scale convolution feature fusion | |
Rajan et al. | Novel deep learning model for facial expression recognition based on maximum boosted CNN and LSTM | |
Wang et al. | Large-scale continuous gesture recognition using convolutional neural networks | |
Asif et al. | A multi-modal, discriminative and spatially invariant CNN for RGB-D object labeling | |
CN107818307B (en) | Multi-label video event detection method based on LSTM network | |
Gu et al. | Multiple stream deep learning model for human action recognition | |
CN110610210B (en) | Multi-target detection method | |
Takahashi et al. | Expandable YOLO: 3D object detection from RGB-D images | |
Shi et al. | Shuffle-invariant network for action recognition in videos | |
CN107169117A (en) | A kind of manual draw human motion search method based on autocoder and DTW | |
Hussain et al. | AI-driven behavior biometrics framework for robust human activity recognition in surveillance systems | |
Özbay et al. | 3D Human Activity Classification with 3D Zernike Moment Based Convolutional, LSTM-Deep Neural Networks. | |
CN114187546B (en) | Combined action recognition method and system | |
Li et al. | Recent advances on application of deep learning for recovering object pose | |
Goswami et al. | A comprehensive review on real time object detection using deep learing model | |
Reddy P et al. | Multimodal spatiotemporal feature map for dynamic gesture recognition from real time video sequences | |
Zhang et al. | Weighted score-level feature fusion based on Dempster–Shafer evidence theory for action recognition | |
CN114140524A (en) | Closed loop detection system and method for multi-scale feature fusion | |
CN110163106A (en) | Integral type is tatooed detection and recognition methods and system | |
Caetano et al. | Magnitude-Orientation Stream network and depth information applied to activity recognition | |
Xiang | Object Detection in Finnish Movies |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20181106 |