CN108764019A - Video event detection method based on multi-source deep learning - Google Patents

Video event detection method based on multi-source deep learning

Info

Publication number
CN108764019A
CN108764019A
Authority
CN
China
Prior art keywords
video
gate
output
indicate
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810290777.4A
Other languages
Chinese (zh)
Inventor
刘安安
邵壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201810290777.4A priority Critical patent/CN108764019A/en
Publication of CN108764019A publication Critical patent/CN108764019A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video event detection method based on multi-source deep learning, which includes the following steps: convert each frame of the video to grayscale and compute inter-frame differences to obtain a three-dimensional array; feed the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, obtaining image features with the C3D algorithm as a 4096-dimensional feature vector; take the RGB image of the first frame of the original video as input, extract picture features with a CNN network structure, and use the output of the fc7 layer as the feature, obtaining a 2048-dimensional feature vector; concatenate the feature vectors to obtain a 6144-dimensional vector, and after all training videos have been processed and converted into vectors, perform dimensionality reduction, converting the 6144-dimensional data into 256-dimensional data; start the model training stage once each video to be processed has been mapped to a 256-dimensional vector; and test unknown samples with the trained model. The present invention greatly improves the efficiency and accuracy of video surveillance information processing.

Description

Video event detection method based on multi-source deep learning
Technical field
The present invention relates to the field of video event detection, and in particular to a video event detection method based on multi-source deep learning.
Background technology
Video surveillance is finding increasingly wide application in fields such as public safety and commerce. Because it is intuitive, accurate, timely, and rich in information, it is used in many settings. In recent years, with the rapid development of computer, network, image processing, and transmission technologies, video surveillance technology has also made significant progress. However, current video surveillance processing still relies on manual work, which is costly, and its processing speed and efficiency are low.
A variety of methods have been proposed to detect video events. First, markerless vision-based human motion analysis can provide a cheap, unobtrusive way to estimate human pose, and is therefore widely used in motion analysis. Fujiyoshi et al. (reference [1]) proposed a "star" skeletonization procedure to analyze target motion. Second, action or collective-activity recognition can reveal the existence of group events in video.
A video action recognition method based on shallow high-dimensional encoding of early local spatio-temporal features has also been proposed in the prior art. Sparse spatio-temporal interest points can be described with local spatio-temporal features, including histograms of oriented gradients (HOG) and histograms of optical flow (HOF) (reference [2]). These features are then encoded into bag-of-features (BoF) descriptors (reference [3]), and a support vector machine is used for the classification task. In addition, a large body of related work in the last decade has been devoted to the study of group activities in video. The vast majority of earlier work used hand-crafted features to describe individual spatio-temporal regions (reference [4]). Lan et al. (reference [5]) proposed a model that represents the hierarchical and interactive relationships from lower-level person information to higher-level group information and supports adaptive learning of latent structure.
Recently, multi-task learning methods have been applied to human social activity recognition. Liu et al. (reference [6]) proposed hierarchical clustering multi-task learning for grouping and recognizing human behavior. Video summarization is another approach to visual understanding and presentation; several methods can generate a summary from a long video. A representative approach is to generate a synopsis from an object and its activities appearing in different time periods of the video. Pritch et al. (reference [7]) also proposed a new method that clusters similar events or activities into short, coherent video synopses. Another class of methods generates text-based summaries. Chu et al. (reference [8]) proposed a multimedia analysis framework that processes video and text simultaneously, jointly building relationships between entities via scene graphs to understand events (reference [9]).
Most current methods need to handle multiple challenging visual analysis tasks. Lee (reference [10]) proposed an effective Gaussian mixture learning method for video background removal. Dai et al. (reference [11]) proposed a robust R-FCN object detection network. Although existing methods have shown effectiveness on some aspects of the problem, the automatic understanding of surveillance video still faces many challenges and limitations. The main challenges come from two sources: the complexity of the data and the limitations of the processing methods. For the data itself, the main challenges are low resolution, large data volume, complex event sets and scenes, and occlusion in the data sources.
On the method side, the main limitations are as follows:
1) Many methods depend on foreground-background segmentation techniques; however, these techniques can cause errors to accumulate.
2) Many methods depend on detection and tracking; however, the robustness of detection and tracking is relatively low across different videos and moving objects, and these shortcomings reduce the efficiency of temporal analysis.
3) When the data volume increases, the computational load rises sharply.
4) Most event detection problems in real life are multi-label problems; especially in surveillance video, multiple events can occur simultaneously. However, action recognition and group activity recognition are both single-label event detection methods, so both lose events that occur at the same time.
5) Most methods use only one feature extraction method, so the features are relatively homogeneous, and the representation of the underlying data is prone to inaccuracy.
Summary of the invention
The present invention provides a video event detection method based on multi-source deep learning. The invention processes video information on the basis of multi-source information, and the original model can be optimized with new data. Because C3D features are fused with traditional CNN features, the advantages of C3D features can be added while the original features remain mature and easy to extract, so the video is characterized more accurately and the efficiency and accuracy of video surveillance information processing can be greatly improved. The method is described below:
A video event detection method based on multi-source deep learning, the method comprising the following steps:
converting each frame of the video to grayscale and subtracting adjacent video frames to obtain a three-dimensional array; feeding the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, and obtaining image features with the C3D algorithm as a 4096-dimensional feature vector;
taking the RGB image of the first frame of the original video as input, extracting picture features with a CNN network structure, and using the output of the fc7 layer as the feature, obtaining a 2048-dimensional feature vector;
concatenating the feature vectors to obtain a 6144-dimensional vector; after all training videos have been processed and converted into vectors, performing dimensionality reduction, converting the 6144-dimensional data into 256-dimensional data;
starting the model training stage after each video to be processed has been mapped to a 256-dimensional vector; and testing unknown samples with the trained model.
Converting each frame of the video to grayscale and computing inter-frame differences to obtain the three-dimensional array is specifically:
For a given image sequence x = {x_1, x_2, …, x_n} with corresponding label set y = {y_1, y_2, …, y_m}, first convert each frame of the video to grayscale; then subtract the first frame from the second frame, the second frame from the third frame, and so on, thereby converting the original video data from one three-dimensional array into another three-dimensional array.
The C3D network is specifically:
All frames of the video image sequence x are divided into groups of 8 frames, and every 8 frames the fc7-layer data of a C3D network is output as a feature extraction result, yielding k 4096-dimensional feature vectors; finally, the fc7-layer data in the network structure is output as the feature, giving 4096-dimensional feature vectors.
The CNN network structure is specifically: a network structure in which convolutional layers, fully connected layers and pooling layers are interconnected, as shown in Fig. 5.
The model training stage specifically includes:
including the labels of the actions to be recognized and the <BOS> tag in the candidate words;
learning by maximizing the class index function and predicting output weights by the maximum likelihood method;
computing the loss of the softmax layer and propagating it in the LSTM.
The method further includes:
In the decoding stage, the probability distribution of a hidden state is built by the following formula:

$$p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid h_{n+t-1}, y_{t-1})$$

where p(y_t | h_{n+t-1}, y_{t-1}) is obtained from a softmax function over all words, h_t is the hidden state at step t, and the output at step t depends not only on the hidden state but also on the output of the previous step t-1.
Testing unknown samples with the trained model is specifically:
starting with the <BOS> tag, all hidden state values h_t of the obtained video image sequence x are input into the second LSTM network system, and the following variables are computed separately:
$$
\begin{aligned}
f1_t &= \sigma(W_{1xf} z_t + W_{1zf} z_{t-1} + b_{1f}) \\
i1_t &= \sigma(W_{1xi} h_t + W_{1zi} z_{t-1} + b_{1i}) \\
g1_t &= \tanh(W_{1xg} h_t + W_{1zg} z_{t-1} + b_{1g}) \\
c1_t &= f1_t \odot c1_{t-1} + i1_t \odot g1_t \\
o1_t &= \sigma(W_{1of} h_t + W_{1zo} z_{t-1} + b_{1o}) \\
z_t &= o1_t \odot \tanh(c1_t)
\end{aligned}
$$
where σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes element-wise multiplication; W_{1xf} and W_{1zf} are the weight matrices applied to the input and to the previous output for the forget gate; W_{1xi} and W_{1zi} are the weight matrices applied to the input and to the previous output for the input gate; W_{1xg} and W_{1zg} are the weight matrices applied to the input and to the previous output when computing the new candidate value c1_t; W_{1of} and W_{1zo} are the weight matrices applied to the input and to the previous output for the output gate; b_{1f}, b_{1i}, b_{1g} and b_{1o} are the corresponding biases; f1_t is the forget-gate output, i1_t the input-gate output, and o1_t the output-gate output; c1_t is the cell state value, z_t the output value, and g1_t the new candidate value for c1_t.
The advantageous effects of the technical solution provided by the invention are:
1. Multiple events occurring in video surveillance are reported by this method, avoiding the object detection and tracking process;
2. The image information and motion information, spatial information and temporal information contained in the video are processed separately in different deep networks, achieving multi-source data fusion;
3. This method designs a completely new two-stream network architecture for video surveillance processing, improving processing efficiency and processing performance.
Description of the drawings
Fig. 1 is the flow chart of the video event detection method based on multi-source deep learning;
Fig. 2 is the bilateral LSTM network structure;
Fig. 3 is the flow diagram of C3D convolution;
Fig. 4 is the schematic network structure of C3D;
Fig. 5 is the schematic network structure of the CNN used to extract picture features.
Specific embodiments
To make the object, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below.
Embodiment 1
An embodiment of the present invention provides a video event detection method based on multi-source deep learning. Referring to Fig. 1, the detection method includes the following steps:
101: Convert each frame of the video to grayscale and compute inter-frame differences to obtain a three-dimensional array; feed the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, and obtain image features with the C3D algorithm as a 4096-dimensional feature vector;
102: Take the RGB image of the first frame of the original video as input, extract picture features with a CNN network structure, and use the output of the fc7 layer as the feature, obtaining a 2048-dimensional feature vector;
103: Concatenate the feature vectors to obtain a 6144-dimensional vector; after all training videos have been processed and converted into vectors, perform dimensionality reduction, converting the 6144-dimensional data into 256-dimensional data;
104: Start the model training stage after each video to be processed has been mapped to a 256-dimensional vector; test unknown samples with the trained model.
In summary, the embodiment of the present invention processes video information on the basis of multi-source information, the original model can be optimized with new data, and the efficiency and accuracy of video surveillance information processing are greatly improved.
Embodiment 2
The scheme of Embodiment 1 is further described below with reference to specific calculation formulas and examples, as detailed in the following description:
C3D is a piece of work from Facebook; the network is built from 3D convolution and 3D pooling. Through 3D convolution, C3D can process video directly (or rather, volumes of video frames). The greatest advantage of C3D is its speed: 314 fps, and this was measured on a graphics card from two years earlier; with an Nvidia 1080 card it can reach 600 fps or more. The efficiency of C3D is therefore significantly higher than that of other methods.
201: First, extract video features with the C3D algorithm;
Convolutional neural networks (CNNs) have been widely applied in computer vision in recent years, for tasks including classification, detection and segmentation. These tasks are typically performed on images, using two-dimensional convolution (i.e., the convolution kernel is two-dimensional). For problems based on video analysis, however, 2D convolution cannot capture temporal information well, so 3D convolution was proposed.
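For intuition, the shape difference between 2D and 3D convolution can be seen in a minimal PyTorch sketch (PyTorch is used here only for illustration; the patent does not prescribe a framework):

```python
import torch
import torch.nn as nn

# 2D convolution sees a single image: (batch, channels, height, width).
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
image = torch.randn(1, 3, 112, 112)
print(conv2d(image).shape)   # torch.Size([1, 64, 112, 112])

# 3D convolution sees a clip of frames: (batch, channels, frames, height, width),
# so the kernel also slides along the temporal axis.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
clip = torch.randn(1, 3, 8, 112, 112)
print(conv3d(clip).shape)    # torch.Size([1, 64, 8, 112, 112])
```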
C3D is an improvement on the original convolutional neural network algorithm. Each convolutional layer of the original CNN performs convolution with a two-dimensional kernel; in the C3D algorithm this is changed to a three-dimensional kernel, which makes it possible to process multi-frame data. The network structure used is shown in Figure 3. The input video is divided into groups of 8 frames, which are then fed into the C3D algorithm to obtain multiple groups of 4096-dimensional vectors.
202: For a given image sequence x = {x_1, x_2, …, x_n} with corresponding label set y = {y_1, y_2, …, y_m}, first convert each frame of the video to grayscale; then subtract the first frame from the second frame, the second frame from the third frame, and so on; in this way, the original video data is converted from one three-dimensional array into another three-dimensional array;
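A minimal sketch of this preprocessing step, assuming OpenCV and NumPy are available; the function name is illustrative:

```python
import cv2
import numpy as np

def frame_difference_volume(video_path):
    """Grayscale every frame, then stack the inter-frame differences
    into a three-dimensional array of shape (n_frames - 1, H, W)."""
    cap = cv2.VideoCapture(video_path)
    grays = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        grays.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    grays = np.stack(grays).astype(np.int16)  # int16 avoids uint8 wrap-around
    # Second frame minus first, third minus second, and so on.
    return grays[1:] - grays[:-1]
```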
203: Take the transformed three-dimensional array obtained in the above step as input and feed it into the C3D network shown in Figure 3; feature extraction is performed with the Sport1M model trained in advance for 1,900,000 iterations on another data set, and image features are obtained with the C3D algorithm;
Referring to Fig. 3, the C3D network is specifically: for each input video image sequence x = {x_1, x_2, …, x_t, …, x_n}, where x_1, x_2, …, x_t, …, x_n correspond respectively to the 1st frame, the 2nd frame, …, the t-th frame, …, the n-th frame of the sequence, all frames of the video image sequence x are divided into groups of 8 frames, and every 8 frames the fc7-layer data of a C3D network is output as a feature extraction result, yielding k 4096-dimensional feature vectors, where k is n ÷ 8 rounded down. Finally, the fc7-layer data in the network structure is output as the feature, giving 4096-dimensional feature vectors.
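A sketch of the 8-frame grouping described above; c3d_fc7 stands for a wrapper around the pre-trained Sport1M C3D model that returns the 4096-dimensional fc7 activation of one clip (a hypothetical interface, since the patent does not fix an implementation):

```python
import numpy as np

CLIP_LEN = 8  # frames per clip, as specified in the text

def extract_c3d_features(volume, c3d_fc7):
    """Split the frame volume into consecutive 8-frame clips and keep
    the 4096-d fc7 activation of each clip; returns an array of shape
    (k, 4096), where k = floor(n_frames / 8)."""
    k = volume.shape[0] // CLIP_LEN
    feats = [c3d_fc7(volume[i * CLIP_LEN:(i + 1) * CLIP_LEN])
             for i in range(k)]
    return np.stack(feats)
```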
204: Take the RGB image of the first frame of the original video as input, extract picture features with the CNN network structure shown in Fig. 4, and use the output of the fc7 layer as the feature, obtaining a 2048-dimensional feature vector;
Referring to Fig. 4, the CNN network structure is specifically: a network structure in which convolutional layers, fully connected layers and pooling layers are interconnected.
205: Then concatenate the feature vectors extracted in the previous two steps to obtain a 6144-dimensional vector; after all training videos have been processed in this way and fully converted into vectors, perform dimensionality reduction, converting the 6144-dimensional data into 256-dimensional data;
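A sketch of this fusion-and-reduction step; PCA is an assumption here, since the patent does not name the dimensionality-reduction method:

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_and_reduce(c3d_feats, cnn_feats, n_components=256):
    """Concatenate the (N, 4096) C3D features and (N, 2048) CNN features
    of N training videos into an (N, 6144) matrix, then reduce each row
    to 256 dimensions (PCA requires N >= 256 training videos here)."""
    fused = np.concatenate([c3d_feats, cnn_feats], axis=1)
    pca = PCA(n_components=n_components)
    return pca.fit_transform(fused), pca  # (N, 256) data and the fitted reducer
```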
206: After each video to be processed has been mapped to a 256-dimensional vector by step 205, the model training stage begins;
In the decoding training stage, suppose there are n actions to be recognized; the labels take the form 1 to n, and these n labels are also called candidate words. During data preprocessing, the actions occurring in a video are connected into a sentence (for example, "1, 3, 6" means that actions 1, 3 and 6 occurred in the video in sequence). In addition, since the candidate words generally need a start marker, the so-called <BOS> tag, <BOS> is also included in the candidate words (giving <BOS>, 1, 3, 6).
Then, this framework attempts to learn by maximizing the class index function (ensuring that the probability distribution in formula (5) attains its maximum value θ_max) via the maximum likelihood method, and then predicts the output weights from the hidden state obtained from the probability distribution in formula (2) and from the output at the previous moment (at t = 0, the predicted output can be an arbitrary initial value).
Finally, the loss of the softmax layer is computed based on formula (4) and propagated in the LSTM. The LSTM is an upgraded variant of the RNN (recurrent neural network): it allows earlier information to be connected with the current task, enabling the computer to handle inputs and outputs that are sequences of different lengths. For the input x_t at time t, the following variables can be computed successively from h_{t-1} and x_t by the following formulas (1):
$$
\begin{aligned}
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
o_t &= \sigma(W_{of} x_t + W_{ho} h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
where σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes element-wise multiplication; W_{xf} and W_{hf} are the weight matrices applied to the input and to the hidden state for the forget gate; W_{xi} and W_{hi} are the weight matrices applied to the input and to the hidden state for the input gate; W_{xg} and W_{hg} are the weight matrices applied to the input and to the hidden state when computing the new candidate value c_t; W_{of} and W_{ho} are the weight matrices applied to the input and to the hidden state for the output gate; b_f, b_i, b_g and b_o are the corresponding biases; f_t is the forget-gate output, i_t the input-gate output, and o_t the output-gate output; c_t is the cell state value, h_t the hidden state value, and g_t the new candidate value for c_t.
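A NumPy transcription of one step of formulas (1); the weight matrices W and biases b are assumed to be already-initialized dictionaries keyed by gate name (an illustrative layout):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following formulas (1)."""
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])  # forget gate
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])  # input gate
    g_t = np.tanh(W["xg"] @ x_t + W["hg"] @ h_prev + b["g"])  # candidate value
    c_t = f_t * c_prev + i_t * g_t                            # new cell state
    o_t = sigmoid(W["of"] @ x_t + W["ho"] @ h_prev + b["o"])  # output gate
    h_t = o_t * np.tanh(c_t)                                  # new hidden state
    return h_t, c_t
```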
The LSTM is also a tool of the encoder-decoder framework. Suppose the input is x = {x_1, x_2, …, x_n} and the output is y = {y_1, y_2, …, y_m}. In the encoding stage, x is encoded and the variables above are computed by the LSTM. In the decoding stage, the probability distribution of a hidden state is built by a formula of our own design:

$$p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid h_{n+t-1}, y_{t-1}) \qquad (2)$$

where p(y_t | h_{n+t-1}, y_{t-1}) is obtained from a softmax function over all words (labels), and h_t is the hidden state at step t, computed by formula (1); the output at step t depends not only on the hidden state but also on the output of the previous step t-1.
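A small sketch of how p(y_t | h_{n+t-1}, y_{t-1}) can be realized as a softmax over the candidate words, assuming a learned output projection W_out and bias b_out (illustrative names):

```python
import numpy as np

def label_distribution(h_t, W_out, b_out):
    """Project a decoder hidden state onto the candidate words
    (the n action labels plus <BOS>) and normalize with softmax."""
    scores = W_out @ h_t + b_out
    scores -= scores.max()                # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # one probability per candidate word
```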
207: Test unknown samples with the model trained in step 206.
In the test phase, the input video is processed in the same way as in the training steps above, yielding a 256-dimensional feature, and decoding begins. Starting with the <BOS> tag, all hidden state values h_t, t = 1, 2, …, n, of the video image sequence x obtained in step 206 are input into the second LSTM network system exclusive to this method, and the following variables are computed separately:

$$
\begin{aligned}
f1_t &= \sigma(W_{1xf} z_t + W_{1zf} z_{t-1} + b_{1f}) \\
i1_t &= \sigma(W_{1xi} h_t + W_{1zi} z_{t-1} + b_{1i}) \\
g1_t &= \tanh(W_{1xg} h_t + W_{1zg} z_{t-1} + b_{1g}) \\
c1_t &= f1_t \odot c1_{t-1} + i1_t \odot g1_t \\
o1_t &= \sigma(W_{1of} h_t + W_{1zo} z_{t-1} + b_{1o}) \\
z_t &= o1_t \odot \tanh(c1_t)
\end{aligned}
$$
where σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes element-wise multiplication; W_{1xf} and W_{1zf} are the weight matrices applied to the input and to the previous output for the forget gate; W_{1xi} and W_{1zi} are the weight matrices applied to the input and to the previous output for the input gate; W_{1xg} and W_{1zg} are the weight matrices applied to the input and to the previous output when computing the new candidate value c1_t; W_{1of} and W_{1zo} are the weight matrices applied to the input and to the previous output for the output gate; b_{1f}, b_{1i}, b_{1g} and b_{1o} are the corresponding biases; f1_t is the forget-gate output, i1_t the input-gate output, and o1_t the output-gate output; c1_t is the cell state value, z_t the output value, and g1_t the new candidate value for c1_t.
The output z_t gives, for each word, a score for each label in the vocabulary. Mean pooling is then applied per label; the pooled value represents the likelihood that each event occurs in the segment. The sigmoid function in formula (6) produces a probability distribution, giving the probability that each action occurs in the corresponding video; each probability is then compared with a given threshold (generally set to 0.5) to decide the 0/1 label of whether the action occurs in the video; finally, the labels of all potential actions are integrated to obtain the prediction result for the whole video.
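A sketch of this decision step, assuming the per-step label scores from z_t have been stacked into an array of shape (n_steps, n_labels); the 0.5 threshold follows the text:

```python
import numpy as np

def predict_events(label_scores, threshold=0.5):
    """Mean-pool each label's scores over time, squash the pooled values
    to probabilities with a sigmoid (formula (6)), and threshold to get
    the multi-label prediction for the whole video."""
    pooled = label_scores.mean(axis=0)        # one score per label
    probs = 1.0 / (1.0 + np.exp(-pooled))     # sigmoid
    return (probs > threshold).astype(int)    # 1 = event predicted to occur
```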
In summary, the embodiment of the present invention processes video information on the basis of multi-source information, the original model can be optimized with new data, and the efficiency and accuracy of video surveillance information processing are greatly improved.
References
[1] Gutchess D, Trajkovics M, Cohen-Solal E, et al. A background model initialization algorithm for video surveillance[C]//Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on. IEEE, 2001, 1: 733-740.
[2] Fan C, Crandall D J. Deepdiary: Automatically captioning lifelogging image streams[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 459-473.
[3] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories[C]//Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. IEEE, 2006, 2: 2169-2178.
[4] Ibrahim M S, Muralidharan S, Deng Z, et al. A hierarchical deep temporal model for group activity recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1971-1980.
[5] Lan T, Wang Y, Yang W, et al. Discriminative latent models for recognizing contextual group activities[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(8): 1549-1562.
[6] Liu A A, Su Y T, Nie W Z, et al. Hierarchical clustering multi-task learning for joint human action grouping and recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(1): 102-114.
[7] Pritch Y, Ratovitch S, Hendel A, et al. Clustered synopsis of surveillance video[C]//Advanced Video and Signal Based Surveillance, 2009. AVSS '09. Sixth IEEE International Conference on. IEEE, 2009: 195-200.
[8] Tu K, Meng M, Lee M W, et al. Joint video and text parsing for understanding events and answering queries[J]. IEEE MultiMedia, 2014, 21(2): 42-70.
[9] He X, Gao M, Kan M Y, et al. BiRank: Towards ranking on bipartite graphs[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(1): 57-71.
[10] Hochreiter S, Schmidhuber J. LSTM can solve hard long time lag problems[C]//Advances in Neural Information Processing Systems. 1997: 473-479.
[11] Venugopalan S, Rohrbach M, Donahue J, et al. Sequence to Sequence -- Video to Text[C]//IEEE International Conference on Computer Vision. IEEE, 2015: 4534-4542.
Except where otherwise specified, the embodiment of the present invention does not limit the models of the devices involved; any device that can accomplish the above functions may be used.
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
The foregoing describes merely preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A video event detection method based on multi-source deep learning, characterized in that the method comprises the following steps:
converting each frame of the video to grayscale and computing inter-frame differences to obtain a three-dimensional array; feeding the three-dimensional array into a C3D network whose model has been trained in advance for feature extraction, and obtaining image features with the C3D algorithm as a 4096-dimensional feature vector;
taking the RGB image of the first frame of the original video as input, extracting picture features with a CNN network structure, and using the output of the fc7 layer as the feature to obtain a 2048-dimensional feature vector;
concatenating the feature vectors to obtain a 6144-dimensional vector; after all training videos have been processed and converted into vectors, performing dimensionality reduction, converting the 6144-dimensional data into 256-dimensional data;
starting the model training stage after each video to be processed has been mapped to a 256-dimensional vector; and testing unknown samples with the trained model.
2. The video event detection method based on multi-source deep learning according to claim 1, characterized in that converting each frame of the video to grayscale and computing inter-frame differences to obtain the three-dimensional array is specifically:
for a given image sequence x = {x_1, x_2, …, x_n} with corresponding label set y = {y_1, y_2, …, y_m}, first converting each frame of the video to grayscale, then subtracting the first frame from the second frame, the second frame from the third frame, and so on, thereby converting the original video data from one three-dimensional array into another three-dimensional array.
3. The video event detection method based on multi-source deep learning according to claim 1, characterized in that the C3D network is specifically:
all frames of the video image sequence x are divided into groups of 8 frames, and every 8 frames the fc7-layer data of a C3D network is output as a feature extraction result, yielding k 4096-dimensional feature vectors; finally, the fc7-layer data in the network structure is output as the feature, giving 4096-dimensional feature vectors.
4. The video event detection method based on multi-source deep learning according to claim 1, characterized in that the CNN network structure is specifically: a network structure in which convolutional layers, fully connected layers and pooling layers are interconnected.
5. The video event detection method based on multi-source deep learning according to claim 1, characterized in that the model training stage specifically includes:
including the labels of the actions to be recognized and the <BOS> tag in the candidate words;
learning by maximizing the class index function and predicting output weights by the maximum likelihood method;
computing the loss of the softmax layer and propagating it in the LSTM.
6. The video event detection method based on multi-source deep learning according to claim 1, characterized in that the method further comprises:
in the decoding stage, building the probability distribution of a hidden state by the following formula:

$$p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid h_{n+t-1}, y_{t-1})$$

where p(y_t | h_{n+t-1}, y_{t-1}) is obtained from a softmax function over all words, and h_t is the hidden state at step t; the output at step t depends not only on the hidden state but also on the output of the previous step t-1.
7. The video event detection method based on multi-source deep learning according to claim 1, characterized in that testing unknown samples with the trained model is specifically:
starting with the <BOS> tag, inputting all hidden state values h_t of the obtained video image sequence x into a second LSTM network system, and computing the following variables separately:
$$
\begin{aligned}
f1_t &= \sigma(W_{1xf} z_t + W_{1zf} z_{t-1} + b_{1f}) \\
i1_t &= \sigma(W_{1xi} h_t + W_{1zi} z_{t-1} + b_{1i}) \\
g1_t &= \tanh(W_{1xg} h_t + W_{1zg} z_{t-1} + b_{1g}) \\
c1_t &= f1_t \odot c1_{t-1} + i1_t \odot g1_t \\
o1_t &= \sigma(W_{1of} h_t + W_{1zo} z_{t-1} + b_{1o}) \\
z_t &= o1_t \odot \tanh(c1_t)
\end{aligned}
$$
where σ is the element-wise logistic sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes element-wise multiplication; W_{1xf} and W_{1zf} are the weight matrices applied to the input and to the previous output for the forget gate; W_{1xi} and W_{1zi} are the weight matrices applied to the input and to the previous output for the input gate; W_{1xg} and W_{1zg} are the weight matrices applied to the input and to the previous output when computing the new candidate value c1_t; W_{1of} and W_{1zo} are the weight matrices applied to the input and to the previous output for the output gate; b_{1f}, b_{1i}, b_{1g} and b_{1o} are the corresponding biases; f1_t is the forget-gate output, i1_t the input-gate output, and o1_t the output-gate output; c1_t is the cell state value, z_t the output value, and g1_t the new candidate value for c1_t.
CN201810290777.4A 2018-04-03 2018-04-03 Video event detection method based on multi-source deep learning Pending CN108764019A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810290777.4A CN108764019A (en) 2018-04-03 2018-04-03 Video event detection method based on multi-source deep learning


Publications (1)

Publication Number Publication Date
CN108764019A (en) 2018-11-06

Family

ID=63981088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810290777.4A Pending CN108764019A (en) 2018-04-03 2018-04-03 Video event detection method based on multi-source deep learning

Country Status (1)

Country Link
CN (1) CN108764019A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066973A * 2017-04-17 2017-08-18 杭州电子科技大学 A video content description method using a spatio-temporal attention model
CN107818307A * 2017-10-31 2018-03-20 天津大学 Multi-label video event detection method based on LSTM networks
CN107832708A * 2017-11-09 2018-03-23 云丁网络技术(北京)有限公司 Human motion recognition method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AN-AN LIU ET AL.: "LSTM-based multi-label video event detection", Springer *
DU TRAN ET AL.: "Learning Spatiotemporal Features with 3D Convolutional Networks", 2015 IEEE International Conference on Computer Vision (ICCV) *
YINGWEI PAN ET AL.: "Jointly Modeling Embedding and Translation to Bridge Video and Language", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112868019A (en) * 2018-11-14 2021-05-28 北京比特大陆科技有限公司 Feature processing method and device, storage medium and program product
CN109697852A (en) * 2019-01-23 2019-04-30 吉林大学 Urban road congestion degree prediction method based on time-series traffic events
CN110443182A (en) * 2019-07-30 2019-11-12 深圳市博铭维智能科技有限公司 Urban drainage pipeline video anomaly detection method based on multiple-instance learning
CN110738128A (en) * 2019-09-19 2020-01-31 天津大学 Repeated video detection method based on deep learning
CN112668366A (en) * 2019-10-15 2021-04-16 华为技术有限公司 Image recognition method, image recognition device, computer-readable storage medium and chip
CN112668366B (en) * 2019-10-15 2024-04-26 华为云计算技术有限公司 Image recognition method, device, computer readable storage medium and chip
CN110826702A (en) * 2019-11-18 2020-02-21 方玉明 Abnormal event detection method for multitask deep network
CN111680660B (en) * 2020-06-17 2023-03-24 郑州大学 Human behavior detection method based on multi-source heterogeneous data stream
CN111680660A (en) * 2020-06-17 2020-09-18 郑州大学 Human behavior detection method based on multi-source heterogeneous data stream
CN111814644A (en) * 2020-07-01 2020-10-23 重庆邮电大学 Video abnormal event detection method based on disturbance visual interpretation
CN111814644B (en) * 2020-07-01 2022-05-03 重庆邮电大学 Video abnormal event detection method based on disturbance visual interpretation
WO2023206532A1 (en) * 2022-04-29 2023-11-02 Oppo广东移动通信有限公司 Prediction method and apparatus, electronic device and computer-readable storage medium
CN116778395A (en) * 2023-08-21 2023-09-19 成都理工大学 Mountain torrent flood video identification monitoring method based on deep learning
CN116778395B (en) * 2023-08-21 2023-10-24 成都理工大学 Mountain torrent flood video identification monitoring method based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181106

WD01 Invention patent application deemed withdrawn after publication