CN105844239B - A violent terrorist video detection method based on CNN and LSTM - Google Patents

A violent terrorist video detection method based on CNN and LSTM

Info

Publication number
CN105844239B
CN105844239B CN201610168334.9A
Authority
CN
China
Prior art keywords
cnn
feature
lstm
video
spp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610168334.9A
Other languages
Chinese (zh)
Other versions
CN105844239A (en)
Inventor
苏菲
宋凡
宋一凡
赵志诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201610168334.9A priority Critical patent/CN105844239B/en
Publication of CN105844239A publication Critical patent/CN105844239A/en
Application granted granted Critical
Publication of CN105844239B publication Critical patent/CN105844239B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/44 Event detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a violent terrorist video detection method based on CNN and LSTM, belonging to the technical fields of pattern recognition, video detection, and deep learning. The detection method first samples key frames from the video under test and extracts key-frame features; it then performs video-level representation and discrimination, comprising the VLAD feature representation and SVM discrimination of a CNN semantic module, the scene VLAD feature representation and SVM discrimination of a CNN scene module, and the LSTM discrimination of an LSTM temporal module; finally it fuses the results. The invention exploits the strengths of CNNs in image feature extraction and of LSTMs in modeling temporal sequences, and fully accounts for the scene characteristics of violent terrorist videos. In actual tests the detection mAP reaches 98.0%, close to human performance. In terms of running speed, with single-machine GPU acceleration alone, 76.4 seconds of online video can be processed per second, which is suitable for blocking the spread of violent terrorist videos on large video websites and helps maintain social stability and long-term national security.

Description

A violent terrorist video detection method based on CNN and LSTM
Technical field
The invention belongs to the technical fields of pattern recognition, video detection, and deep learning, and in particular relates to a violent terrorist video detection method based on CNN and LSTM.
Background art
In recent years, large numbers of domestic and foreign violent terrorist videos have been spread illegally on the Internet and have become a malignant tumor endangering social stability. However, automated violent terrorist video detection technology is still in its development stage; most approaches reuse existing event-video detection methods, which fall roughly into three classes: video detection methods based on local image features, methods based on semantic concepts, and methods based on convolutional neural networks (Convolutional Neural Network, CNN).
Reference [1] (Sun, Chen, and Ram Nevatia. "Large-scale web video event classification by use of fisher vectors." In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pp. 15-22. IEEE, 2013.) discloses a video detection method based on local image features. At the key-frame level it first extracts local image features, such as Scale-Invariant Feature Transform (SIFT) features; at the video level it then obtains a global representation of the video by means of Fisher Vector encoding; finally it applies a Support Vector Machine (SVM) classifier to decide the class of the video, i.e. violent terrorist video or non-terrorist video. The method needs little manual annotation during training and is simple to implement, but suffers from the following shortcomings: (1) detection accuracy is limited by the local features used; (2) detection is slow: the computational cost of local features such as SIFT is high, so the method is ill-suited to large-scale video detection tasks and of limited practical value.
Reference [2] (Liu, J.; Yu, Qian; Javed, O.; Ali, S.; Tamrakar, A.; Divakaran, A.; Hui Cheng; & Sawhney, H., Video event recognition using concept attributes, WACV, 2013.) discloses a video detection method based on semantic concepts. At the key-frame level it first combines local feature extraction with SVM classifiers to estimate the confidence of various preset semantic concepts in the image (for violent terrorist videos, these concepts include, but are not limited to, guns, explosions, masked men, and logos of terrorist organizations); at the video level it then generates the global features of the video by Fisher Vector encoding; finally an SVM classifier is applied again to decide the type of the video. Because the preset semantic concepts are highly indicative, semantic-concept-based video detection achieves relatively high accuracy in identifying violent terrorist videos, but it has the following drawbacks: (1) training requires a large number of annotated image samples, at considerable labeling cost; (2) when a violent terrorist video under test contains none of the preset concepts, detection accuracy is not guaranteed; (3) detection is slow.
Reference [3] (Xu, Zhongwen, Yi Yang, and Alexander G. Hauptmann. "A discriminative CNN video representation for event detection." arXiv preprint arXiv:1411.4006 (2014)) discloses a video detection method based on CNN semantic features. In the training stage, a CNN semantic model is trained on a large number of annotated images. In the test stage, the trained model extracts the CNN semantic features of the key frames (e.g. the FC6, FC7, and SPP features); at the video level, the Vector of Locally Aggregated Descriptors (VLAD) method then encodes these features into a high-dimensional representation of the video. The method achieves good results on the Multimedia Event Detection (MED) dataset. It takes full advantage of CNNs in still-image feature extraction and can perform well in violent terrorist video detection, but the following aspects can still be improved: (1) during VLAD encoding the method makes insufficient use of the temporal characteristics of the video; (2) the method extracts only the CNN semantic features of the key frames and ignores other distinctive characteristics of violent terrorist videos. In short, the video detection method based on CNN semantic features still leaves room for performance gains.
Summary of the invention
To solve the problems of the prior art, the invention proposes a violent terrorist video detection method based on CNN and Long Short-Term Memory (LSTM). The method exploits the strengths of CNNs in image feature extraction and of LSTMs in modeling temporal sequences, and fully accounts for the scene characteristics of violent terrorist videos. In actual tests the detection mAP reaches 98.0%, close to human performance. In terms of running speed, with single-machine GPU acceleration alone, 76.4 seconds of online video (average bit rate 632 kbps) can be processed per second, which is suitable for blocking the spread of violent terrorist videos on large video websites and helps maintain social stability and long-term national security.
Analysis of a large number of violent terrorist videos shows that they are highly distinctive in two respects: temporal structure and shooting scene. Based on this observation, on top of the original CNN-semantic-feature video detection module (the CNN semantic module), the invention adds a video detection module based on CNN scene features (the CNN scene module) and a temporal detection module based on LSTM (the LSTM temporal module). For a video under test, the invention fuses detection results from the semantic, scene, and temporal-structure aspects to decide more comprehensively whether the video is terrorism-related, reducing the false detection rate and increasing the method's practical value.
The violent terrorist video detection method based on CNN and LSTM provided by the invention comprises the following steps:
Step 1: sample key frames from the video under test and extract the key-frame features.
Step 2: perform video-level representation and discrimination using the extracted key-frame features, comprising the VLAD feature representation and SVM discrimination of the CNN semantic module, the scene VLAD feature representation and SVM discrimination of the CNN scene module, and the LSTM discrimination of the LSTM temporal module.
Step 3: fuse the results, using a hierarchical fusion strategy based on validation-set mAP values: for a video to be identified, the decision scores of the three modules (CNN semantic module, CNN scene module, and LSTM temporal module) are computed separately and then fused with each module's mAP value on the validation set as its weight.
The advantages or beneficial effects of the invention are:
(1) The prior art uses only a CNN semantic module and ignores the temporal information of the video. To fully exploit the characteristics of violent terrorist videos in temporal structure, the invention adds an LSTM temporal module on top of the original method. Test results show that introducing temporal information yields a significant gain in recognition accuracy.
(2) Based on statistics and analysis of large-scale violent terrorist video samples, the invention finds that such videos are highly distinctive in their shooting scenes. Therefore, on top of the original structure, the invention adds a CNN scene module to violent terrorist video detection, which guarantees recognition accuracy under particular video scenes.
The violent terrorist video detection method based on CNN and LSTM provided by the invention is mainly applicable to government network-supervision departments and large video websites, for detecting whether user-uploaded videos involve violent terrorist content. Once a video is suspected of containing such illegal content, a warning should be raised promptly and the video handed over for manual review:
(1) The invention can be applied to the campaigns of government network-supervision departments to root out violent terrorist audio and video online. On top of the existing manual-reporting mechanism, the invention is used to sample and inspect the online videos of major video websites; rectification notices are issued to the video websites where problems are found, safeguarding the domestic Internet environment.
(2) The invention can be applied to the content-safety systems of large video websites: it can filter out violent terrorist content while users upload videos, and it can also audit videos already in stock, sparing the website unnecessary losses from crossing content-safety red lines.
Detailed description of the invention
Fig. 1 is a flow diagram of the video detection method provided by the invention.
Fig. 2 is a schematic diagram of SPP feature extraction in the invention.
Fig. 3 is a schematic diagram of the structure of an LSTM neural unit in the invention.
Specific embodiments
The invention is described in detail below with reference to the accompanying drawings and embodiments.
The invention provides a violent terrorist video detection method based on CNN and LSTM. As shown in Fig. 1, the detection method comprises the following steps:
Step 1: sample key frames from the video under test and extract the key-frame features.
(1) For the video under test, key frames are first sampled at equal intervals of 1 second, yielding the key-frame images.
(2) Each key-frame image is downsampled to 227 × 227 and fed into the CNN semantic model and the CNN scene model, which extract the CNN semantic features and the CNN scene features of the key-frame image, respectively.
Both the CNN semantic features and the CNN scene features comprise three kinds of features: the FC6 feature, the FC7 feature, and the SPP feature. The FC6 and FC7 features are ordinary 4096-dimensional vectors; the extraction of the SPP feature is more involved and is described in detail below.
As illustrated by the SPP feature extraction diagram in Fig. 2, the SPP feature is extracted after the Conv5 layer (Conv5 stands for convolutional layer 5, the fifth convolutional layer of the CNN model). The Conv5 layer fully preserves the spatial position information of the target, but its feature dimensionality is too high for direct use. To avoid this problem, the Conv5 feature maps are first partitioned spatially into 1 × 1, 2 × 2, and 3 × 3 grids, and max pooling is then applied within each grid cell, yielding 14 vectors of 256 dimensions (256D). Each dimension of each vector corresponds to some explicit or implicit semantic concept; together these vectors constitute the SPP feature.
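The 1 × 1, 2 × 2, 3 × 3 pyramid pooling described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the toy feature map has 2 channels instead of Conv5's 256, and the grid-cell boundaries use simple integer division.

```python
def spp_max_pool(fmap, levels=(1, 2, 3)):
    """Spatial-pyramid max pooling over a C x H x W feature map.

    For each pyramid level n, the map is split into an n x n grid and
    each cell is max-pooled per channel, giving 1 + 4 + 9 = 14 vectors
    of C dimensions (C = 256 for Conv5 in the patent).
    """
    C = len(fmap)
    H, W = len(fmap[0]), len(fmap[0][0])
    vectors = []
    for n in levels:
        for gy in range(n):
            for gx in range(n):
                # integer cell boundaries that jointly cover the whole map
                y0, y1 = gy * H // n, (gy + 1) * H // n
                x0, x1 = gx * W // n, (gx + 1) * W // n
                vectors.append([
                    max(fmap[c][y][x]
                        for y in range(y0, y1)
                        for x in range(x0, x1))
                    for c in range(C)
                ])
    return vectors  # 14 vectors, each of length C

# Toy 2-channel 6x6 map: channel 0 holds 0..35, channel 1 is constant.
toy = [[[y * 6 + x for x in range(6)] for y in range(6)],
       [[1.0] * 6 for _ in range(6)]]
out = spp_max_pool(toy)
print(len(out), len(out[0]))  # 14 2
```

The 14 pooled vectors are then concatenated per key frame to serve as the SPP feature.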
For each key-frame image, the invention extracts all three kinds of CNN semantic features (the SPP, FC6, and FC7 features) and all three kinds of CNN scene features (likewise SPP, FC6, and FC7), which are then fed as required into the different video-level discrimination modules for further processing.
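The equal-interval key-frame sampling of sub-step (1) above amounts to picking one frame per second of video. A minimal sketch, with an illustrative frame rate and clip length (actual decoding is assumed to happen elsewhere):

```python
def keyframe_indices(total_frames, fps, interval_s=1.0):
    """Return the frame indices sampled at fixed time intervals.

    One frame every `interval_s` seconds, starting at t = 0, mirroring
    the equal-interval key-frame sampling (interval = 1 second).
    """
    step = fps * interval_s          # frames between consecutive key frames
    indices = []
    t = 0.0
    while round(t) < total_frames:
        indices.append(int(round(t)))
        t += step
    return indices

# A 10-second clip at 25 fps yields 10 key frames: 0, 25, 50, ..., 225.
print(keyframe_indices(250, 25))
```

Each selected frame would then be downsampled to 227 × 227 before entering the CNN models, as in sub-step (2).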
Step 2: perform video-level representation and discrimination using the extracted key-frame features.
The video level comprises three independent feature representations and discriminations: the VLAD feature representation and SVM discrimination of the CNN semantic module, the scene VLAD feature representation and SVM discrimination of the CNN scene module, and the LSTM discrimination of the LSTM temporal module.
For the semantic VLAD feature representation and SVM discrimination of the CNN semantic module, the input features are the three kinds of CNN semantic features (SPP, FC6, FC7). First, the method of Principal Components Analysis (PCA) reduces the three features to 128, 256, and 256 dimensions, respectively.
Then, using the VLAD method, the D-dimensional feature vectors after dimensionality reduction are projected as accumulated differences against the cluster-centre set C = {c1, c2, ..., cK} obtained in advance by K-means clustering. Let V = {v1, v2, ..., vN} denote a set of N dimensionality-reduced feature vectors; the accumulated difference vector diff_k associated with cluster centre c_k can then be expressed as:
diff_k = Σ_{i: NN(v_i) = c_k} (v_i − c_k)  (1)
where i = 1, 2, ..., N; k = 1, 2, ..., K; and NN(v_i) denotes the Euclidean-distance nearest neighbour of the reduced feature vector v_i in the cluster-centre set C. Each accumulated difference vector diff_j (1 ≤ j ≤ K) is L2-normalized, and the K accumulated difference vectors are then concatenated, giving the final K × D-dimensional VLAD feature representation. Here the number of cluster centres K is set to 256, so the dimensionalities of the VLAD representations of the SPP, FC6, and FC7 features are 32,768, 65,536, and 65,536, respectively.
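The accumulated-difference encoding with per-centre L2 normalisation just described can be sketched directly. The K = 2, D = 2 toy values below are illustrative, not the patent's K = 256 setting:

```python
import math

def vlad_encode(features, centres):
    """VLAD encoding: for each cluster centre c_k, accumulate (v_i - c_k)
    over the features whose Euclidean nearest centre is c_k, L2-normalise
    each residual vector, and concatenate into one K x D vector."""
    K, D = len(centres), len(centres[0])
    diffs = [[0.0] * D for _ in range(K)]
    for v in features:
        # nearest centre by Euclidean distance, i.e. NN(v_i)
        k = min(range(K), key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(v, centres[j])))
        for d in range(D):
            diffs[k][d] += v[d] - centres[k][d]
    out = []
    for diff in diffs:                      # per-centre L2 normalisation
        norm = math.sqrt(sum(x * x for x in diff)) or 1.0
        out.extend(x / norm for x in diff)
    return out                              # length K * D

centres = [[0.0, 0.0], [10.0, 10.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
code = vlad_encode(feats, centres)
print(len(code))  # 4
```

With K = 256 centres and D = 128 or 256 reduced dimensions, the same routine yields the 32,768- and 65,536-dimensional representations stated above.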
Finally, a linear SVM classifier is trained to decide the confidence that the video is terrorism-related. Let the sample set formed by the video VLAD feature representations be X = {x1, x2, ..., xN}, and the corresponding set of video classes (terrorist, non-terrorist) be Y = {y1, y2, ..., yN}, where yi ∈ {+1, −1}. By converting the maximization of the geometric margin into a convex quadratic optimization problem, the learned separating hyperplane is:
w·x + b = 0  (2)
where w and b are the normal vector (slope) and the bias of the separating hyperplane, respectively. Maximizing the geometric margin of the separating hyperplane can be expressed as an optimization problem with inequality constraints:
max_{w,b} γ  (3)
s.t.  yi·(w·xi + b)/||w|| ≥ γ,  i = 1, 2, ..., N  (4)
where γ denotes the geometric distance from a sample point xi to the separating hyperplane. This problem is converted by the minimax method into its Lagrangian dual for optimization, and is solved by the Sequential Minimal Optimization (SMO) algorithm. Solving yields the parameters w* and b* of the optimal separating hyperplane; the violent terrorist video classification decision function can then be expressed as:
f(x) = sign(w*·x + b*)  (5)
where sign(x) denotes the sign function, and the confidence that the current VLAD feature representation is identified as violent terrorist is obtained from the decision value w*·x + b*:
P = 1/(1 + e^−(w*·x + b*))  (6)
The VLAD feature representations of the SPP, FC6, and FC7 features are each passed through a linear SVM classifier, finally outputting the discrimination confidences Ps^(fc6), Ps^(fc7), and Ps^(spp) corresponding to the three CNN semantic features FC6, FC7, and SPP.
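The decision function of formula (5) is a plain inner product plus bias. A minimal sketch with hypothetical (not trained) parameters w*, b* on a 3-dimensional toy representation:

```python
def svm_decide(x, w, b):
    """Linear SVM decision per formula (5): f(x) = sign(w* . x + b*)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical trained parameters w*, b* for a 3-dim VLAD representation;
# a real classifier would learn these from labelled video samples.
w_star, b_star = [0.5, -0.2, 0.1], -0.3
print(svm_decide([1.0, 1.0, 1.0], w_star, b_star))   # 1  (terrorist class)
print(svm_decide([0.0, 1.0, 0.0], w_star, b_star))   # -1 (non-terrorist)
```

In the method itself, one such classifier is trained per feature type (SPP, FC6, FC7), and the raw decision value is further mapped to a confidence score.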
For the scene VLAD feature representation and SVM discrimination of the CNN scene module, the input features are the three kinds of CNN scene features (SPP, FC6, FC7). The processing pipeline of this module is essentially identical to that of the semantic VLAD feature representation and SVM discrimination module; it finally outputs the discrimination confidences Pp^(fc6), Pp^(fc7), and Pp^(spp) corresponding to the three CNN scene features FC6, FC7, and SPP.
For the LSTM discrimination of the LSTM temporal module, the input features are two kinds of CNN semantic features (FC6, FC7). The two kinds of features are each fed into an LSTM discrimination model. The model comprises 2 layers of LSTM units: the first layer contains 1024 neurons and the second layer contains 512 neurons. The structure of each LSTM neuron is shown in Fig. 3. The forward pass of an LSTM neural unit can be expressed as:
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)  (7)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)  (8)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)  (9)
c_t = f_t * c_{t-1} + i_t * φ(W_c x_t + U_c h_{t-1} + b_c)  (10)
h_t = o_t * φ(c_t)  (11)
where the two nonlinear activation functions are σ(x) = 1/(1 + e^−x) and φ(x) = tanh(x). i_t, f_t, o_t, and c_t denote the states of the input gate, forget gate, output gate, and memory cell at time t, respectively. For each logic gate, W_i, W_f, W_o, and W_c denote the weight matrices applied to the input for the input gate, forget gate, output gate, and memory cell; U_i, U_f, U_o, and U_c denote the weight matrices applied to the hidden variable h_{t-1} of time t−1 for the input gate, forget gate, output gate, and memory cell; and b_i, b_f, b_o, b_c denote the corresponding bias vectors.
First, under the joint action of the input feature x_t at time t, the hidden variable h_{t-1} of time t−1, the weight matrices W and U, and the bias vectors b, the gate states i_t, f_t, and o_t of time t are generated; see formulas (7) to (9). Then, with the help of the cell state c_{t-1} of time t−1, the cell state c_t of time t is generated; see formula (10). Finally, under the action of the cell state c_t and the output-gate state o_t at time t, the hidden variable h_t of time t is generated, which in turn influences the internal changes of the LSTM neuron at time t+1; see formula (11).
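Formulas (7) to (11) can be traced with a scalar-valued sketch over a short input sequence. The weights below are illustrative stand-ins, not the model's learned 1024/512-unit layers:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One forward step of the LSTM unit, formulas (7)-(11), with scalar
    input/hidden sizes so the gate algebra stays visible. p holds the
    weights W, U and biases b for the input/forget/output/cell gates."""
    i_t = sigmoid(p['Wi'] * x_t + p['Ui'] * h_prev + p['bi'])   # (7) input gate
    f_t = sigmoid(p['Wf'] * x_t + p['Uf'] * h_prev + p['bf'])   # (8) forget gate
    o_t = sigmoid(p['Wo'] * x_t + p['Uo'] * h_prev + p['bo'])   # (9) output gate
    c_t = (f_t * c_prev                                          # (10) cell state
           + i_t * math.tanh(p['Wc'] * x_t + p['Uc'] * h_prev + p['bc']))
    h_t = o_t * math.tanh(c_t)                                   # (11) hidden state
    return h_t, c_t

# Illustrative shared weights; real values are learned during training.
params = dict(Wi=0.5, Ui=0.1, bi=0.0, Wf=0.5, Uf=0.1, bf=0.0,
              Wo=0.5, Uo=0.1, bo=0.0, Wc=0.5, Uc=0.1, bc=0.0)
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:          # a short per-frame feature sequence
    h, c = lstm_step(x, h, c, params)
print(round(h, 4))
```

In the method itself, x_t is the FC6 or FC7 feature vector of the t-th key frame, and the scalars become vectors and matrices of the stated layer sizes.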
The outputs of the second LSTM layer are connected to a fully connected classification layer, finally outputting the temporal discrimination confidences Pt^(fc6) and Pt^(fc7) corresponding to the two CNN semantic features FC6 and FC7.
Step 3: fuse the results.
To guarantee fusion performance, a hierarchical fusion strategy based on validation-set mAP values is used for result fusion: for a video to be identified, the decision scores of the three modules (CNN semantic module, CNN scene module, and LSTM temporal module) are computed separately and then fused with each module's mAP value on the validation set as its weight. In practice, score fusion is first carried out within the CNN semantic module, the CNN scene module, and the LSTM temporal module separately, followed by global score fusion:
Ps = (ωs^(fc6)·Ps^(fc6) + ωs^(fc7)·Ps^(fc7) + ωs^(spp)·Ps^(spp)) / (ωs^(fc6) + ωs^(fc7) + ωs^(spp))  (12)
Pp = (ωp^(fc6)·Pp^(fc6) + ωp^(fc7)·Pp^(fc7) + ωp^(spp)·Pp^(spp)) / (ωp^(fc6) + ωp^(fc7) + ωp^(spp))  (13)
Pt = (ωt^(fc6)·Pt^(fc6) + ωt^(fc7)·Pt^(fc7)) / (ωt^(fc6) + ωt^(fc7))  (14)
Po = (ωs·Ps + ωp·Pp + ωt·Pt) / (ωs + ωp + ωt)  (15)
where Ps, Pp, and Pt denote the decision scores of the CNN semantic module, the CNN scene module, and the LSTM temporal module, respectively; ωs, ωp, and ωt are the validation-set mAP values of the CNN semantic module, the CNN scene module, and the LSTM temporal module; Ps^(fc6), Ps^(fc7), and Ps^(spp) are the decision scores of the FC6, FC7, and SPP features in the CNN semantic module; ωs^(fc6), ωs^(fc7), and ωs^(spp) are the validation-set mAP values of the FC6, FC7, and SPP features in the CNN semantic module; Pp^(fc6), Pp^(fc7), and Pp^(spp) are the decision scores of the FC6, FC7, and SPP features in the CNN scene module; ωp^(fc6), ωp^(fc7), and ωp^(spp) are the validation-set mAP values of the FC6, FC7, and SPP features in the CNN scene module; Pt^(fc6) and Pt^(fc7) are the decision scores of the FC6 and FC7 features in the LSTM temporal module; and ωt^(fc6) and ωt^(fc7) are the validation-set mAP values of the FC6 and FC7 features in the LSTM temporal module. The final violent terrorist video detection result (confidence) Po is obtained by mAP-weighted fusion of the three modules; see formula (15).
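One consistent reading of the mAP-weighted fusion (per-module fusion first, then global fusion) is a normalized weighted average. The sketch below uses illustrative scores and mAP weights, not values from the patent's experiments:

```python
def map_weighted_fusion(scores, weights):
    """Fuse decision scores, weighting each by its validation-set mAP
    and normalising by the total weight (one reading of the hierarchy)."""
    total_w = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_w

# Module-level fusion first (illustrative scores and mAP weights) ...
P_s = map_weighted_fusion([0.9, 0.8, 0.7], [0.95, 0.94, 0.92])  # semantic: FC6/FC7/SPP
P_p = map_weighted_fusion([0.6, 0.5, 0.4], [0.90, 0.89, 0.85])  # scene:    FC6/FC7/SPP
P_t = map_weighted_fusion([0.8, 0.7], [0.93, 0.91])             # temporal: FC6/FC7
# ... then global fusion with the module-level mAP values as weights.
P_o = map_weighted_fusion([P_s, P_p, P_t], [0.96, 0.88, 0.92])
print(round(P_o, 3))
```

The normalisation keeps the fused result a confidence in [0, 1] whenever the input scores are, which matches the description of Po as a confidence.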

Claims (6)

1. A violent terrorist video detection method based on CNN and LSTM, characterized in that:
the method comprises the following steps:
Step 1: sample key frames from the video under test and extract the key-frame features;
Step 2: perform video-level representation and discrimination using the extracted key-frame features, comprising the VLAD feature representation and SVM discrimination of the CNN semantic module, the scene VLAD feature representation and SVM discrimination of the CNN scene module, and the LSTM discrimination of the LSTM temporal module;
for the semantic VLAD feature representation and SVM discrimination of the CNN semantic module, the input features are three kinds of CNN semantic features: SPP, FC6, and FC7; first, the method of principal components analysis reduces the three features to 128, 256, and 256 dimensions, respectively; then, using the VLAD method, the D-dimensional feature vectors after dimensionality reduction are projected as accumulated differences against the cluster-centre set C = {c1, c2, ..., cK} obtained in advance by K-means clustering; let V = {v1, v2, ..., vN} denote a set of N dimensionality-reduced feature vectors, then the accumulated difference vector diff_k associated with cluster centre c_k is expressed as:
diff_k = Σ_{i: NN(v_i) = c_k} (v_i − c_k)  (1)
where i = 1, 2, ..., N; k = 1, 2, ..., K; NN(v_i) denotes the Euclidean-distance nearest neighbour of the reduced feature vector v_i in the cluster-centre set C; each accumulated difference vector diff_j, 1 ≤ j ≤ K, is L2-normalized, and the K accumulated difference vectors are then concatenated, giving the final K × D-dimensional VLAD feature representation; here the number of cluster centres K is set to 256, so the dimensionalities of the VLAD representations of the SPP, FC6, and FC7 features are 32,768, 65,536, and 65,536, respectively;
finally, a linear SVM classifier is trained to decide the confidence that the video is terrorism-related;
Step 3: fuse the results, using a hierarchical fusion strategy based on validation-set mAP values, i.e., for a video to be identified, the decision scores of the CNN semantic module, the CNN scene module, and the LSTM temporal module are computed separately and then fused with each module's mAP value on the validation set as its weight.
2. The violent terrorist video detection method based on CNN and LSTM according to claim 1, characterized in that: in Step 1, the key-frame sampling interval is 1 second; the key-frame features comprise CNN semantic features and CNN scene features, and both the CNN semantic features and the CNN scene features comprise three kinds of features: the FC6 feature, the FC7 feature, and the SPP feature.
3. The violent terrorist video detection method based on CNN and LSTM according to claim 1 or 2, characterized in that: the SPP feature is extracted from the Conv5 layer; the Conv5 feature maps are first partitioned spatially into 1 × 1, 2 × 2, and 3 × 3 grids, and max pooling within each grid cell then yields 14 vectors of 256 dimensions; each dimension of each vector corresponds to some explicit or implicit semantic concept, and together these vectors constitute the SPP feature.
4. The violent terrorist video detection method based on CNN and LSTM according to claim 1, characterized in that: the training of a linear SVM classifier to decide the confidence that the video is terrorism-related is specifically as follows: let the sample set formed by the video VLAD feature representations be X = {x1, x2, ..., xN}, and the corresponding set of video classes be Y = {y1, y2, ..., yN}, where yi ∈ {+1, −1}; by converting the maximization of the geometric margin into a convex quadratic optimization problem, the learned separating hyperplane is:
w·x + b = 0  (2)
where w and b are the normal vector (slope) and the bias of the separating hyperplane, respectively; maximizing the geometric margin of the separating hyperplane is expressed as an optimization problem with inequality constraints:
max_{w,b} γ  (3)
s.t.  yi·(w·xi + b)/||w|| ≥ γ,  i = 1, 2, ..., N  (4)
where γ denotes the geometric distance from a sample point xi to the separating hyperplane; the problem is converted by the minimax method into its Lagrangian dual for optimization and solved by the sequential minimal optimization algorithm; solving yields the parameters w* and b* of the optimal separating hyperplane, and the violent terrorist video classification decision function is then expressed as:
f(x) = sign(w*·x + b*)  (5)
where sign(x) denotes the sign function, and the confidence that the current VLAD feature representation is identified as violent terrorist is obtained from the decision value w*·x + b*:
P = 1/(1 + e^−(w*·x + b*))  (6)
the VLAD feature representations of the SPP, FC6, and FC7 features are each passed through a linear SVM classifier, finally outputting the discrimination confidences Ps^(fc6), Ps^(fc7), and Ps^(spp) corresponding to the three CNN semantic features FC6, FC7, and SPP.
5. a kind of sudden and violent probably video detecting method based on CNN and LSTM according to claim 1, it is characterised in that: second The LSTM of LSTM tfi module described in step differentiates that input feature vector is two kinds of CNN semantic features FC6, FC7;First by two classes Feature is separately input in LSTM discrimination model, which includes 2 layers of LSTM unit, and first layer includes 1024 neurons, the Two layers include 512 neurons;The forward conduction procedural representation of each LSTM neural unit are as follows:
it=σ (Wixt+Uiht-1+bi) (7)
ft=σ (Wfxt+Ufht-1+bf) (8)
ot=σ (Woxt+Uoht-1+bo) (9)
ct=ft*ct-1+it*φ(Wcxt+Ucht-1+bc) (10)
ht=ot*φ(ct) (11)
Wherein, two kinds of nonlinear activation functions are respectivelyWith φ (xt)=tanh (xt);it, ft, otAnd ct Respectively represent t moment input gate, Memory-Gate, quantity of state corresponding to out gate and core door;For each logic gate, Wi, Wf, WoAnd WcRespectively represent input gate, Memory-Gate, transferring weights matrix corresponding to out gate and core door;Ui, Uf, UoAnd Uc Respectively represent input gate, Memory-Gate, t-1 moment hidden layer variable h corresponding to out gate and core doort-1Corresponding weight turns Move matrix, bi,bf,bo,bcThen represent input gate, Memory-Gate, bias vector corresponding to out gate and core door;htIt is hidden for t moment Hide layer variable;
The output of the second LSTM layer is connected to a fully connected classifier, finally outputting the temporal discrimination confidences P_t^(fc6) and P_t^(fc7) corresponding to the two CNN semantic features FC6 and FC7.
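The forward pass in equations (7)–(11) can be sketched directly. This is a minimal single-step implementation with toy dimensions rather than the patent's 1024- and 512-unit layers; the weights are random placeholders, not trained parameters.

```python
import numpy as np

def sigmoid(x):
    """σ(x) = 1 / (1 + e^(−x))."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM forward step implementing equations (7)-(11).
    W, U, b are dicts keyed by gate: 'i' (input), 'f' (forget),
    'o' (output), 'c' (cell candidate)."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])               # (7)
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])               # (8)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])               # (9)
    c_t = f_t * c_prev + i_t * np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # (10)
    h_t = o_t * np.tanh(c_t)                                             # (11)
    return h_t, c_t

# Toy dimensions; in the patent the first layer has 1024 units over
# 4096-d FC6/FC7 inputs and the second layer 512 units.
rng = np.random.default_rng(1)
d_in, d_hid = 8, 4
W = {k: rng.standard_normal((d_hid, d_in)) * 0.1 for k in 'ifoc'}
U = {k: rng.standard_normal((d_hid, d_hid)) * 0.1 for k in 'ifoc'}
b = {k: np.zeros(d_hid) for k in 'ifoc'}
h = c = np.zeros(d_hid)
for t in range(5):  # run over a short frame-feature sequence
    h, c = lstm_step(rng.standard_normal(d_in), h, c, W, U, b)
```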
6. The CNN- and LSTM-based violent-terror video detection method according to claim 1, characterized in that: in the third step, result fusion, score fusion is first carried out separately within the CNN semantic module, the CNN scene module, and the LSTM temporal module, and then global score fusion is performed:
Here, P_s, P_p, and P_t denote the decision scores of the CNN semantic module, the CNN scene module, and the LSTM temporal module, respectively, and ω_s, ω_p, and ω_t are the validation-set mAP values of those three modules. P_s^(fc6), P_s^(fc7), and P_s^(spp) are the decision scores of the FC6, FC7, and SPP features in the CNN semantic module, and ω_s^(fc6), ω_s^(fc7), and ω_s^(spp) are their validation-set mAP values. P_p^(fc6), P_p^(fc7), and P_p^(spp) are the decision scores of the FC6, FC7, and SPP features in the CNN scene module, and ω_p^(fc6), ω_p^(fc7), and ω_p^(spp) are their validation-set mAP values. P_t^(fc6) and P_t^(fc7) are the decision scores of the FC6 and FC7 features in the LSTM temporal module, and ω_t^(fc6) and ω_t^(fc7) are their validation-set mAP values. The final violent-terror video detection result P_o is obtained by weighting the three modules according to these mAP values.
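The mAP-weighted fusion described above can be sketched as a weighted mean normalised by the weight sum. Since the patent's exact fusion formulas appear only as images in the original, that normalisation is an assumption implied by the description, and all score and mAP numbers below are hypothetical.

```python
def weighted_fusion(scores, weights):
    """mAP-weighted average of per-feature (or per-module) decision
    scores: sum(w_k * P_k) / sum(w_k)."""
    total_w = sum(weights[k] for k in scores)
    return sum(weights[k] * scores[k] for k in scores) / total_w

# Hypothetical per-feature decision scores and validation-set mAP weights.
P_s = weighted_fusion({"fc6": 0.82, "fc7": 0.78, "spp": 0.70},
                      {"fc6": 0.91, "fc7": 0.89, "spp": 0.85})  # semantic module
P_p = weighted_fusion({"fc6": 0.60, "fc7": 0.64, "spp": 0.58},
                      {"fc6": 0.80, "fc7": 0.82, "spp": 0.76})  # scene module
P_t = weighted_fusion({"fc6": 0.75, "fc7": 0.71},
                      {"fc6": 0.88, "fc7": 0.86})               # temporal module

# Global fusion, weighted by each module's validation-set mAP.
P_o = weighted_fusion({"s": P_s, "p": P_p, "t": P_t},
                      {"s": 0.90, "p": 0.80, "t": 0.87})
```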
CN201610168334.9A 2016-03-23 2016-03-23 A violent-terror video detection method based on CNN and LSTM Active CN105844239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610168334.9A CN105844239B (en) 2016-03-23 2016-03-23 A violent-terror video detection method based on CNN and LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610168334.9A CN105844239B (en) 2016-03-23 2016-03-23 A violent-terror video detection method based on CNN and LSTM

Publications (2)

Publication Number Publication Date
CN105844239A CN105844239A (en) 2016-08-10
CN105844239B true CN105844239B (en) 2019-03-29

Family

ID=56584468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610168334.9A Active CN105844239B (en) 2016-03-23 2016-03-23 A violent-terror video detection method based on CNN and LSTM

Country Status (1)

Country Link
CN (1) CN105844239B (en)

Families Citing this family (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106411597A (en) * 2016-10-14 2017-02-15 广东工业大学 Network traffic abnormality detection method and system
CN106997371B (en) * 2016-10-28 2020-06-23 华数传媒网络有限公司 Method for constructing single-user intelligent map
CN106548208B (en) * 2016-10-28 2019-05-28 杭州米绘科技有限公司 A kind of quick, intelligent stylizing method of photograph image
US20180129937A1 (en) * 2016-11-04 2018-05-10 Salesforce.Com, Inc. Quasi-recurrent neural network
CN108073933B (en) 2016-11-08 2021-05-25 杭州海康威视数字技术股份有限公司 Target detection method and device
CN106782602B (en) * 2016-12-01 2020-03-17 南京邮电大学 Speech emotion recognition method based on deep neural network
CN106599198B (en) * 2016-12-14 2021-04-06 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method of multi-cascade junction cyclic neural network
CN106780491B (en) * 2017-01-23 2020-03-17 天津大学 Initial contour generation method adopted in segmentation of CT pelvic image by GVF method
CN106846346B (en) * 2017-01-23 2019-12-20 天津大学 Method for rapidly extracting pelvis outline of sequence CT image based on key frame mark
CN108229522B (en) * 2017-03-07 2020-07-17 北京市商汤科技开发有限公司 Neural network training method, attribute detection device and electronic equipment
US10152627B2 (en) 2017-03-20 2018-12-11 Microsoft Technology Licensing, Llc Feature flow for video recognition
CN107016356A (en) * 2017-03-21 2017-08-04 乐蜜科技有限公司 Certain content recognition methods, device and electronic equipment
CN107038221B (en) * 2017-03-22 2020-11-17 杭州电子科技大学 Video content description method based on semantic information guidance
CN106951783B (en) * 2017-03-31 2021-06-01 国家电网公司 Disguised intrusion detection method and device based on deep neural network
CN107256221B (en) * 2017-04-26 2020-11-03 苏州大学 Video description method based on multi-feature fusion
CN107092254B (en) * 2017-04-27 2019-11-29 北京航空航天大学 A kind of design method of the Household floor-sweeping machine device people based on depth enhancing study
CN107092894A (en) * 2017-04-28 2017-08-25 孙恩泽 A kind of motor behavior recognition methods based on LSTM models
CN107392097B (en) * 2017-06-15 2020-07-07 中山大学 Three-dimensional human body joint point positioning method of monocular color video
CN107341462A (en) * 2017-06-28 2017-11-10 电子科技大学 A kind of video classification methods based on notice mechanism
CN107274378B (en) * 2017-07-25 2020-04-03 江西理工大学 Image fuzzy type identification and parameter setting method based on fusion memory CNN
CN107480726A (en) * 2017-08-25 2017-12-15 电子科技大学 A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon
CN109460812B (en) * 2017-09-06 2021-09-14 富士通株式会社 Intermediate information analysis device, optimization device, and feature visualization device for neural network
CN107818084B (en) * 2017-10-11 2021-03-09 北京众荟信息技术股份有限公司 Emotion analysis method fused with comment matching diagram
CN107895172A (en) * 2017-11-03 2018-04-10 北京奇虎科技有限公司 Utilize the method, apparatus and computing device of image information detection anomalous video file
CN107885853A (en) * 2017-11-14 2018-04-06 同济大学 A kind of combined type file classification method based on deep learning
CN108053410B (en) * 2017-12-11 2020-10-20 厦门美图之家科技有限公司 Moving object segmentation method and device
CN108009539B (en) * 2017-12-26 2021-11-02 中山大学 Novel text recognition method based on counting focusing model
CN108289248B (en) * 2018-01-18 2020-05-15 福州瑞芯微电子股份有限公司 Deep learning video decoding method and device based on content prediction
CN108419091A (en) * 2018-03-02 2018-08-17 北京未来媒体科技股份有限公司 A kind of verifying video content method and device based on machine learning
CN108228915B (en) * 2018-03-29 2021-10-26 华南理工大学 Video retrieval method based on deep learning
CN110555488A (en) * 2018-06-04 2019-12-10 北京京东尚科信息技术有限公司 Image sequence auditing method and system, electronic equipment and storage medium
CN110166826B (en) * 2018-11-21 2021-10-08 腾讯科技(深圳)有限公司 Video scene recognition method and device, storage medium and computer equipment
CN111368071A (en) * 2018-12-07 2020-07-03 北京奇虎科技有限公司 Video detection method and device based on video related text and electronic equipment
CN111291602A (en) * 2018-12-07 2020-06-16 北京奇虎科技有限公司 Video detection method and device, electronic equipment and computer readable storage medium
CN109858540B (en) * 2019-01-24 2023-07-28 青岛中科智康医疗科技有限公司 Medical image recognition system and method based on multi-mode fusion
CN109817338A (en) * 2019-02-13 2019-05-28 北京大学第三医院(北京大学第三临床医学院) A kind of chronic disease aggravates risk assessment and warning system
CN109961041B (en) * 2019-03-21 2021-03-23 腾讯科技(深圳)有限公司 Video identification method and device and storage medium
CN110046226B (en) * 2019-04-17 2021-09-24 桂林电子科技大学 Image description method based on distributed word vector CNN-RNN network
CN110647905B (en) * 2019-08-02 2022-05-13 杭州电子科技大学 Method for identifying terrorist-related scene based on pseudo brain network model
CN110929762B (en) * 2019-10-30 2023-05-12 中科南京人工智能创新研究院 Limb language detection and behavior analysis method and system based on deep learning
CN111222320B (en) * 2019-12-17 2020-10-20 共道网络科技有限公司 Character prediction model training method and device
CN113010735B (en) * 2019-12-20 2024-03-08 北京金山云网络技术有限公司 Video classification method and device, electronic equipment and storage medium
CN112115984A (en) * 2020-08-28 2020-12-22 安徽农业大学 Tea garden abnormal data correction method and system based on deep learning and storage medium
CN113095183A (en) * 2021-03-31 2021-07-09 西北工业大学 Micro-expression detection method based on deep neural network
CN115089206B (en) * 2022-05-09 2023-02-10 吴先洪 Method for predicting heart sound signal and heart auscultation device using same

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218608A (en) * 2013-04-19 2013-07-24 中国科学院自动化研究所 Network violent video identification method
CN103473555A (en) * 2013-08-26 2013-12-25 中国科学院自动化研究所 Horrible video scene recognition method based on multi-view and multi-instance learning
CN105005772A (en) * 2015-07-20 2015-10-28 北京大学 Video scene detection method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218608A (en) * 2013-04-19 2013-07-24 中国科学院自动化研究所 Network violent video identification method
CN103473555A (en) * 2013-08-26 2013-12-25 中国科学院自动化研究所 Horrible video scene recognition method based on multi-view and multi-instance learning
CN105005772A (en) * 2015-07-20 2015-10-28 北京大学 Video scene detection method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A Discriminative CNN Video Representation for Event Detection";Zhangwen Xu 等;《IEEE Conference on Computer Vision and Pattern Recognition(CVPR)》;20150612;第一至最后一段
"Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classfication";Zuxuan Wu 等;《Proceedings of the 23rd ACM international conference on Multimedia》;20151030;第461-470页
SPP_mask阅读报告;yqwang2006;《百度文库》;20150922;正文1-5页

Also Published As

Publication number Publication date
CN105844239A (en) 2016-08-10

Similar Documents

Publication Publication Date Title
CN105844239B (en) A violent-terror video detection method based on CNN and LSTM
Gao et al. Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework
CN110097000A Video behavior recognition method based on local feature aggregation descriptors and a temporal relation network
Deng et al. Pcgan: A noise robust conditional generative adversarial network for one shot learning
Wang et al. Category-specific semantic coherency learning for fine-grained image recognition
Li et al. MsKAT: Multi-scale knowledge-aware transformer for vehicle re-identification
CN110765285A (en) Multimedia information content control method and system based on visual characteristics
Yuan et al. Few-shot scene classification with multi-attention deepemd network in remote sensing
Gayathri et al. Improved fuzzy-based SVM classification system using feature extraction for video indexing and retrieval
Pan et al. Hybrid dilated faster RCNN for object detection
Huang et al. Pedestrian detection using RetinaNet with multi-branch structure and double pooling attention mechanism
Lei et al. Reducing background induced domain shift for adaptive person re-identification
Wei et al. Reinforced domain adaptation with attention and adversarial learning for unsupervised person Re-ID
Zhang [Retracted] Sports Action Recognition Based on Particle Swarm Optimization Neural Networks
Du An anomaly detection method using deep convolution neural network for vision image of robot
Liang et al. A new object detection method for object deviating from center or multi object crowding
CN116958740A (en) Zero sample target detection method based on semantic perception and self-adaptive contrast learning
CN110363164A A unified method for video analysis based on LSTM temporal consistency
Xia et al. Semantic features and high-order physical features fusion for action recognition
Zhou Deep learning based people detection, tracking and re-identification in intelligent video surveillance system
Zou et al. Research on human movement target recognition algorithm in complex traffic environment
XiaoFan et al. Introduce GIoU into RFB net to optimize object detection bounding box
Xiong et al. Domain adaptation of object detector using scissor-like networks
CN109857884B (en) Automatic image semantic description method
Luo et al. Cross-Domain Person Re-Identification Based on Feature Fusion

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant