CN105701480A - Video semantic analysis method - Google Patents


Info

Publication number
CN105701480A
Authority
CN
China
Prior art keywords
layer
video
decoder
model
training set
Prior art date
Legal status: Granted
Application number
CN201610107770.5A
Other languages
Chinese (zh)
Other versions
CN105701480B (en)
Inventor
詹永照
詹智财
张建明
彭长生
Current Assignee
JIANGSU KING INTELLIGENT SYSTEM CO Ltd
Jiangsu University
Original Assignee
JIANGSU KING INTELLIGENT SYSTEM CO Ltd
Jiangsu University
Priority date
Filing date
Publication date
Application filed by JIANGSU KING INTELLIGENT SYSTEM CO Ltd and Jiangsu University
Priority to CN201610107770.5A
Publication of CN105701480A
Application granted
Publication of CN105701480B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10: Character recognition
    • G06V 30/26: Techniques for post-processing, e.g. correcting the recognition result
    • G06V 30/262: Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V 30/274: Syntactic or semantic context, e.g. balancing

Abstract

The present invention provides a video semantic analysis method. The method comprises the following steps: S1, preprocessing a video training set and constructing a sparse linear decoder; S2, adding a topological-property constraint to build a topological linear decoder, and dividing the frames of the video training set into image blocks to train the topological linear decoder; S3, taking the parameters of the trained topological linear decoder as the initial parameters of a convolutional layer in a convolutional neural network; and S4, building a key frame set from the video training set to fine-tune the convolutional neural network under multi-fold cross validation, building a generic feature extractor for video data, and feeding the features extracted on the training set and a test set into a support vector machine for video semantic classification. With the model training method provided by the invention, the model can cope with video data samples of varied content, which improves its accuracy and robustness.

Description

Video semantic analysis method
Technical field
The present invention relates to the technical field of video semantic detection, and in particular to a video semantic analysis method.
Background art
To detect video semantic categories, convolutional neural network models have been used to extract features from the key frame set of a video. Experiments show that, unlike hand-designed feature extraction, a convolutional neural network learns distributed features from the data itself; features obtained in this data-driven way adapt to a wider range of domains. However, the convolutional neural network is a supervised learning model: training it requires not only a training data set but also the corresponding labels, and convergence requires many iterations over a large number of samples. For tasks such as classification and detection on massive video data, a label for every video cannot be obtained.
For video data, convolutional neural network models with supervised-training characteristics have been adopted, and earlier work on unsupervised pre-training followed by supervised training alleviates the slow convergence of the traditional convolutional neural network. Compared with image data, however, video data exhibits rotation, scaling and translation of the same object across frames, so the feature extractor must capture features with more complex invariance. How to extract features with stronger invariance therefore remains a problem to be solved.
Summary of the invention
The object of the present invention is to provide a video semantic analysis method that combines the advantages of unsupervised pre-training with topological properties, so that the convolutional neural network needs fewer labeled samples than before and converges to a stable value faster. Owing to the introduced topological property, the model can extract features that better cope with target translation, object scaling and object rotation, improving the accuracy and robustness of the model in semantic analysis detection.
In order to solve the above technical problem, the concrete technical scheme adopted by the present invention is as follows:
A video semantic analysis method, characterized in that it comprises the following steps:
S1: preprocess the video training set and build a sparse linear decoder;
S2: add a topological-property constraint to the sparse linear decoder to obtain a topological linear decoder, divide the frames of the video training set into image blocks to build an image-block training set, and train the topological linear decoder on it;
S3: use the weight parameters of the trained topological linear decoder as the initial parameters of a convolutional layer in a convolutional neural network;
S4: adopt multi-fold cross validation, build a key frame set from the video training set to fine-tune the convolutional neural network, thereby obtaining a generic feature extractor for the video data, and finally feed the features extracted on the training set and the test set into a support vector machine for video semantic classification.
To construct the sparse linear decoder model, a linear decoder model is first defined, weight-decay and sparse regularization terms are then introduced into it, and the relative importance of each regularization term within the overall objective function is adjusted through its coefficient. The concrete process is as follows:
Process S11: let m denote the number of videos in the video training set, let the mf-th video contain $mF^{(mf)}$ image frames in total, and let its label be $y^{(mf)}$. First extract all frames of the m videos and resize each frame to n × n × 3, where n is the width and height of each frame and 3 indicates the RGB color standard. Set up a sliding window of size k × k with stride p; sliding this window, one frame yields $((n-k)/p + 1)^2$ image blocks, and the whole video training set yields $M = \sum_{mf} mF^{(mf)} \cdot ((n-k)/p + 1)^2$ image blocks in total. Flatten each image block into a vector x of length k × k × 3, shuffle all the image blocks, and divide them into nbS = M/bS batches of bS training samples each; the resulting data set serves as the training set for the topological linear decoder (a sketch of this block extraction is given below);
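For illustration only, the block extraction of process S11 can be sketched in Python as follows; this is a non-authoritative sketch, and the names frame, frames, k, p and bS are ours, not the patent's:

```python
import numpy as np

def extract_blocks(frame, k, p):
    """Slide a k x k window with stride p over an n x n x 3 frame and
    flatten each block to a vector of length k*k*3 (process S11)."""
    n = frame.shape[0]
    blocks = [frame[r:r + k, c:c + k, :].reshape(-1)
              for r in range(0, n - k + 1, p)
              for c in range(0, n - k + 1, p)]
    return np.stack(blocks)  # ((n - k)/p + 1)^2 rows per frame

def make_training_batches(frames, k, p, bS, seed=0):
    """Pool the blocks of all frames, shuffle them, and split them into
    nbS = M/bS mini-batches of bS samples each."""
    X = np.concatenate([extract_blocks(f, k, p) for f in frames])
    np.random.default_rng(seed).shuffle(X)   # out-of-order (shuffled) blocks
    nbS = X.shape[0] // bS
    return X[:nbS * bS].reshape(nbS, bS, -1)
```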
Process S12: first define the linear decoder model, consisting of a first layer as the input layer, a second layer as the hidden layer and a third layer as the output layer, with nL1, nL2 and nL3 neurons per layer respectively, where nL1 = nL3; the weight parameters between the first and second layers and between the second and third layers are $W^{(1)}$ and $W^{(2)}$ respectively, $w_{ji}^{(nl)}$ denotes the weight connecting the j-th neuron of the (nl+1)-th layer with the i-th neuron of the nl-th layer, and $b_j^{(nl)}$ the bias of the j-th neuron of the (nl+1)-th layer, nl ∈ {1, 2}; the activation function of the second-layer neurons is formula (1):
$$a_j^{(2)} = f^{(2)}\left(z_j^{(2)}\right) = \frac{e^{z_j^{(2)}} - e^{-z_j^{(2)}}}{e^{z_j^{(2)}} + e^{-z_j^{(2)}}} \qquad (1)$$
where $a_j^{(2)}$ is the output of the j-th second-layer neuron and $z_j^{(2)}$ is its input, given by formula (2):
$$z_j^{(2)} = \sum_{i=1}^{nL1} w_{ji}^{(1)} a_i^{(1)} + b_j^{(1)} \qquad (2)$$
Here $a_i^{(1)}$ is the output of the i-th first-layer neuron, i.e. an element of the image block vector, so that $a^{(1)} = x$; the activation function of the third-layer neurons is formula (3):
$$\hat{x} = a_j^{(3)} = f^{(3)}\left(z_j^{(3)}\right) = z_j^{(3)} \qquad (3)$$
that is, each third-layer neuron is a linear combination of the second-layer neurons, as in formula (4):
$$z_j^{(3)} = \sum_{i=1}^{nL2} w_{ji}^{(2)} a_i^{(2)} + b_j^{(2)} \qquad (4)$$
This yields the autoencoder objective function value of formula (5):
$$J(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2 \qquad (5)$$
where $\hat{x}$ is the output vector obtained when x is input into this model;
Process S13: after the most basic linear decoder is established, a weight attenuation term is added to the objective function to prevent the over-fitting caused by weight explosion, giving the objective function of formula (6):
$$J_{w\text{-}decay}(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2 + \frac{\lambda}{2} \sum_{l=1}^{Nl-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(w_{ji}^{(l)}\right)^2 \qquad (6)$$
where Nl is the number of layers of the model, here Nl = 3; $s_l$ is the number of neurons in the l-th layer and $s_{l+1}$ the number in the (l+1)-th layer; λ balances the importance of the weight attenuation term within the overall objective function;
Process S14: on the basis of S13, a sparsity characteristic is introduced into the model: for the hidden-layer neurons, most neurons should reach the inhibited state, with activation close to −1, for each input sample, and only a small fraction should have activation close to 1, so that the sparse structure of the input data is extracted; a sparse regularization term is added to the objective function, giving:
$$J_{sparse}(W, b; x, \hat{x}) = J_{w\text{-}decay}(W, b; x, \hat{x}) + \beta \sum_{j=1}^{s_2} L1\left(\rho \,\middle\|\, \hat{\rho}_j\right) \qquad (7)$$
That is, the sparse regularization term keeps the average activation $\hat{\rho}_j$ of each hidden-layer neuron below a certain value, where the average activation of each neuron is:
$$\hat{\rho}_j = \frac{1}{bS} \sum_{i=1}^{bS} \left[a_j^{(2)}\left(x^{(i)}\right)\right] \qquad (8)$$
Formula (8) is the mean activation of each hidden-layer neuron over the input samples $x^{(i)}$; ρ is the sparsity coefficient controlling the mean activation level of the hidden layer; the activation of the hidden layer is constrained towards this set value with the L1 regularization formula:
$$L1\left(\rho \,\middle\|\, \hat{\rho}_j\right) = \left\|\hat{\rho}_j - \rho\right\| \qquad (9).$$
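The objective of formulas (5) to (9) can be written out directly. The following NumPy sketch is our own illustration (with tanh for formula (1) and a linear output for formula (3)); it computes $J_{sparse}$ for one mini-batch:

```python
import numpy as np

def sparse_decoder_loss(W1, b1, W2, b2, X, lam, beta, rho):
    """J_sparse of formula (7) on a (bS, nL1) mini-batch X:
    reconstruction error (5) + weight decay (6) + L1 sparsity (7)-(9)."""
    bS = X.shape[0]
    A2 = np.tanh(X @ W1.T + b1)        # hidden activations, formulas (1)-(2)
    X_hat = A2 @ W2.T + b2             # linear reconstruction, formulas (3)-(4)
    J = np.sum((X_hat - X) ** 2) / (2 * bS)                # formula (5)
    J += lam / 2 * (np.sum(W1 ** 2) + np.sum(W2 ** 2))     # formula (6)
    rho_hat = A2.mean(axis=0)                              # formula (8)
    J += beta * np.sum(np.abs(rho_hat - rho))              # formulas (7), (9)
    return J
```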
The topological linear decoder is built on the basis of the sparse linear decoder: by imposing topological constraints on the activations of the hidden-layer neurons, the model becomes a topological linear decoder. The hidden-layer neurons are grouped in order so that neurons within the same group have similar activation levels while neurons of different groups are mutually independent, which allows the model to learn the topological properties of the data. The process is as follows:
Process S21: after process S14, a sparse linear decoder is obtained; process S14, building on process S13, used the L1 regularization formula to constrain the average activations of all hidden-layer neurons near a set value. The topology is introduced by first grouping all neurons of the hidden layer: the second layer has nL2 neurons, which are arranged in a $\sqrt{nL2} \times \sqrt{nL2}$ matrix, denoted the topological group-selection matrix T. In this matrix the activation at any point is influenced by the neurons within an sk × sk region around that point, i.e. the neurons within the surrounding sk × sk region form one group; since the hidden layer has nL2 neurons in all, there are nL2 groups;
The sum of squares of the activations of all neurons in a group serves as the target value of that group, which gives the objective function of the topological linear decoder:
$$J_{topo}(W, b; x, \hat{x}) = J_{sparse}(W, b; x, \hat{x}) + \frac{\gamma}{bS} \sum \sqrt{V S^{.2} + \epsilon} \qquad (10)$$
where V is the grouping matrix of size nL2 × nL2, built as follows: for each group, i.e. each row vector of V, first define a marker matrix $F^{(t)}$ of the same size as the topological group-selection matrix T;
$$F_{ij}^{(t)} = \begin{cases} 1 & i, j \in Sg^{(t)} \\ 0 & \text{otherwise} \end{cases}, \quad t \in [0, nL2 - 1] \qquad (11)$$
$F_{ij}^{(t)}$ is the value in row i, column j of the marker matrix of the t-th group, and $Sg^{(t)}$ is the topological selection region of the t-th group:
$$Sg^{(t)} = [rSt : rSe,\; cSt : cSe], \quad \begin{aligned} rSt &= \bmod\left(\lfloor t/\sqrt{nL2} \rfloor + 0,\; \sqrt{nL2}\right) \\ rSe &= \bmod\left(\lfloor t/\sqrt{nL2} \rfloor + sk,\; \sqrt{nL2}\right) \\ cSt &= \bmod\left(\bmod(t, \sqrt{nL2}) + 0,\; \sqrt{nL2}\right) \\ cSe &= \bmod\left(\bmod(t, \sqrt{nL2}) + sk,\; \sqrt{nL2}\right) \end{aligned} \qquad (12)$$
where mod is the modulo function. Hence for the grouping matrix:
$$V\left(t,\; i \times \sqrt{nL2} + j\right) = F_{ij}^{(t)}, \quad t \in [0, nL2 - 1],\; i, j \in \left[0, \sqrt{nL2} - 1\right] \qquad (13)$$
That is, V(r, c) = 1 indicates that the c-th neuron belongs to the r-th group; in formula (10), S is the nL2 × bS matrix formed by the hidden-layer activations, $S^{.2}$ denotes its element-wise square, ε is a smoothing parameter preventing the root of a singular (zero) value, and γ balances the importance of the topological regularization term within the overall objective function;
Process S22: the image blocks of all video frames in the training set obtained by process S11 form an nP × vS matrix, where vS is the number of neurons in the input layer of the topological sparse linear decoder, i.e. vS = nL1 = k × k × 3, the number of pixels in an RGB three-channel sliding window; the middle layer of the model is the hidden layer, and after the model is trained the output values of this layer serve as the feature values corresponding to an input; because the nP × vS matrix is too large, it is first divided into batches of size bS × vS, and the BP algorithm trains one batch at a time; one pass over all the training data constitutes one epoch, and several epochs are trained until the model converges.
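The grouping matrix V of formulas (11) to (13) and the topographic penalty of formula (10) can be illustrated as follows; this is a sketch under the assumption that nL2 is a perfect square, and the function and variable names are ours:

```python
import numpy as np

def grouping_matrix(nL2, sk):
    """Build the nL2 x nL2 grouping matrix V of formulas (11)-(13):
    hidden units lie on a sqrt(nL2) x sqrt(nL2) grid, and group t selects
    the sk x sk neighborhood (with wrap-around, formula (12)) at unit t."""
    g = int(round(nL2 ** 0.5))
    assert g * g == nL2, "nL2 must be a perfect square"
    V = np.zeros((nL2, nL2))
    for t in range(nL2):
        r0, c0 = t // g, t % g
        for dr in range(sk):
            for dc in range(sk):
                i, j = (r0 + dr) % g, (c0 + dc) % g  # mod = wrap-around
                V[t, i * g + j] = 1.0                # formula (13)
    return V

def topo_penalty(V, S, gamma, eps, bS):
    """Topographic term of formula (10); S is the nL2 x bS matrix of
    hidden activations and S**2 its element-wise square."""
    return gamma / bS * np.sum(np.sqrt(V @ (S ** 2) + eps))
```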
The weight parameters of the trained topological linear decoder serve as the initial parameters of the convolutional neural network, which is subsequently fine-tuned with a small number of labeled samples to obtain better parameters. The concrete process is as follows:
Process S31: the input layer of the convolutional neural network model is a video frame image, i.e. n × n × 3; a convolutional layer contains multiple feature maps, each feature map shares one convolution kernel, and the receptive field size of each kernel is k × k × 3; the convolutional layer is fully connected to the preceding layer, i.e. each feature map of the convolutional layer is associated with every feature map of the preceding layer:
$$x_j^{l} = f\left(\sum_{i \in M_j} x_i^{l-1} * w_{ij}^{l} + b_j^{l}\right) \qquad (14)$$
where $x_j^l$ is the j-th feature map of layer l; $x_i^{l-1}$ is the i-th feature map of layer l−1; $w_{ij}^l$ is the connection weight between the j-th feature map of layer l and the i-th feature map of layer l−1; and $b_j^l$ is the bias of the j-th feature map of layer l;
Process S32: the structure of the topological linear decoder trained in process S22 is nL1, nL2, nL3, and each neuron of its hidden layer is fully connected to each neuron of the input layer, as shown in formulas (2) and (3); the weights between one hidden unit and the input layer are assigned to the pixels of the preceding-layer receptive field corresponding to each feature map in the convolutional layer of the convolutional neural network, i.e. they become the weight values on the convolution kernel.
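Process S32 amounts to a reshape: each hidden unit's weight vector becomes one convolution kernel. A minimal sketch, assuming W1 holds the decoder's input-to-hidden weights as an (nL2, k*k*3) array as in the sketches above:

```python
import numpy as np

def decoder_to_conv_kernels(W1, k):
    """Process S32: the nL2 hidden units of the trained topological linear
    decoder initialize nL2 feature maps; each unit's length-k*k*3 weight
    vector becomes the k x k x 3 kernel of one feature map."""
    return W1.reshape(W1.shape[0], k, k, 3)
```

In the embodiment below, with k = 7 and 400 hidden units, this yields 400 kernels of size 7 × 7 × 3 for the first convolutional layer.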
The generic video feature extractor is obtained by fine-tuning the convolutional neural network model, under multi-fold cross validation, on a new training set formed from multiple key frames of the video training set. After this generic feature extractor is obtained, the features extracted on the training set and the test set are fed into a support vector machine for video semantic classification. The process is realized as follows:
Process S41: adopt multi-fold cross validation to divide the video set into a training set and a test set; the preceding processes are carried out on all video frames of the training set. Key frames are first chosen from every video of the training set at intervals of sF frames and used as the key frames of that video: if the mf-th video has $mF^{(mf)}$ frames in total, the frames 1 : sF : $mF^{(mf)}$ are marked as the image key frames of that video and labeled with its video class $y^{(mf)}$; all key frames of the training-set videos then form the data set for fine-tuning the convolutional neural network model;
Process S42: with Softmax as the top-layer model of the convolutional neural network model, fine-tune the model with the BP algorithm until convergence; then remove the top Softmax layer to obtain the generic feature extractor for this video data set, and let nLo be the number of output-layer units of the convolutional neural network;
Process S43: extract convolutional neural network features on the key frames of the training set and the test set obtained in process S41. If the mf-th video has $mKF^{(mf)}$ key frames, each video yields an mKF × nLo feature matrix whose rows index the key frames and whose columns hold the features extracted on the corresponding key frame. Divide the rows of this feature matrix into pS parts, each an (mKF/pS) × nLo matrix, i.e. a matrix of mKF/pS rows and nLo columns; average each part along its rows to obtain a feature vector of length nLo on that part, and concatenate the vectors of the different parts end to end to obtain a feature vector of length nLo × pS as the feature vector of the video;
Process S44: the preceding processes yield the feature matrices and label matrices of the training set and the test set; these features are put into the support vector machine model for the final semantic concept prediction.
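Processes S43 and S44 can be illustrated as follows; this is a sketch in which the feature extractor is assumed to return one nLo-vector per key frame, and scikit-learn's SVC stands in for the unspecified support vector machine:

```python
import numpy as np
from sklearn.svm import SVC

def video_feature(key_frame_feats, pS):
    """Process S43: split the mKF x nLo key-frame feature matrix into pS
    row-blocks, average each block over its rows, and concatenate the pS
    means into one length nLo*pS descriptor per video."""
    mKF = key_frame_feats.shape[0]
    usable = key_frame_feats[: mKF - mKF % pS]   # the patent assumes pS | mKF
    parts = np.array_split(usable, pS)
    return np.concatenate([p.mean(axis=0) for p in parts])

def classify(train_feats, train_labels, test_feats):
    """Process S44: final semantic concept prediction with an SVM
    (kernel and hyperparameters are not fixed by the patent)."""
    svm = SVC()
    svm.fit(train_feats, train_labels)
    return svm.predict(test_feats)
```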
The present invention has the following beneficial effects. By combining topological properties with unsupervised pre-training and using only a small number of supervised samples, the invention overcomes the need of convolutional neural network training for large numbers of labeled samples and its slow convergence; the accuracy and robustness of the model exceed those of a model without pre-training, and the model is better suited to characteristics of video data such as target translation, object scaling and object rotation. When the features extracted by the convolutional neural network model pre-trained with the topological model are used for video semantic analysis, the accuracy of the model in video semantic analysis is effectively improved.
Brief description of the drawings
Fig. 1 is a flow chart of the construction of the topological linear decoder.
Fig. 2 is a schematic flow chart of video semantic analysis detection.
Fig. 3 is a schematic diagram of the topological linear decoder.
Detailed description of the invention
The technical scheme of the present invention is described in further detail below in conjunction with the drawings and specific embodiments.
Referring to Fig. 1 and Fig. 2, according to a preferred embodiment of the present invention, the video semantic analysis method based on pre-training a convolutional neural network with a topological model comprises the following steps. S1: preprocess the video training set and build a sparse linear decoder. S2: add a topological-property constraint to build the topological linear decoder, and divide the frames of the video training set into image blocks to train it. S3: use the parameters of the trained topological linear decoder as the initial parameters of a convolutional layer in a convolutional neural network. S4: adopt multi-fold cross validation and build a key frame set from the video training set to fine-tune the convolutional neural network, obtain a generic feature extractor for the video data, and finally feed the features extracted on the training set and the test set into a support vector machine for video semantic classification.
Referring to Fig. 1 and Fig. 3, in the construction of the above topological linear decoder, a linear decoder model is first defined and a topological regularization term is then introduced into it, with the relative importance of the regularization term within the overall objective function adjusted through its coefficient. The process is realized as follows:
Process S11: let m denote the number of videos in the video training set, let the mf-th video contain $mF^{(mf)}$ image frames in total, and let its label be $y^{(mf)}$. First extract all frames of the m videos and resize each frame to n × n × 3, where n is the width and height of each frame and 3 indicates the RGB color standard. Set up a sliding window of size k × k with stride p; sliding this window, one frame yields $((n-k)/p + 1)^2$ image blocks, and the whole video training set yields $M = \sum_{mf} mF^{(mf)} \cdot ((n-k)/p + 1)^2$ image blocks in total. Flatten each image block into a vector x of length k × k × 3, shuffle all the image blocks, and divide them into nbS = M/bS batches of bS training samples each; the resulting data set serves as the training set for the topological linear decoder;
Process S12: first define the linear decoder model: it consists of a first layer as the input layer, a second layer as the hidden layer and a third layer as the output layer, with nL1, nL2 and nL3 neurons per layer respectively, where nL1 = nL3. The weight parameters between the first and second layers and between the second and third layers are $W^{(1)}$ and $W^{(2)}$ respectively; $w_{ji}^{(nl)}$ denotes the weight connecting the j-th neuron of the (nl+1)-th layer with the i-th neuron of the nl-th layer, and $b_j^{(nl)}$ the bias of the j-th neuron of the (nl+1)-th layer, where nl ∈ {1, 2}. The activation function of the second-layer neurons is:
Formula (1): $a_j^{(2)} = f^{(2)}(z_j^{(2)}) = \dfrac{e^{z_j^{(2)}} - e^{-z_j^{(2)}}}{e^{z_j^{(2)}} + e^{-z_j^{(2)}}}$
where $a_j^{(2)}$ is the output of the j-th second-layer neuron and $z_j^{(2)}$ is its input:
Formula (2): $z_j^{(2)} = \sum_{i=1}^{nL1} w_{ji}^{(1)} a_i^{(1)} + b_j^{(1)}$
where $a_i^{(1)}$ is the output of the i-th first-layer neuron, i.e. an element of the image block vector, so that $a^{(1)} = x$. The activation function of the third-layer neurons is:
Formula (3): $\hat{x} = a_j^{(3)} = f^{(3)}(z_j^{(3)}) = z_j^{(3)}$
that is, each third-layer neuron is a linear combination of the second-layer neurons:
Formula (4): $z_j^{(3)} = \sum_{i=1}^{nL2} w_{ji}^{(2)} a_i^{(2)} + b_j^{(2)}$
This yields the autoencoder objective function value:
Formula (5): $J(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2$
where $\hat{x}$ is the output vector obtained when x is input into this model.
Process S13: after the most basic linear decoder is established, a weight attenuation term is added to the objective function to prevent the over-fitting caused by weight explosion, giving the objective function:
Formula (6): $J_{w\text{-}decay}(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2 + \frac{\lambda}{2} \sum_{l=1}^{Nl-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (w_{ji}^{(l)})^2$
where Nl is the number of layers of the model, here Nl = 3; $s_l$ is the number of neurons in the l-th layer and $s_{l+1}$ the number in the (l+1)-th layer; λ balances the importance of the weight attenuation term within the overall objective function.
Process S14: on the basis of S13, a sparsity characteristic is introduced into the model: for the hidden-layer neurons, most neurons should reach the inhibited state, with activation close to −1, for each input sample, and only a small fraction should have activation close to 1, so that the sparse structure of the input data is extracted. A sparse regularization term is added to the objective function, giving:
Formula (7): $J_{sparse}(W, b; x, \hat{x}) = J_{w\text{-}decay}(W, b; x, \hat{x}) + \beta \sum_{j=1}^{s_2} L1(\rho \,\|\, \hat{\rho}_j)$
that is, the average activation $\hat{\rho}_j$ of each hidden-layer neuron is kept below a certain value:
Formula (8): $\hat{\rho}_j = \frac{1}{bS} \sum_{i=1}^{bS} [a_j^{(2)}(x^{(i)})]$
which is the mean activation of each hidden-layer neuron over the input samples $x^{(i)}$; ρ is the sparsity coefficient controlling the mean activation level of the hidden layer. The activation of the hidden layer is constrained towards this set value with the L1 regularization formula:
Formula (9): $L1(\rho \,\|\, \hat{\rho}_j) = \|\hat{\rho}_j - \rho\|$
In the present embodiment, referring to Fig. 1, preferably, on the basis of the established sparse linear decoder, topological constraints are imposed on the activations of the hidden-layer neurons so that the model becomes a topological linear decoder: the hidden-layer neurons are grouped in order so that neurons within the same group have similar activation levels while neurons of different groups are mutually independent, allowing the model to learn the topological properties of the data. The process is realized as follows:
Process S21: after step S14, a sparse linear decoder is obtained. Process S14, building on process S13, used the L1 regularization formula to constrain the average activations of all hidden-layer neurons near a set value. The topology is introduced by first grouping all neurons of the hidden layer: the second layer has nL2 neurons, which are arranged in a $\sqrt{nL2} \times \sqrt{nL2}$ matrix, denoted the topological group-selection matrix T. In this matrix the activation at any point is influenced by the neurons within an sk × sk region around that point, i.e. the neurons within the surrounding sk × sk region form one group; since the hidden layer has nL2 neurons in all, there are nL2 groups. The sum of squares of the activations of all neurons in a group serves as the target value of that group, which gives the objective function of the topological linear decoder:
Formula (10): $J_{topo}(W, b; x, \hat{x}) = J_{sparse}(W, b; x, \hat{x}) + \frac{\gamma}{bS} \sum \sqrt{V S^{.2} + \epsilon}$
where V is the grouping matrix of size nL2 × nL2, built as follows: for each group, i.e. each row vector of V, first define a marker matrix $F^{(t)}$ of the same size as the topological group-selection matrix T, where:
Formula (11): $F_{ij}^{(t)} = \begin{cases} 1 & i, j \in Sg^{(t)} \\ 0 & \text{otherwise} \end{cases}, \quad t \in [0, nL2 - 1]$
$F_{ij}^{(t)}$ is the value in row i, column j of the marker matrix of the t-th group, and $Sg^{(t)}$ is the topological selection region of the t-th group:
Formula (12): $Sg^{(t)} = [rSt : rSe,\; cSt : cSe]$, with $rSt = \bmod(\lfloor t/\sqrt{nL2} \rfloor + 0, \sqrt{nL2})$, $rSe = \bmod(\lfloor t/\sqrt{nL2} \rfloor + sk, \sqrt{nL2})$, $cSt = \bmod(\bmod(t, \sqrt{nL2}) + 0, \sqrt{nL2})$, $cSe = \bmod(\bmod(t, \sqrt{nL2}) + sk, \sqrt{nL2})$
where mod is the modulo function. Hence for the grouping matrix:
Formula (13): $V(t,\; i \times \sqrt{nL2} + j) = F_{ij}^{(t)}$, $t \in [0, nL2 - 1]$, $i, j \in [0, \sqrt{nL2} - 1]$,
that is, V(r, c) = 1 indicates that the c-th neuron belongs to the r-th group; in formula (10), S is the nL2 × bS matrix formed by the hidden-layer activations, $S^{.2}$ denotes its element-wise square, ε is a smoothing parameter preventing the root of a singular (zero) value, and γ balances the importance of the topological regularization term within the overall objective function.
Process S22: the image blocks of all video frames in the training set obtained by process S11 form an nP × vS matrix, where vS is the number of neurons in the input layer of the topological sparse linear decoder, i.e. vS = nL1 = k × k × 3, the number of pixels in an RGB three-channel sliding window; the middle layer of the model is the hidden layer, and after the model is trained the output values of this layer serve as the feature values corresponding to an input. Because the nP × vS matrix is too large, it is first divided into batches of size bS × vS, and the BP algorithm trains one batch at a time; one pass over all the training data constitutes one epoch, and several epochs are trained until the model converges.
Preferably, the weight parameters of the topological linear decoder trained on a large number of unlabeled image blocks are used as the initial parameters of the convolutional layer in the convolutional neural network model, laying the foundation for the subsequent fine-tuning. The process is realized as follows:
Process S31: the input layer of the convolutional neural network model is a video frame image, i.e. n × n × 3. A convolutional layer contains multiple feature maps, each feature map shares one convolution kernel, and the receptive field size of each kernel is k × k × 3; the convolutional layer is fully connected to the preceding layer, i.e. each feature map of the convolutional layer is associated with every feature map of the preceding layer:
Formula (14): $x_j^{l} = f\left(\sum_{i \in M_j} x_i^{l-1} * w_{ij}^{l} + b_j^{l}\right)$
where $x_j^l$ is the j-th feature map of layer l; $x_i^{l-1}$ is the i-th feature map of layer l−1; $w_{ij}^l$ is the connection weight between the j-th feature map of layer l and the i-th feature map of layer l−1; and $b_j^l$ is the bias of the j-th feature map of layer l.
Process S32: the structure of the topological linear decoder trained in process S22 is nL1, nL2, nL3, and each neuron of its hidden layer is fully connected to each neuron of the input layer, as shown in formulas (2) and (3); the weights between one hidden unit and the input layer are assigned to the pixels of the preceding-layer receptive field corresponding to each feature map in the convolutional layer of the convolutional neural network, i.e. they become the weight values on the convolution kernel.
Referring to Fig. 2, the convolutional neural network model is fine-tuned, under multi-fold cross validation, on the training set formed from multiple key frames of the video training set; a generic video feature extractor is thus established, and finally the features extracted on the training set and the test set are fed into a support vector machine for video semantic classification. The process is realized as follows:
Process S41: adopt multi-fold cross validation to divide the video set into a training set and a test set; the preceding processes are carried out on all video frames of the training set. Key frames are first chosen from every video of the training set at intervals of sF frames and used as the key frames of that video: if the mf-th video has $mF^{(mf)}$ frames in total, the frames 1 : sF : $mF^{(mf)}$ are marked as the image key frames of that video and labeled with its video class $y^{(mf)}$; all key frames of the training-set videos then form the data set for fine-tuning the convolutional neural network model.
Process S42: with Softmax as the top-layer model of the convolutional neural network model, fine-tune the model with the BP algorithm until convergence; then remove the top Softmax layer to obtain the generic feature extractor for this video data set, and let nLo be the number of output-layer units of the convolutional neural network;
Process S43: extract convolutional neural network features on the key frames of the training set and the test set obtained in process S41. If the mf-th video has $mKF^{(mf)}$ key frames, each video yields an mKF × nLo feature matrix whose rows index the key frames and whose columns hold the features extracted on the corresponding key frame. Divide the rows of this feature matrix into pS parts, each an (mKF/pS) × nLo matrix, i.e. a matrix of mKF/pS rows and nLo columns; average each part along its rows to obtain a feature vector of length nLo on that part, and concatenate the vectors of the different parts end to end to obtain a feature vector of length nLo × pS as the feature vector of the video;
Process S44: the preceding processes yield the feature matrices and label matrices of the training set and the test set; these features are put into the support vector machine model for the final semantic concept prediction.
A concrete example follows: semantic analysis of ten classes of videos from TRECVID 2012, namely Airplane Flying, Baby, Building, Car, Dog, Flower, Instrumental Musician, Mountain, Scene Text and Speech.
First, multi-fold cross validation divides the video set into a training set and a test set, and the video sequences within them are shuffled to remove contextual ordering between videos; then all videos in the training set are separated into RGB color video frame images, each frame is divided into image blocks of size 7 × 7 × 3, and the blocks are shuffled to generate a shuffled image-block matrix.
Next, the technical scheme of the present invention is used to build and refine the model. First, according to step S1 above, the topological property is added on the basis of the sparse linear decoder to build the topological linear decoder: the input layer has 7 × 7 × 3 neurons, the middle layer 400 neurons and the output layer 7 × 7 × 3 neurons, with λ = 0.003 in formula (6), β = 0.1 and ρ = −0.095 in formula (7), and γ = 0.08 in formula (10). The BP algorithm iteratively trains the topological linear decoder until convergence, and the weight parameters between the input layer and the middle layer of this topological linear decoder are used as the weight parameters of the first convolutional layer of the convolutional neural network model;
Then the videos in the training set are separated into RGB color video frame images which, without division into image blocks, are fed directly into the pre-trained one-layer convolutional neural network; its output serves as the training data for the next topological linear decoder;
The converged second topological linear decoder is obtained in the same way, and the parameters between its input layer and middle layer serve as the initialization parameters of the second convolutional layer, giving a pre-trained two-layer convolutional neural network model. The video sequences in the training set are shuffled and processed according to step S4 above to obtain the training set for fine-tuning the convolutional neural network; training yields the feature vector of each video, which is put into the support vector machine for the final result prediction.
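The layer-wise stacking of this embodiment pushes whole frames through the first pre-trained convolutional layer and trains the second decoder on the result. A sketch of that forward pass follows (our own illustration; a tanh nonlinearity is assumed, matching the decoder's hidden activation, and the kernels argument is the array produced by the reshape sketched earlier):

```python
import numpy as np

def conv_forward(frame, kernels, stride=1):
    """Valid convolution of one n x n x 3 frame with the pre-trained
    kernels; the output feature maps are the training data for the
    second topological linear decoder."""
    k, n = kernels.shape[1], frame.shape[0]
    side = (n - k) // stride + 1
    out = np.empty((side, side, len(kernels)))
    for m, w in enumerate(kernels):
        for r in range(side):
            for c in range(side):
                patch = frame[r*stride:r*stride+k, c*stride:c*stride+k, :]
                out[r, c, m] = np.tanh(np.sum(patch * w))
    return out
```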
To evaluate the performance of the method of the present invention on video semantic analysis detection, the commonly used mean average precision (MAP, Mean Avg-Precision) is adopted as the measurement index. Key frames are extracted from the test videos in the same way and their feature vectors obtained, and semantic analysis detection is carried out according to step S4. The comparison methods are SIFT features with a BoW bag-of-words model, LBP features with a histogram model, a randomly initialized convolutional neural network model, and a non-topological convolutional neural network model pre-trained with a sparse linear decoder, against the convolutional neural network model of the present invention pre-trained with the topological model. Using 5-fold cross validation, the video semantic analysis detection results of the comparison methods on the same test videos are shown in Table 1, where CNN denotes the convolutional neural network, LD-CNN the convolutional neural network model pre-trained with the sparse linear decoder, and TLD-CNN the convolutional neural network model pre-trained with the topological linear decoder.
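The MAP index can be computed per concept and then averaged; a hedged sketch using scikit-learn (binary relevance labels and per-concept SVM decision scores are assumed, which the patent does not specify):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """MAP over semantic concepts: average precision is computed for each
    concept (column) and the per-concept values are averaged.
    y_true: (n_videos, n_concepts) binary; y_score: same shape, scores."""
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps))
```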
Table 1: Video semantic analysis detection results
The data of Table 1 show that, under the same learning mechanism, the overall composite index of the results obtained with the convolutional neural network pre-trained on the topological model provided by the present invention is superior to the other comparison models, and the detection of each individual semantic concept is generally also superior to the other methods.
In summary, in the video semantic analysis method based on pre-training a convolutional neural network with a topological model provided by the present invention, the scheme first trains a linear decoder with topological properties without supervision and then fine-tunes the pre-trained convolutional neural network model with a small number of supervised samples, solving the slow convergence of convolutional neural network model training; with the topological constraint introduced, the learned model parameters better cope with video data samples of varied content, improving the accuracy and robustness of the model.
Although the present invention is disclosed above with preferred embodiments, they do not limit the present invention. Any person with ordinary knowledge in the technical field of the present invention may make various modifications and variations without departing from the spirit and scope of the present invention. The protection scope of the present invention is therefore defined by the claims.

Claims (5)

1. A video semantic analysis method, characterized in that it comprises the following steps:
S1: preprocess the video training set and build a sparse linear decoder;
S2: add a topological-property constraint to the sparse linear decoder to obtain a topological linear decoder, divide the frames of the video training set into image blocks to build an image-block training set, and train the topological linear decoder on it;
S3: use the weight parameters of the trained topological linear decoder as the initial parameters of a convolutional layer in a convolutional neural network;
S4: adopt multi-fold cross validation, build a key frame set from the video training set to fine-tune the convolutional neural network, thereby obtaining a generic feature extractor for the video data, and finally feed the features extracted on the training set and the test set into a support vector machine for video semantic classification.
2. The video semantic analysis method according to claim 1, characterized in that: in the construction of the sparse linear decoder model, a linear decoder model is first defined, weight-decay and sparse regularization terms are then introduced into it, and the relative importance of each regularization term within the overall objective function is adjusted through its coefficient; the concrete process is as follows:
Process S11: let m denote the number of videos in the video training set, let the mf-th video contain $mF^{(mf)}$ image frames in total, and let its label be $y^{(mf)}$; first extract all frames of the m videos and resize each frame to n × n × 3, where n is the width and height of each frame and 3 indicates the RGB color standard; set up a sliding window of size k × k with stride p; sliding this window, one frame yields $((n-k)/p + 1)^2$ image blocks, and the whole video training set yields $M = \sum_{mf} mF^{(mf)} \cdot ((n-k)/p + 1)^2$ image blocks in total; flatten each image block into a vector x of length k × k × 3, shuffle all the image blocks, and divide them into nbS = M/bS batches of bS training samples each; the resulting data set serves as the training set for the topological linear decoder;
Process S12: first define the linear decoder model, consisting of a first layer as the input layer, a second layer as the hidden layer and a third layer as the output layer, with nL1, nL2 and nL3 neurons per layer respectively, where nL1 = nL3; the weight parameters between the first and second layers and between the second and third layers are $W^{(1)}$ and $W^{(2)}$ respectively, $w_{ji}^{(nl)}$ denotes the weight connecting the j-th neuron of the (nl+1)-th layer with the i-th neuron of the nl-th layer, and $b_j^{(nl)}$ the bias of the j-th neuron of the (nl+1)-th layer, nl ∈ {1, 2}; the activation function of the second-layer neurons is formula (1):
$$a_j^{(2)} = f^{(2)}\left(z_j^{(2)}\right) = \frac{e^{z_j^{(2)}} - e^{-z_j^{(2)}}}{e^{z_j^{(2)}} + e^{-z_j^{(2)}}} \qquad (1)$$
where $a_j^{(2)}$ is the output of the j-th second-layer neuron and $z_j^{(2)}$ is its input, given by formula (2):
$$z_j^{(2)} = \sum_{i=1}^{nL1} w_{ji}^{(1)} a_i^{(1)} + b_j^{(1)} \qquad (2)$$
Here $a_i^{(1)}$ is the output of the i-th first-layer neuron, i.e. an element of the image block vector, so that $a^{(1)} = x$; the activation function of the third-layer neurons is formula (3):
$$\hat{x} = a_j^{(3)} = f^{(3)}\left(z_j^{(3)}\right) = z_j^{(3)} \qquad (3)$$
that is, each third-layer neuron is a linear combination of the second-layer neurons, as in formula (4):
$$z_j^{(3)} = \sum_{i=1}^{nL2} w_{ji}^{(2)} a_i^{(2)} + b_j^{(2)} \qquad (4)$$
This yields the autoencoder objective function value of formula (5):
$$J(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2 \qquad (5)$$
where $\hat{x}$ is the output vector obtained when x is input into this model;
Process S13: after the most basic linear decoder is established, a weight attenuation term is added to the objective function to prevent the over-fitting caused by weight explosion, giving the objective function of formula (6):
$$J_{w\text{-}decay}(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2 + \frac{\lambda}{2} \sum_{l=1}^{Nl-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(w_{ji}^{(l)}\right)^2 \qquad (6)$$
where Nl is the number of layers of the model, here Nl = 3; $s_l$ is the number of neurons in the l-th layer and $s_{l+1}$ the number in the (l+1)-th layer; λ balances the importance of the weight attenuation term within the overall objective function;
Process S14: on the basis of S13, a sparsity characteristic is introduced into the model: for the hidden-layer neurons, most neurons should reach the inhibited state, with activation close to −1, for each input sample, and only a small fraction should have activation close to 1, so that the sparse structure of the input data is extracted; a sparse regularization term is added to the objective function, giving:
$$J_{sparse}(W, b; x, \hat{x}) = J_{w\text{-}decay}(W, b; x, \hat{x}) + \beta \sum_{j=1}^{s_2} L1\left(\rho \,\middle\|\, \hat{\rho}_j\right) \qquad (7)$$
That is, the sparse regularization term keeps the average activation $\hat{\rho}_j$ of each hidden-layer neuron below a certain value, where the average activation of each neuron is:
$$\hat{\rho}_j = \frac{1}{bS} \sum_{i=1}^{bS} \left[a_j^{(2)}\left(x^{(i)}\right)\right] \qquad (8)$$
Formula (8) is the mean activation of each hidden-layer neuron over the input samples $x^{(i)}$; ρ is the sparsity coefficient controlling the mean activation level of the hidden layer; the activation of the hidden layer is constrained towards this set value with the L1 regularization formula:
$$L1\left(\rho \,\middle\|\, \hat{\rho}_j\right) = \left\|\hat{\rho}_j - \rho\right\| \qquad (9).$$
3. The video semantic analysis method according to claim 1, characterized in that: the topological linear decoder is built on the basis of the sparse linear decoder; by imposing topological constraints on the activations of the hidden-layer neurons, the model becomes a topological linear decoder: the hidden-layer neurons are grouped in order so that neurons within the same group have similar activation levels while neurons of different groups are mutually independent, allowing the model to learn the topological properties of the data; the process is realized as follows:
Process S21: after process S14, a sparse linear decoder is obtained; process S14, building on process S13, used the L1 regularization formula to constrain the average activations of all hidden-layer neurons near a set value. The topology is introduced by first grouping all neurons of the hidden layer: the second layer has nL2 neurons, which are arranged in a $\sqrt{nL2} \times \sqrt{nL2}$ matrix, denoted the topological group-selection matrix T. In this matrix the activation at any point is influenced by the neurons within an sk × sk region around that point, i.e. the neurons within the surrounding sk × sk region form one group; since the hidden layer has nL2 neurons in all, there are nL2 groups;
The sum of squares of the activations of all neurons in a group serves as the target value of that group, which gives the objective function of the topological linear decoder:
$$J_{topo}(W, b; x, \hat{x}) = J_{sparse}(W, b; x, \hat{x}) + \frac{\gamma}{bS} \sum \sqrt{V S^{.2} + \epsilon} \qquad (10)$$
where V is the grouping matrix of size nL2 × nL2, built as follows: for each group, i.e. each row vector of V, first define a marker matrix $F^{(t)}$ of the same size as the topological group-selection matrix T;
$$F_{ij}^{(t)} = \begin{cases} 1 & i, j \in Sg^{(t)} \\ 0 & \text{otherwise} \end{cases}, \quad t \in [0, nL2 - 1] \qquad (11)$$
$F_{ij}^{(t)}$ is the value in row i, column j of the marker matrix of the t-th group, and $Sg^{(t)}$ is the topological selection region of the t-th group:
$$Sg^{(t)} = [rSt : rSe,\; cSt : cSe], \quad \begin{aligned} rSt &= \bmod\left(\lfloor t/\sqrt{nL2} \rfloor + 0,\; \sqrt{nL2}\right) \\ rSe &= \bmod\left(\lfloor t/\sqrt{nL2} \rfloor + sk,\; \sqrt{nL2}\right) \\ cSt &= \bmod\left(\bmod(t, \sqrt{nL2}) + 0,\; \sqrt{nL2}\right) \\ cSe &= \bmod\left(\bmod(t, \sqrt{nL2}) + sk,\; \sqrt{nL2}\right) \end{aligned} \qquad (12)$$
where mod is the modulo function; hence for the grouping matrix:
$$V\left(t,\; i \times \sqrt{nL2} + j\right) = F_{ij}^{(t)}, \quad t \in [0, nL2 - 1],\; i, j \in \left[0, \sqrt{nL2} - 1\right] \qquad (13)$$
That is, V(r, c) = 1 indicates that the c-th neuron belongs to the r-th group; in formula (10), S is the nL2 × bS matrix formed by the hidden-layer activations, $S^{.2}$ denotes its element-wise square, ε is a smoothing parameter preventing the root of a singular (zero) value, and γ balances the importance of the topological regularization term within the overall objective function;
Process S22: the image blocks of all video frames in the training set obtained by process S11 form an nP × vS matrix, where vS is the number of neurons in the input layer of the topological sparse linear decoder, i.e. vS = nL1 = k × k × 3, the number of pixels in an RGB three-channel sliding window; the middle layer of the model is the hidden layer, and after the model is trained the output values of this layer serve as the feature values corresponding to an input; because the nP × vS matrix is too large, it is first divided into batches of size bS × vS, and the BP algorithm trains one batch at a time; one pass over all the training data constitutes one epoch, and several epochs are trained until the model converges.
4. The video semantic analysis method according to claim 1, characterized in that the weight parameters of the trained topological linear decoder serve as the initial parameters of the convolutional neural network, which is subsequently fine-tuned with a small number of labeled samples to obtain better parameters; the concrete process is as follows:
Process S31: the input layer of the convolutional neural network model is a video frame image, i.e. n × n × 3; a convolutional layer contains multiple feature maps, each feature map shares one convolution kernel, and the receptive field size of each kernel is k × k × 3; the convolutional layer is fully connected to the preceding layer, i.e. each feature map of the convolutional layer is associated with every feature map of the preceding layer:
$$x_j^{l} = f\left(\sum_{i \in M_j} x_i^{l-1} * w_{ij}^{l} + b_j^{l}\right) \qquad (14)$$
where $x_j^l$ is the j-th feature map of layer l; $x_i^{l-1}$ is the i-th feature map of layer l−1; $w_{ij}^l$ is the connection weight between the j-th feature map of layer l and the i-th feature map of layer l−1; and $b_j^l$ is the bias of the j-th feature map of layer l;
Process S32: the structure of the topological linear decoder trained in process S22 is nL1, nL2, nL3, and each neuron of its hidden layer is fully connected to each neuron of the input layer, as shown in formulas (2) and (3); the weights between one hidden unit and the input layer are assigned to the pixels of the preceding-layer receptive field corresponding to each feature map in the convolutional layer of the convolutional neural network, i.e. they become the weight values on the convolution kernel.
5. The video semantic analysis method according to claim 1, characterized in that: the generic video feature extractor is obtained by fine-tuning the convolutional neural network model, under multi-fold cross validation, on a new training set formed from multiple key frames of the video training set; after this generic feature extractor is obtained, the features extracted on the training set and the test set are fed into a support vector machine for video semantic classification; the process is realized as follows:
Process S41: adopt multi-fold cross validation to divide the video set into a training set and a test set; the preceding processes are carried out on all video frames of the training set; key frames are first chosen from every video of the training set at intervals of sF frames and used as the key frames of that video: if the mf-th video has $mF^{(mf)}$ frames in total, the frames 1 : sF : $mF^{(mf)}$ are marked as the image key frames of that video and labeled with its video class $y^{(mf)}$; all key frames of the training-set videos then form the data set for fine-tuning the convolutional neural network model;
Process S42: with Softmax as the top-layer model of the convolutional neural network model, fine-tune the model with the BP algorithm until convergence; then remove the top Softmax layer to obtain the generic feature extractor for this video data set, and let nLo be the number of output-layer units of the convolutional neural network;
Process S43: extract convolutional neural network features on the key frames of the training set and the test set obtained in process S41; if the mf-th video has $mKF^{(mf)}$ key frames, each video yields an mKF × nLo feature matrix whose rows index the key frames and whose columns hold the features extracted on the corresponding key frame; divide the rows of this feature matrix into pS parts, each an (mKF/pS) × nLo matrix, i.e. a matrix of mKF/pS rows and nLo columns; average each part along its rows to obtain a feature vector of length nLo on that part, and concatenate the vectors of the different parts end to end to obtain a feature vector of length nLo × pS as the feature vector of the video;
Process S44: the preceding processes yield the feature matrices and label matrices of the training set and the test set; these features are put into the support vector machine model for the final semantic concept prediction.
CN201610107770.5A, filed 2016-02-26 (priority 2016-02-26): Video semantic analysis method; Active; granted as CN105701480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610107770.5A CN105701480B (en) 2016-02-26 2016-02-26 Video semantic analysis method


Publications (2)

Publication Number Publication Date
CN105701480A true CN105701480A (en) 2016-06-22
CN105701480B CN105701480B (en) 2019-02-01

Family

ID=56222546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610107770.5A Active CN105701480B (en) 2016-02-26 2016-02-26 Video semantic analysis method

Country Status (1)

Country Link
CN (1) CN105701480B (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834941A (en) * 2015-05-19 2015-08-12 重庆大学 Offline handwriting recognition method of sparse autoencoder based on computer input

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HANLI GOH et al.: "Learning Invariant Color Features with Sparse Topographic Restricted Boltzmann Machines", International Conference on Image Processing *
ZHAN Yongzhao et al.: "Video semantic analysis method based on nonlinear discriminable sparse representation" (基于非线性可鉴别的稀疏表示视频语义分析方法), Journal of Jiangsu University (Natural Science Edition) *
ZHAN Yongzhao et al.: "Video semantic analysis using kernel-discriminable feature-block sparse representation" (核可鉴别的特征分块稀疏表示的视频语义分析), Journal of Computer-Aided Design & Computer Graphics *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107025267A (en) * 2017-03-01 2017-08-08 国政通科技股份有限公司 Method and system for retrieving video based on extracted key logical information
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 Video content description method guided by semantic information
CN107038221B (en) * 2017-03-22 2020-11-17 杭州电子科技大学 Video content description method guided by semantic information
CN108664844A (en) * 2017-03-28 2018-10-16 爱唯秀股份有限公司 Image object semantic recognition and tracking with convolutional deep neural networks
CN108665055A (en) * 2017-03-28 2018-10-16 上海荆虹电子科技有限公司 Image caption generation method and device
CN108665055B (en) * 2017-03-28 2020-10-23 深圳荆虹科技有限公司 Method and device for generating image captions
WO2018218481A1 (en) * 2017-05-31 2018-12-06 深圳市大疆创新科技有限公司 Neural network training method and device, computer system and mobile device
CN107391646B (en) * 2017-07-13 2020-04-10 清华大学 Semantic information extraction method and device for video images
CN107391646A (en) * 2017-07-13 2017-11-24 清华大学 Semantic feature extraction method and device for video images
CN109784129A (en) * 2017-11-14 2019-05-21 北京京东尚科信息技术有限公司 Information output method and device
CN108805036A (en) * 2018-05-22 2018-11-13 电子科技大学 Novel unsupervised video semantic extraction method
CN108805036B (en) * 2018-05-22 2022-11-22 电子科技大学 Unsupervised video semantic extraction method
CN108960059A (en) * 2018-06-01 2018-12-07 众安信息技术服务有限公司 Video action recognition method and device
CN109035488A (en) * 2018-08-07 2018-12-18 哈尔滨工业大学(威海) Aero-engine time series anomaly detection method based on CNN feature extraction
CN110738128A (en) * 2019-09-19 2020-01-31 天津大学 Repeated video detection method based on deep learning
CN111565318A (en) * 2020-05-06 2020-08-21 中国科学院重庆绿色智能技术研究院 Video compression method based on sparse samples
CN111695422A (en) * 2020-05-06 2020-09-22 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN111695422B (en) * 2020-05-06 2023-08-18 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN112016513A (en) * 2020-09-08 2020-12-01 北京达佳互联信息技术有限公司 Video semantic segmentation method, model training method, related device and electronic equipment
CN112016513B (en) * 2020-09-08 2024-01-30 北京达佳互联信息技术有限公司 Video semantic segmentation method, model training method, related device and electronic equipment

Also Published As

Publication number Publication date
CN105701480B (en) 2019-02-01


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant