CN105701480A - Video semantic analysis method - Google Patents


Info

Publication number
CN105701480A
Authority
CN
China
Prior art keywords
layer
video
decoder
model
training set
Prior art date
Legal status: Granted
Application number
CN201610107770.5A
Other languages
Chinese (zh)
Other versions
CN105701480B (en)
Inventor
詹永照
詹智财
张建明
彭长生
Current Assignee
JIANGSU KING INTELLIGENT SYSTEM CO Ltd
Jiangsu University
Original Assignee
JIANGSU KING INTELLIGENT SYSTEM CO Ltd
Jiangsu University
Priority date
Filing date
Publication date
Application filed by JIANGSU KING INTELLIGENT SYSTEM CO Ltd and Jiangsu University
Priority to CN201610107770.5A
Publication of CN105701480A
Application granted
Publication of CN105701480B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10: Character recognition
    • G06V 30/26: Techniques for post-processing, e.g. correcting the recognition result
    • G06V 30/262: Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V 30/274: Syntactic or semantic context, e.g. balancing

Abstract

The present invention provides a video semantic analysis method. The method comprises the following steps: S1, preprocessing a video training set and constructing a sparse linear decoder; S2, adding a topological-property constraint to build a topological linear decoder, and dividing the frames of the video training set into image blocks to train the topological linear decoder; S3, taking the parameters of the trained topological linear decoder as the initial parameters of a convolutional layer in a convolutional neural network; and S4, building a key frame set from the video training set to fine-tune the convolutional neural network under multi-fold cross validation, building a generic feature extractor for video data, and feeding the features extracted on the training set and a test set into a support vector machine for video semantic classification. With the model training method provided by the invention, the model can cope with video data samples of varied content, which improves its accuracy and robustness.

Description

Video semantic analysis method
Technical field
The present invention relates to the technical field of video semantic detection, and in particular to a video semantic analysis method.
Background art
To detect video semantic categories, convolutional neural network models have been used to extract features from the key frame set of a video. Experiments show that, unlike hand-designed feature extraction, a convolutional neural network learns distributed features from the data itself; features obtained in this data-driven way adapt to a wider range of domains. However, the convolutional neural network is a supervised learning model: training it requires not only a training data set but also the corresponding labels, and convergence requires many iterations over a large number of samples. For tasks such as classification and detection on massive video data, a label for every video cannot be obtained.
For video data, convolutional neural network models with supervised-training characteristics have been adopted, and earlier work on unsupervised pre-training followed by supervised training alleviates the slow convergence of the traditional convolutional neural network. Compared with image data, however, video data exhibits rotation, scaling and translation of the same object across frames, so the feature extractor must capture features with more complex invariance. How to extract features with stronger invariance therefore remains a problem to be solved.
Summary of the invention
The object of the present invention is to provide a video semantic analysis method that combines the advantages of unsupervised pre-training with topological properties, so that the convolutional neural network needs fewer labeled samples than before and converges to a stable value faster. Owing to the introduced topological property, the model can extract features that better cope with target translation, object scaling and object rotation, improving the accuracy and robustness of the model in semantic analysis detection.
In order to solve the above technical problem, the concrete technical scheme adopted by the present invention is as follows:
A video semantic analysis method, characterized in that it comprises the following steps:
S1: preprocess the video training set and build a sparse linear decoder;
S2: add a topological-property constraint to the sparse linear decoder to obtain a topological linear decoder, divide the frames of the video training set into image blocks to build an image-block training set, and train the topological linear decoder on it;
S3: use the weight parameters of the trained topological linear decoder as the initial parameters of a convolutional layer in a convolutional neural network;
S4: adopt multi-fold cross validation, build a key frame set from the video training set to fine-tune the convolutional neural network, thereby obtaining a generic feature extractor for the video data, and finally feed the features extracted on the training set and the test set into a support vector machine for video semantic classification.
To construct the sparse linear decoder model, a linear decoder model is first defined, weight-decay and sparse regularization terms are then introduced into it, and the relative importance of each regularization term within the overall objective function is adjusted through its coefficient. The concrete process is as follows:
Process S11: let m denote the number of videos in the video training set, let the mf-th video contain $mF^{(mf)}$ image frames in total, and let its label be $y^{(mf)}$. First extract all frames of the m videos and resize each frame to n × n × 3, where n is the width and height of each frame and 3 indicates the RGB color standard. Set up a sliding window of size k × k with stride p; sliding this window, one frame yields $((n-k)/p + 1)^2$ image blocks, and the whole video training set yields $M = \sum_{mf} mF^{(mf)} \cdot ((n-k)/p + 1)^2$ image blocks in total. Flatten each image block into a vector x of length k × k × 3, shuffle all the image blocks, and divide them into nbS = M/bS batches of bS training samples each; the resulting data set serves as the training set for the topological linear decoder (a sketch of this block extraction is given below);
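For illustration only, the block extraction of process S11 can be sketched in Python as follows; this is a non-authoritative sketch, and the names frame, frames, k, p and bS are ours, not the patent's:

```python
import numpy as np

def extract_blocks(frame, k, p):
    """Slide a k x k window with stride p over an n x n x 3 frame and
    flatten each block to a vector of length k*k*3 (process S11)."""
    n = frame.shape[0]
    blocks = [frame[r:r + k, c:c + k, :].reshape(-1)
              for r in range(0, n - k + 1, p)
              for c in range(0, n - k + 1, p)]
    return np.stack(blocks)  # ((n - k)/p + 1)^2 rows per frame

def make_training_batches(frames, k, p, bS, seed=0):
    """Pool the blocks of all frames, shuffle them, and split them into
    nbS = M/bS mini-batches of bS samples each."""
    X = np.concatenate([extract_blocks(f, k, p) for f in frames])
    np.random.default_rng(seed).shuffle(X)   # out-of-order (shuffled) blocks
    nbS = X.shape[0] // bS
    return X[:nbS * bS].reshape(nbS, bS, -1)
```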
Process S12: first define the linear decoder model, consisting of a first layer as the input layer, a second layer as the hidden layer and a third layer as the output layer, with nL1, nL2 and nL3 neurons per layer respectively, where nL1 = nL3; the weight parameters between the first and second layers and between the second and third layers are $W^{(1)}$ and $W^{(2)}$ respectively, $w_{ji}^{(nl)}$ denotes the weight connecting the j-th neuron of the (nl+1)-th layer with the i-th neuron of the nl-th layer, and $b_j^{(nl)}$ the bias of the j-th neuron of the (nl+1)-th layer, nl ∈ {1, 2}; the activation function of the second-layer neurons is formula (1):
$$a_j^{(2)} = f^{(2)}\left(z_j^{(2)}\right) = \frac{e^{z_j^{(2)}} - e^{-z_j^{(2)}}}{e^{z_j^{(2)}} + e^{-z_j^{(2)}}} \qquad (1)$$
where $a_j^{(2)}$ is the output of the j-th second-layer neuron and $z_j^{(2)}$ is its input, given by formula (2):
$$z_j^{(2)} = \sum_{i=1}^{nL1} w_{ji}^{(1)} a_i^{(1)} + b_j^{(1)} \qquad (2)$$
Here $a_i^{(1)}$ is the output of the i-th first-layer neuron, i.e. an element of the image block vector, so that $a^{(1)} = x$; the activation function of the third-layer neurons is formula (3):
$$\hat{x} = a_j^{(3)} = f^{(3)}\left(z_j^{(3)}\right) = z_j^{(3)} \qquad (3)$$
that is, each third-layer neuron is a linear combination of the second-layer neurons, as in formula (4):
$$z_j^{(3)} = \sum_{i=1}^{nL2} w_{ji}^{(2)} a_i^{(2)} + b_j^{(2)} \qquad (4)$$
This yields the autoencoder objective function value of formula (5):
$$J(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2 \qquad (5)$$
where $\hat{x}$ is the output vector obtained when x is input into this model;
Process S13: after the most basic linear decoder is established, a weight attenuation term is added to the objective function to prevent the over-fitting caused by weight explosion, giving the objective function of formula (6):
$$J_{w\text{-}decay}(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2 + \frac{\lambda}{2} \sum_{l=1}^{Nl-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(w_{ji}^{(l)}\right)^2 \qquad (6)$$
where Nl is the number of layers of the model, here Nl = 3; $s_l$ is the number of neurons in the l-th layer and $s_{l+1}$ the number in the (l+1)-th layer; λ balances the importance of the weight attenuation term within the overall objective function;
Process S14: on the basis of S13, a sparsity characteristic is introduced into the model: for the hidden-layer neurons, most neurons should reach the inhibited state, with activation close to −1, for each input sample, and only a small fraction should have activation close to 1, so that the sparse structure of the input data is extracted; a sparse regularization term is added to the objective function, giving:
$$J_{sparse}(W, b; x, \hat{x}) = J_{w\text{-}decay}(W, b; x, \hat{x}) + \beta \sum_{j=1}^{s_2} L1\left(\rho \,\middle\|\, \hat{\rho}_j\right) \qquad (7)$$
That is, the sparse regularization term keeps the average activation $\hat{\rho}_j$ of each hidden-layer neuron below a certain value, where the average activation of each neuron is:
$$\hat{\rho}_j = \frac{1}{bS} \sum_{i=1}^{bS} \left[a_j^{(2)}\left(x^{(i)}\right)\right] \qquad (8)$$
Formula (8) is the mean activation of each hidden-layer neuron over the input samples $x^{(i)}$; ρ is the sparsity coefficient controlling the mean activation level of the hidden layer; the activation of the hidden layer is constrained towards this set value with the L1 regularization formula:
$$L1\left(\rho \,\middle\|\, \hat{\rho}_j\right) = \left\|\hat{\rho}_j - \rho\right\| \qquad (9).$$
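The objective of formulas (5) to (9) can be written out directly. The following NumPy sketch is our own illustration (with tanh for formula (1) and a linear output for formula (3)); it computes $J_{sparse}$ for one mini-batch:

```python
import numpy as np

def sparse_decoder_loss(W1, b1, W2, b2, X, lam, beta, rho):
    """J_sparse of formula (7) on a (bS, nL1) mini-batch X:
    reconstruction error (5) + weight decay (6) + L1 sparsity (7)-(9)."""
    bS = X.shape[0]
    A2 = np.tanh(X @ W1.T + b1)        # hidden activations, formulas (1)-(2)
    X_hat = A2 @ W2.T + b2             # linear reconstruction, formulas (3)-(4)
    J = np.sum((X_hat - X) ** 2) / (2 * bS)                # formula (5)
    J += lam / 2 * (np.sum(W1 ** 2) + np.sum(W2 ** 2))     # formula (6)
    rho_hat = A2.mean(axis=0)                              # formula (8)
    J += beta * np.sum(np.abs(rho_hat - rho))              # formulas (7), (9)
    return J
```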
The topological linear decoder is built on the basis of the sparse linear decoder: by imposing topological constraints on the activations of the hidden-layer neurons, the model becomes a topological linear decoder. The hidden-layer neurons are grouped in order so that neurons within the same group have similar activation levels while neurons of different groups are mutually independent, which allows the model to learn the topological properties of the data. The process is as follows:
Process S21: after process S14, a sparse linear decoder is obtained; process S14, building on process S13, used the L1 regularization formula to constrain the average activations of all hidden-layer neurons near a set value. The topology is introduced by first grouping all neurons of the hidden layer: the second layer has nL2 neurons, which are arranged in a $\sqrt{nL2} \times \sqrt{nL2}$ matrix, denoted the topological group-selection matrix T. In this matrix the activation at any point is influenced by the neurons within an sk × sk region around that point, i.e. the neurons within the surrounding sk × sk region form one group; since the hidden layer has nL2 neurons in all, there are nL2 groups;
The sum of squares of the activations of all neurons in a group serves as the target value of that group, which gives the objective function of the topological linear decoder:
$$J_{topo}(W, b; x, \hat{x}) = J_{sparse}(W, b; x, \hat{x}) + \frac{\gamma}{bS} \sum \sqrt{V S^{.2} + \epsilon} \qquad (10)$$
where V is the grouping matrix of size nL2 × nL2, built as follows: for each group, i.e. each row vector of V, first define a marker matrix $F^{(t)}$ of the same size as the topological group-selection matrix T;
$$F_{ij}^{(t)} = \begin{cases} 1 & i, j \in Sg^{(t)} \\ 0 & \text{otherwise} \end{cases}, \quad t \in [0, nL2 - 1] \qquad (11)$$
$F_{ij}^{(t)}$ is the value in row i, column j of the marker matrix of the t-th group, and $Sg^{(t)}$ is the topological selection region of the t-th group:
$$Sg^{(t)} = [rSt : rSe,\; cSt : cSe], \quad \begin{aligned} rSt &= \bmod\left(\lfloor t/\sqrt{nL2} \rfloor + 0,\; \sqrt{nL2}\right) \\ rSe &= \bmod\left(\lfloor t/\sqrt{nL2} \rfloor + sk,\; \sqrt{nL2}\right) \\ cSt &= \bmod\left(\bmod(t, \sqrt{nL2}) + 0,\; \sqrt{nL2}\right) \\ cSe &= \bmod\left(\bmod(t, \sqrt{nL2}) + sk,\; \sqrt{nL2}\right) \end{aligned} \qquad (12)$$
where mod is the modulo function. Hence for the grouping matrix:
$$V\left(t,\; i \times \sqrt{nL2} + j\right) = F_{ij}^{(t)}, \quad t \in [0, nL2 - 1],\; i, j \in \left[0, \sqrt{nL2} - 1\right] \qquad (13)$$
That is, V(r, c) = 1 indicates that the c-th neuron belongs to the r-th group; in formula (10), S is the nL2 × bS matrix formed by the hidden-layer activations, $S^{.2}$ denotes its element-wise square, ε is a smoothing parameter preventing the root of a singular (zero) value, and γ balances the importance of the topological regularization term within the overall objective function;
Process S22: the image blocks of all video frames in the training set obtained by process S11 form an nP × vS matrix, where vS is the number of neurons in the input layer of the topological sparse linear decoder, i.e. vS = nL1 = k × k × 3, the number of pixels in an RGB three-channel sliding window; the middle layer of the model is the hidden layer, and after the model is trained the output values of this layer serve as the feature values corresponding to an input; because the nP × vS matrix is too large, it is first divided into batches of size bS × vS, and the BP algorithm trains one batch at a time; one pass over all the training data constitutes one epoch, and several epochs are trained until the model converges.
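The grouping matrix V of formulas (11) to (13) and the topographic penalty of formula (10) can be illustrated as follows; this is a sketch under the assumption that nL2 is a perfect square, and the function and variable names are ours:

```python
import numpy as np

def grouping_matrix(nL2, sk):
    """Build the nL2 x nL2 grouping matrix V of formulas (11)-(13):
    hidden units lie on a sqrt(nL2) x sqrt(nL2) grid, and group t selects
    the sk x sk neighborhood (with wrap-around, formula (12)) at unit t."""
    g = int(round(nL2 ** 0.5))
    assert g * g == nL2, "nL2 must be a perfect square"
    V = np.zeros((nL2, nL2))
    for t in range(nL2):
        r0, c0 = t // g, t % g
        for dr in range(sk):
            for dc in range(sk):
                i, j = (r0 + dr) % g, (c0 + dc) % g  # mod = wrap-around
                V[t, i * g + j] = 1.0                # formula (13)
    return V

def topo_penalty(V, S, gamma, eps, bS):
    """Topographic term of formula (10); S is the nL2 x bS matrix of
    hidden activations and S**2 its element-wise square."""
    return gamma / bS * np.sum(np.sqrt(V @ (S ** 2) + eps))
```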
The weight parameters of the trained topological linear decoder serve as the initial parameters of the convolutional neural network, which is subsequently fine-tuned with a small number of labeled samples to obtain better parameters. The concrete process is as follows:
Process S31: the input layer of the convolutional neural network model is a video frame image, i.e. n × n × 3; a convolutional layer contains multiple feature maps, each feature map shares one convolution kernel, and the receptive field size of each kernel is k × k × 3; the convolutional layer is fully connected to the preceding layer, i.e. each feature map of the convolutional layer is associated with every feature map of the preceding layer:
$$x_j^{l} = f\left(\sum_{i \in M_j} x_i^{l-1} * w_{ij}^{l} + b_j^{l}\right) \qquad (14)$$
where $x_j^l$ is the j-th feature map of layer l; $x_i^{l-1}$ is the i-th feature map of layer l−1; $w_{ij}^l$ is the connection weight between the j-th feature map of layer l and the i-th feature map of layer l−1; and $b_j^l$ is the bias of the j-th feature map of layer l;
Process S32: the structure of the topological linear decoder trained in process S22 is nL1, nL2, nL3, and each neuron of its hidden layer is fully connected to each neuron of the input layer, as shown in formulas (2) and (3); the weights between one hidden unit and the input layer are assigned to the pixels of the preceding-layer receptive field corresponding to each feature map in the convolutional layer of the convolutional neural network, i.e. they become the weight values on the convolution kernel.
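Process S32 amounts to a reshape: each hidden unit's weight vector becomes one convolution kernel. A minimal sketch, assuming W1 holds the decoder's input-to-hidden weights as an (nL2, k*k*3) array as in the sketches above:

```python
import numpy as np

def decoder_to_conv_kernels(W1, k):
    """Process S32: the nL2 hidden units of the trained topological linear
    decoder initialize nL2 feature maps; each unit's length-k*k*3 weight
    vector becomes the k x k x 3 kernel of one feature map."""
    return W1.reshape(W1.shape[0], k, k, 3)
```

In the embodiment below, with k = 7 and 400 hidden units, this yields 400 kernels of size 7 × 7 × 3 for the first convolutional layer.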
The generic video feature extractor is obtained by fine-tuning the convolutional neural network model, under multi-fold cross validation, on a new training set formed from multiple key frames of the video training set. After this generic feature extractor is obtained, the features extracted on the training set and the test set are fed into a support vector machine for video semantic classification. The process is realized as follows:
Process S41: adopt multi-fold cross validation to divide the video set into a training set and a test set; the preceding processes are carried out on all video frames of the training set. Key frames are first chosen from every video of the training set at intervals of sF frames and used as the key frames of that video: if the mf-th video has $mF^{(mf)}$ frames in total, the frames 1 : sF : $mF^{(mf)}$ are marked as the image key frames of that video and labeled with its video class $y^{(mf)}$; all key frames of the training-set videos then form the data set for fine-tuning the convolutional neural network model;
Process S42: with Softmax as the top-layer model of the convolutional neural network model, fine-tune the model with the BP algorithm until convergence; then remove the top Softmax layer to obtain the generic feature extractor for this video data set, and let nLo be the number of output-layer units of the convolutional neural network;
Process S43: extract convolutional neural network features on the key frames of the training set and the test set obtained in process S41. If the mf-th video has $mKF^{(mf)}$ key frames, each video yields an mKF × nLo feature matrix whose rows index the key frames and whose columns hold the features extracted on the corresponding key frame. Divide the rows of this feature matrix into pS parts, each an (mKF/pS) × nLo matrix, i.e. a matrix of mKF/pS rows and nLo columns; average each part along its rows to obtain a feature vector of length nLo on that part, and concatenate the vectors of the different parts end to end to obtain a feature vector of length nLo × pS as the feature vector of the video;
Process S44: the preceding processes yield the feature matrices and label matrices of the training set and the test set; these features are put into the support vector machine model for the final semantic concept prediction.
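Processes S43 and S44 can be illustrated as follows; this is a sketch in which the feature extractor is assumed to return one nLo-vector per key frame, and scikit-learn's SVC stands in for the unspecified support vector machine:

```python
import numpy as np
from sklearn.svm import SVC

def video_feature(key_frame_feats, pS):
    """Process S43: split the mKF x nLo key-frame feature matrix into pS
    row-blocks, average each block over its rows, and concatenate the pS
    means into one length nLo*pS descriptor per video."""
    mKF = key_frame_feats.shape[0]
    usable = key_frame_feats[: mKF - mKF % pS]   # the patent assumes pS | mKF
    parts = np.array_split(usable, pS)
    return np.concatenate([p.mean(axis=0) for p in parts])

def classify(train_feats, train_labels, test_feats):
    """Process S44: final semantic concept prediction with an SVM
    (kernel and hyperparameters are not fixed by the patent)."""
    svm = SVC()
    svm.fit(train_feats, train_labels)
    return svm.predict(test_feats)
```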
The present invention has the following beneficial effects. By combining topological properties with unsupervised pre-training and using only a small number of supervised samples, the invention overcomes the need of convolutional neural network training for large numbers of labeled samples and its slow convergence; the accuracy and robustness of the model exceed those of a model without pre-training, and the model is better suited to characteristics of video data such as target translation, object scaling and object rotation. When the features extracted by the convolutional neural network model pre-trained with the topological model are used for video semantic analysis, the accuracy of the model in video semantic analysis is effectively improved.
Brief description of the drawings
Fig. 1 is a flow chart of the construction of the topological linear decoder.
Fig. 2 is a schematic flow chart of video semantic analysis detection.
Fig. 3 is a schematic diagram of the topological linear decoder.
Detailed description of the invention
The technical scheme of the present invention is described in further detail below in conjunction with the drawings and specific embodiments.
Referring to Fig. 1 and Fig. 2, according to a preferred embodiment of the present invention, the video semantic analysis method based on pre-training a convolutional neural network with a topological model comprises the following steps. S1: preprocess the video training set and build a sparse linear decoder. S2: add a topological-property constraint to build the topological linear decoder, and divide the frames of the video training set into image blocks to train it. S3: use the parameters of the trained topological linear decoder as the initial parameters of a convolutional layer in a convolutional neural network. S4: adopt multi-fold cross validation and build a key frame set from the video training set to fine-tune the convolutional neural network, obtain a generic feature extractor for the video data, and finally feed the features extracted on the training set and the test set into a support vector machine for video semantic classification.
Referring to Fig. 1 and Fig. 3, in the construction of the above topological linear decoder, a linear decoder model is first defined and a topological regularization term is then introduced into it, with the relative importance of the regularization term within the overall objective function adjusted through its coefficient. The process is realized as follows:
Process S11: let m denote the number of videos in the video training set, let the mf-th video contain $mF^{(mf)}$ image frames in total, and let its label be $y^{(mf)}$. First extract all frames of the m videos and resize each frame to n × n × 3, where n is the width and height of each frame and 3 indicates the RGB color standard. Set up a sliding window of size k × k with stride p; sliding this window, one frame yields $((n-k)/p + 1)^2$ image blocks, and the whole video training set yields $M = \sum_{mf} mF^{(mf)} \cdot ((n-k)/p + 1)^2$ image blocks in total. Flatten each image block into a vector x of length k × k × 3, shuffle all the image blocks, and divide them into nbS = M/bS batches of bS training samples each; the resulting data set serves as the training set for the topological linear decoder;
Process S12: first define the linear decoder model: it consists of a first layer as the input layer, a second layer as the hidden layer and a third layer as the output layer, with nL1, nL2 and nL3 neurons per layer respectively, where nL1 = nL3. The weight parameters between the first and second layers and between the second and third layers are $W^{(1)}$ and $W^{(2)}$ respectively; $w_{ji}^{(nl)}$ denotes the weight connecting the j-th neuron of the (nl+1)-th layer with the i-th neuron of the nl-th layer, and $b_j^{(nl)}$ the bias of the j-th neuron of the (nl+1)-th layer, where nl ∈ {1, 2}. The activation function of the second-layer neurons is:
Formula (1): $a_j^{(2)} = f^{(2)}(z_j^{(2)}) = \dfrac{e^{z_j^{(2)}} - e^{-z_j^{(2)}}}{e^{z_j^{(2)}} + e^{-z_j^{(2)}}}$
where $a_j^{(2)}$ is the output of the j-th second-layer neuron and $z_j^{(2)}$ is its input:
Formula (2): $z_j^{(2)} = \sum_{i=1}^{nL1} w_{ji}^{(1)} a_i^{(1)} + b_j^{(1)}$
where $a_i^{(1)}$ is the output of the i-th first-layer neuron, i.e. an element of the image block vector, so that $a^{(1)} = x$. The activation function of the third-layer neurons is:
Formula (3): $\hat{x} = a_j^{(3)} = f^{(3)}(z_j^{(3)}) = z_j^{(3)}$
that is, each third-layer neuron is a linear combination of the second-layer neurons:
Formula (4): $z_j^{(3)} = \sum_{i=1}^{nL2} w_{ji}^{(2)} a_i^{(2)} + b_j^{(2)}$
This yields the autoencoder objective function value:
Formula (5): $J(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2$
where $\hat{x}$ is the output vector obtained when x is input into this model.
Process S13: after the most basic linear decoder is established, a weight attenuation term is added to the objective function to prevent the over-fitting caused by weight explosion, giving the objective function:
Formula (6): $J_{w\text{-}decay}(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2 + \frac{\lambda}{2} \sum_{l=1}^{Nl-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (w_{ji}^{(l)})^2$
where Nl is the number of layers of the model, here Nl = 3; $s_l$ is the number of neurons in the l-th layer and $s_{l+1}$ the number in the (l+1)-th layer; λ balances the importance of the weight attenuation term within the overall objective function.
Process S14: on the basis of S13, a sparsity characteristic is introduced into the model: for the hidden-layer neurons, most neurons should reach the inhibited state, with activation close to −1, for each input sample, and only a small fraction should have activation close to 1, so that the sparse structure of the input data is extracted. A sparse regularization term is added to the objective function, giving:
Formula (7): $J_{sparse}(W, b; x, \hat{x}) = J_{w\text{-}decay}(W, b; x, \hat{x}) + \beta \sum_{j=1}^{s_2} L1(\rho \,\|\, \hat{\rho}_j)$
that is, the average activation $\hat{\rho}_j$ of each hidden-layer neuron is kept below a certain value:
Formula (8): $\hat{\rho}_j = \frac{1}{bS} \sum_{i=1}^{bS} [a_j^{(2)}(x^{(i)})]$
which is the mean activation of each hidden-layer neuron over the input samples $x^{(i)}$; ρ is the sparsity coefficient controlling the mean activation level of the hidden layer. The activation of the hidden layer is constrained towards this set value with the L1 regularization formula:
Formula (9): $L1(\rho \,\|\, \hat{\rho}_j) = \|\hat{\rho}_j - \rho\|$
In the present embodiment, referring to Fig. 1, preferably, on the basis of the established sparse linear decoder, topological constraints are imposed on the activations of the hidden-layer neurons so that the model becomes a topological linear decoder: the hidden-layer neurons are grouped in order so that neurons within the same group have similar activation levels while neurons of different groups are mutually independent, allowing the model to learn the topological properties of the data. The process is realized as follows:
Process S21: after step S14, a sparse linear decoder is obtained. Process S14, building on process S13, used the L1 regularization formula to constrain the average activations of all hidden-layer neurons near a set value. The topology is introduced by first grouping all neurons of the hidden layer: the second layer has nL2 neurons, which are arranged in a $\sqrt{nL2} \times \sqrt{nL2}$ matrix, denoted the topological group-selection matrix T. In this matrix the activation at any point is influenced by the neurons within an sk × sk region around that point, i.e. the neurons within the surrounding sk × sk region form one group; since the hidden layer has nL2 neurons in all, there are nL2 groups. The sum of squares of the activations of all neurons in a group serves as the target value of that group, which gives the objective function of the topological linear decoder:
Formula (10): $J_{topo}(W, b; x, \hat{x}) = J_{sparse}(W, b; x, \hat{x}) + \frac{\gamma}{bS} \sum \sqrt{V S^{.2} + \epsilon}$
where V is the grouping matrix of size nL2 × nL2, built as follows: for each group, i.e. each row vector of V, first define a marker matrix $F^{(t)}$ of the same size as the topological group-selection matrix T, where:
Formula (11): $F_{ij}^{(t)} = \begin{cases} 1 & i, j \in Sg^{(t)} \\ 0 & \text{otherwise} \end{cases}, \quad t \in [0, nL2 - 1]$
$F_{ij}^{(t)}$ is the value in row i, column j of the marker matrix of the t-th group, and $Sg^{(t)}$ is the topological selection region of the t-th group:
Formula (12): $Sg^{(t)} = [rSt : rSe,\; cSt : cSe]$, with $rSt = \bmod(\lfloor t/\sqrt{nL2} \rfloor + 0, \sqrt{nL2})$, $rSe = \bmod(\lfloor t/\sqrt{nL2} \rfloor + sk, \sqrt{nL2})$, $cSt = \bmod(\bmod(t, \sqrt{nL2}) + 0, \sqrt{nL2})$, $cSe = \bmod(\bmod(t, \sqrt{nL2}) + sk, \sqrt{nL2})$
where mod is the modulo function. Hence for the grouping matrix:
Formula (13): $V(t,\; i \times \sqrt{nL2} + j) = F_{ij}^{(t)}$, $t \in [0, nL2 - 1]$, $i, j \in [0, \sqrt{nL2} - 1]$,
that is, V(r, c) = 1 indicates that the c-th neuron belongs to the r-th group; in formula (10), S is the nL2 × bS matrix formed by the hidden-layer activations, $S^{.2}$ denotes its element-wise square, ε is a smoothing parameter preventing the root of a singular (zero) value, and γ balances the importance of the topological regularization term within the overall objective function.
Process S22: the image blocks of all video frames in the training set obtained by process S11 form an nP × vS matrix, where vS is the number of neurons in the input layer of the topological sparse linear decoder, i.e. vS = nL1 = k × k × 3, the number of pixels in an RGB three-channel sliding window; the middle layer of the model is the hidden layer, and after the model is trained the output values of this layer serve as the feature values corresponding to an input. Because the nP × vS matrix is too large, it is first divided into batches of size bS × vS, and the BP algorithm trains one batch at a time; one pass over all the training data constitutes one epoch, and several epochs are trained until the model converges.
Preferably, the weight parameters of the topological linear decoder trained on a large number of unlabeled image blocks are used as the initial parameters of the convolutional layer in the convolutional neural network model, laying the foundation for the subsequent fine-tuning. The process is realized as follows:
Process S31: the input layer of the convolutional neural network model is a video frame image, i.e. n × n × 3. A convolutional layer contains multiple feature maps, each feature map shares one convolution kernel, and the receptive field size of each kernel is k × k × 3; the convolutional layer is fully connected to the preceding layer, i.e. each feature map of the convolutional layer is associated with every feature map of the preceding layer:
Formula (14): $x_j^{l} = f\left(\sum_{i \in M_j} x_i^{l-1} * w_{ij}^{l} + b_j^{l}\right)$
where $x_j^l$ is the j-th feature map of layer l; $x_i^{l-1}$ is the i-th feature map of layer l−1; $w_{ij}^l$ is the connection weight between the j-th feature map of layer l and the i-th feature map of layer l−1; and $b_j^l$ is the bias of the j-th feature map of layer l.
Process S32: the structure of the topological linear decoder trained in process S22 is nL1, nL2, nL3, and each neuron of its hidden layer is fully connected to each neuron of the input layer, as shown in formulas (2) and (3); the weights between one hidden unit and the input layer are assigned to the pixels of the preceding-layer receptive field corresponding to each feature map in the convolutional layer of the convolutional neural network, i.e. they become the weight values on the convolution kernel.
Referring to Fig. 2, the convolutional neural network model is fine-tuned, under multi-fold cross validation, on the training set formed from multiple key frames of the video training set; a generic video feature extractor is thus established, and finally the features extracted on the training set and the test set are fed into a support vector machine for video semantic classification. The process is realized as follows:
Process S41: adopt multi-fold cross validation to divide the video set into a training set and a test set; the preceding processes are carried out on all video frames of the training set. Key frames are first chosen from every video of the training set at intervals of sF frames and used as the key frames of that video: if the mf-th video has $mF^{(mf)}$ frames in total, the frames 1 : sF : $mF^{(mf)}$ are marked as the image key frames of that video and labeled with its video class $y^{(mf)}$; all key frames of the training-set videos then form the data set for fine-tuning the convolutional neural network model.
Process S42: with Softmax as the top-layer model of the convolutional neural network model, fine-tune the model with the BP algorithm until convergence; then remove the top Softmax layer to obtain the generic feature extractor for this video data set, and let nLo be the number of output-layer units of the convolutional neural network;
Process S43: extract convolutional neural network features on the key frames of the training set and the test set obtained in process S41. If the mf-th video has $mKF^{(mf)}$ key frames, each video yields an mKF × nLo feature matrix whose rows index the key frames and whose columns hold the features extracted on the corresponding key frame. Divide the rows of this feature matrix into pS parts, each an (mKF/pS) × nLo matrix, i.e. a matrix of mKF/pS rows and nLo columns; average each part along its rows to obtain a feature vector of length nLo on that part, and concatenate the vectors of the different parts end to end to obtain a feature vector of length nLo × pS as the feature vector of the video;
Process S44: the preceding processes yield the feature matrices and label matrices of the training set and the test set; these features are put into the support vector machine model for the final semantic concept prediction.
A concrete example follows: semantic analysis of ten classes of videos from TRECVID 2012, namely Airplane Flying, Baby, Building, Car, Dog, Flower, Instrumental Musician, Mountain, Scene Text and Speech.
First, multi-fold cross validation divides the video set into a training set and a test set, and the video sequences within them are shuffled to remove contextual ordering between videos; then all videos in the training set are separated into RGB color video frame images, each frame is divided into image blocks of size 7 × 7 × 3, and the blocks are shuffled to generate a shuffled image-block matrix.
Next, the technical scheme of the present invention is used to build and refine the model. First, according to step S1 above, the topological property is added on the basis of the sparse linear decoder to build the topological linear decoder: the input layer has 7 × 7 × 3 neurons, the middle layer 400 neurons and the output layer 7 × 7 × 3 neurons, with λ = 0.003 in formula (6), β = 0.1 and ρ = −0.095 in formula (7), and γ = 0.08 in formula (10). The BP algorithm iteratively trains the topological linear decoder until convergence, and the weight parameters between the input layer and the middle layer of this topological linear decoder are used as the weight parameters of the first convolutional layer of the convolutional neural network model;
Then the videos in the training set are separated into RGB color video frame images which, without division into image blocks, are fed directly into the pre-trained one-layer convolutional neural network; its output serves as the training data for the next topological linear decoder;
The converged second topological linear decoder is obtained in the same way, and the parameters between its input layer and middle layer serve as the initialization parameters of the second convolutional layer, giving a pre-trained two-layer convolutional neural network model. The video sequences in the training set are shuffled and processed according to step S4 above to obtain the training set for fine-tuning the convolutional neural network; training yields the feature vector of each video, which is put into the support vector machine for the final result prediction.
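The layer-wise stacking of this embodiment pushes whole frames through the first pre-trained convolutional layer and trains the second decoder on the result. A sketch of that forward pass follows (our own illustration; a tanh nonlinearity is assumed, matching the decoder's hidden activation, and the kernels argument is the array produced by the reshape sketched earlier):

```python
import numpy as np

def conv_forward(frame, kernels, stride=1):
    """Valid convolution of one n x n x 3 frame with the pre-trained
    kernels; the output feature maps are the training data for the
    second topological linear decoder."""
    k, n = kernels.shape[1], frame.shape[0]
    side = (n - k) // stride + 1
    out = np.empty((side, side, len(kernels)))
    for m, w in enumerate(kernels):
        for r in range(side):
            for c in range(side):
                patch = frame[r*stride:r*stride+k, c*stride:c*stride+k, :]
                out[r, c, m] = np.tanh(np.sum(patch * w))
    return out
```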
To evaluate the performance of the method of the present invention on video semantic analysis detection, the commonly used mean average precision (MAP, Mean Avg-Precision) is adopted as the measurement index. Key frames are extracted from the test videos in the same way and their feature vectors obtained, and semantic analysis detection is carried out according to step S4. The comparison methods are SIFT features with a BoW bag-of-words model, LBP features with a histogram model, a randomly initialized convolutional neural network model, and a non-topological convolutional neural network model pre-trained with a sparse linear decoder, against the convolutional neural network model of the present invention pre-trained with the topological model. Using 5-fold cross validation, the video semantic analysis detection results of the comparison methods on the same test videos are shown in Table 1, where CNN denotes the convolutional neural network, LD-CNN the convolutional neural network model pre-trained with the sparse linear decoder, and TLD-CNN the convolutional neural network model pre-trained with the topological linear decoder.
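The MAP index can be computed per concept and then averaged; a hedged sketch using scikit-learn (binary relevance labels and per-concept SVM decision scores are assumed, which the patent does not specify):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """MAP over semantic concepts: average precision is computed for each
    concept (column) and the per-concept values are averaged.
    y_true: (n_videos, n_concepts) binary; y_score: same shape, scores."""
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps))
```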
Table 1: Video semantic analysis detection results
The data of Table 1 show that, under the same learning mechanism, the overall composite index of the results obtained with the convolutional neural network pre-trained on the topological model provided by the present invention is superior to the other comparison models, and the detection of each individual semantic concept is generally also superior to the other methods.
In summary, in the video semantic analysis method based on pre-training a convolutional neural network with a topological model provided by the present invention, the scheme first trains a linear decoder with topological properties without supervision and then fine-tunes the pre-trained convolutional neural network model with a small number of supervised samples, solving the slow convergence of convolutional neural network model training; with the topological constraint introduced, the learned model parameters better cope with video data samples of varied content, improving the accuracy and robustness of the model.
Although the present invention is disclosed above with preferred embodiments, they do not limit the present invention. Any person with ordinary knowledge in the technical field of the present invention may make various modifications and variations without departing from the spirit and scope of the present invention. The protection scope of the present invention is therefore defined by the claims.

Claims (5)

1. A video semantic analysis method, characterized in that it comprises the following steps:
S1: preprocess the video training set and build a sparse linear decoder;
S2: add a topological-property constraint to the sparse linear decoder to obtain a topological linear decoder, divide the frames of the video training set into image blocks to build an image-block training set, and train the topological linear decoder on it;
S3: use the weight parameters of the trained topological linear decoder as the initial parameters of a convolutional layer in a convolutional neural network;
S4: adopt multi-fold cross validation, build a key frame set from the video training set to fine-tune the convolutional neural network, thereby obtaining a generic feature extractor for the video data, and finally feed the features extracted on the training set and the test set into a support vector machine for video semantic classification.
2. The video semantic analysis method according to claim 1, characterized in that: in the construction of the sparse linear decoder model, a linear decoder model is first defined, weight-decay and sparse regularization terms are then introduced into it, and the relative importance of each regularization term within the overall objective function is adjusted through its coefficient; the concrete process is as follows:
Process S11: let m denote the number of videos in the video training set, let the mf-th video contain $mF^{(mf)}$ image frames in total, and let its label be $y^{(mf)}$; first extract all frames of the m videos and resize each frame to n × n × 3, where n is the width and height of each frame and 3 indicates the RGB color standard; set up a sliding window of size k × k with stride p; sliding this window, one frame yields $((n-k)/p + 1)^2$ image blocks, and the whole video training set yields $M = \sum_{mf} mF^{(mf)} \cdot ((n-k)/p + 1)^2$ image blocks in total; flatten each image block into a vector x of length k × k × 3, shuffle all the image blocks, and divide them into nbS = M/bS batches of bS training samples each; the resulting data set serves as the training set for the topological linear decoder;
Process S12: first define the linear decoder model, consisting of a first layer as the input layer, a second layer as the hidden layer and a third layer as the output layer, with nL1, nL2 and nL3 neurons per layer respectively, where nL1 = nL3; the weight parameters between the first and second layers and between the second and third layers are $W^{(1)}$ and $W^{(2)}$ respectively, $w_{ji}^{(nl)}$ denotes the weight connecting the j-th neuron of the (nl+1)-th layer with the i-th neuron of the nl-th layer, and $b_j^{(nl)}$ the bias of the j-th neuron of the (nl+1)-th layer, nl ∈ {1, 2}; the activation function of the second-layer neurons is formula (1):
$$a_j^{(2)} = f^{(2)}\left(z_j^{(2)}\right) = \frac{e^{z_j^{(2)}} - e^{-z_j^{(2)}}}{e^{z_j^{(2)}} + e^{-z_j^{(2)}}} \qquad (1)$$
where $a_j^{(2)}$ is the output of the j-th second-layer neuron and $z_j^{(2)}$ is its input, given by formula (2):
$$z_j^{(2)} = \sum_{i=1}^{nL1} w_{ji}^{(1)} a_i^{(1)} + b_j^{(1)} \qquad (2)$$
Here $a_i^{(1)}$ is the output of the i-th first-layer neuron, i.e. an element of the image block vector, so that $a^{(1)} = x$; the activation function of the third-layer neurons is formula (3):
$$\hat{x} = a_j^{(3)} = f^{(3)}\left(z_j^{(3)}\right) = z_j^{(3)} \qquad (3)$$
that is, each third-layer neuron is a linear combination of the second-layer neurons, as in formula (4):
$$z_j^{(3)} = \sum_{i=1}^{nL2} w_{ji}^{(2)} a_i^{(2)} + b_j^{(2)} \qquad (4)$$
This yields the autoencoder objective function value of formula (5):
$$J(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2 \qquad (5)$$
where $\hat{x}$ is the output vector obtained when x is input into this model;
Process S13: after the most basic linear decoder is established, a weight attenuation term is added to the objective function to prevent the over-fitting caused by weight explosion, giving the objective function of formula (6):
$$J_{w\text{-}decay}(W, b; x, \hat{x}) = \frac{1}{2\,bS}\,\|\hat{x} - x\|^2 + \frac{\lambda}{2} \sum_{l=1}^{Nl-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(w_{ji}^{(l)}\right)^2 \qquad (6)$$
where Nl is the number of layers of the model, here Nl = 3; $s_l$ is the number of neurons in the l-th layer and $s_{l+1}$ the number in the (l+1)-th layer; λ balances the importance of the weight attenuation term within the overall objective function;
Process S14: on the basis of S13, a sparsity characteristic is introduced into the model: for the hidden-layer neurons, most neurons should reach the inhibited state, with activation close to −1, for each input sample, and only a small fraction should have activation close to 1, so that the sparse structure of the input data is extracted; a sparse regularization term is added to the objective function, giving:
$$J_{sparse}(W, b; x, \hat{x}) = J_{w\text{-}decay}(W, b; x, \hat{x}) + \beta \sum_{j=1}^{s_2} L1\left(\rho \,\middle\|\, \hat{\rho}_j\right) \qquad (7)$$
That is, the sparse regularization term keeps the average activation $\hat{\rho}_j$ of each hidden-layer neuron below a certain value, where the average activation of each neuron is:
$$\hat{\rho}_j = \frac{1}{bS} \sum_{i=1}^{bS} \left[a_j^{(2)}\left(x^{(i)}\right)\right] \qquad (8)$$
Formula (8) is the mean activation of each hidden-layer neuron over the input samples $x^{(i)}$; ρ is the sparsity coefficient controlling the mean activation level of the hidden layer; the activation of the hidden layer is constrained towards this set value with the L1 regularization formula:
$$L1\left(\rho \,\middle\|\, \hat{\rho}_j\right) = \left\|\hat{\rho}_j - \rho\right\| \qquad (9).$$
3. The video semantic analysis method according to claim 1, characterized in that: the topological linear decoder is built on the basis of the sparse linear decoder; by imposing topological constraints on the activations of the hidden-layer neurons, the model becomes a topological linear decoder: the hidden-layer neurons are grouped in order so that neurons within the same group have similar activation levels while neurons of different groups are mutually independent, allowing the model to learn the topological properties of the data; the process is realized as follows:
Process S21: after process S14, a sparse linear decoder is obtained; process S14, building on process S13, used the L1 regularization formula to constrain the average activations of all hidden-layer neurons near a set value. The topology is introduced by first grouping all neurons of the hidden layer: the second layer has nL2 neurons, which are arranged in a $\sqrt{nL2} \times \sqrt{nL2}$ matrix, denoted the topological group-selection matrix T. In this matrix the activation at any point is influenced by the neurons within an sk × sk region around that point, i.e. the neurons within the surrounding sk × sk region form one group; since the hidden layer has nL2 neurons in all, there are nL2 groups;
The sum of squares of the activations of all neurons in a group serves as the target value of that group, which gives the objective function of the topological linear decoder:
$$J_{topo}(W, b; x, \hat{x}) = J_{sparse}(W, b; x, \hat{x}) + \frac{\gamma}{bS} \sum \sqrt{V S^{.2} + \epsilon} \qquad (10)$$
where V is the grouping matrix of size nL2 × nL2, built as follows: for each group, i.e. each row vector of V, first define a marker matrix $F^{(t)}$ of the same size as the topological group-selection matrix T;
$$F_{ij}^{(t)} = \begin{cases} 1 & i, j \in Sg^{(t)} \\ 0 & \text{otherwise} \end{cases}, \quad t \in [0, nL2 - 1] \qquad (11)$$
$F_{ij}^{(t)}$ is the value in row i, column j of the marker matrix of the t-th group, and $Sg^{(t)}$ is the topological selection region of the t-th group:
$$Sg^{(t)} = [rSt : rSe,\; cSt : cSe], \quad \begin{aligned} rSt &= \bmod\left(\lfloor t/\sqrt{nL2} \rfloor + 0,\; \sqrt{nL2}\right) \\ rSe &= \bmod\left(\lfloor t/\sqrt{nL2} \rfloor + sk,\; \sqrt{nL2}\right) \\ cSt &= \bmod\left(\bmod(t, \sqrt{nL2}) + 0,\; \sqrt{nL2}\right) \\ cSe &= \bmod\left(\bmod(t, \sqrt{nL2}) + sk,\; \sqrt{nL2}\right) \end{aligned} \qquad (12)$$
where mod is the modulo function; hence for the grouping matrix:
$$V\left(t,\; i \times \sqrt{nL2} + j\right) = F_{ij}^{(t)}, \quad t \in [0, nL2 - 1],\; i, j \in \left[0, \sqrt{nL2} - 1\right] \qquad (13)$$
That is, V(r, c) = 1 indicates that the c-th neuron belongs to the r-th group; in formula (10), S is the nL2 × bS matrix formed by the hidden-layer activations, $S^{.2}$ denotes its element-wise square, ε is a smoothing parameter preventing the root of a singular (zero) value, and γ balances the importance of the topological regularization term within the overall objective function;
Process S22: the image blocks of all video frames in the training set obtained by process S11 form an nP × vS matrix, where vS is the number of neurons in the input layer of the topological sparse linear decoder, i.e. vS = nL1 = k × k × 3, the number of pixels in an RGB three-channel sliding window; the middle layer of the model is the hidden layer, and after the model is trained the output values of this layer serve as the feature values corresponding to an input; because the nP × vS matrix is too large, it is first divided into batches of size bS × vS, and the BP algorithm trains one batch at a time; one pass over all the training data constitutes one epoch, and several epochs are trained until the model converges.
4. The video semantic analysis method according to claim 1, characterized in that the weight parameters of the trained topological linear decoder serve as the initial parameters of the convolutional neural network, which is subsequently fine-tuned with a small number of labeled samples to obtain better parameters; the concrete process is as follows:
Process S31: the input layer of the convolutional neural network model is a video frame image, i.e. n × n × 3; a convolutional layer contains multiple feature maps, each feature map shares one convolution kernel, and the receptive field size of each kernel is k × k × 3; the convolutional layer is fully connected to the preceding layer, i.e. each feature map of the convolutional layer is associated with every feature map of the preceding layer:
$$x_j^{l} = f\left(\sum_{i \in M_j} x_i^{l-1} * w_{ij}^{l} + b_j^{l}\right) \qquad (14)$$
where $x_j^l$ is the j-th feature map of layer l; $x_i^{l-1}$ is the i-th feature map of layer l−1; $w_{ij}^l$ is the connection weight between the j-th feature map of layer l and the i-th feature map of layer l−1; and $b_j^l$ is the bias of the j-th feature map of layer l;
Process S32: the structure of the topological linear decoder trained in process S22 is nL1, nL2, nL3, and each neuron of its hidden layer is fully connected to each neuron of the input layer, as shown in formulas (2) and (3); the weights between one hidden unit and the input layer are assigned to the pixels of the preceding-layer receptive field corresponding to each feature map in the convolutional layer of the convolutional neural network, i.e. they become the weight values on the convolution kernel.
5. The video semantic analysis method according to claim 1, characterized in that: the generic video feature extractor is obtained by fine-tuning the convolutional neural network model, under multi-fold cross validation, on a new training set formed from multiple key frames of the video training set; after this generic feature extractor is obtained, the features extracted on the training set and the test set are fed into a support vector machine for video semantic classification; the process is realized as follows:
Process S41: adopt multi-fold cross validation to divide the video set into a training set and a test set; the preceding processes are carried out on all video frames of the training set; key frames are first chosen from every video of the training set at intervals of sF frames and used as the key frames of that video: if the mf-th video has $mF^{(mf)}$ frames in total, the frames 1 : sF : $mF^{(mf)}$ are marked as the image key frames of that video and labeled with its video class $y^{(mf)}$; all key frames of the training-set videos then form the data set for fine-tuning the convolutional neural network model;
Process S42: with Softmax as the top-layer model of the convolutional neural network model, fine-tune the model with the BP algorithm until convergence; then remove the top Softmax layer to obtain the generic feature extractor for this video data set, and let nLo be the number of output-layer units of the convolutional neural network;
Process S43: extract convolutional neural network features on the key frames of the training set and the test set obtained in process S41; if the mf-th video has $mKF^{(mf)}$ key frames, each video yields an mKF × nLo feature matrix whose rows index the key frames and whose columns hold the features extracted on the corresponding key frame; divide the rows of this feature matrix into pS parts, each an (mKF/pS) × nLo matrix, i.e. a matrix of mKF/pS rows and nLo columns; average each part along its rows to obtain a feature vector of length nLo on that part, and concatenate the vectors of the different parts end to end to obtain a feature vector of length nLo × pS as the feature vector of the video;
Process S44: the preceding processes yield the feature matrices and label matrices of the training set and the test set; these features are put into the support vector machine model for the final semantic concept prediction.
CN201610107770.5A, filed 2016-02-26 (priority 2016-02-26): Video semantic analysis method; Active; granted as CN105701480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610107770.5A CN105701480B (en) 2016-02-26 2016-02-26 Video semantic analysis method


Publications (2)

Publication Number Publication Date
CN105701480A true CN105701480A (en) 2016-06-22
CN105701480B CN105701480B (en) 2019-02-01

Family

ID=56222546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610107770.5A Active CN105701480B (en) 2016-02-26 2016-02-26 Video semantic analysis method

Country Status (1)

Country Link
CN (1) CN105701480B (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834941A (en) * 2015-05-19 2015-08-12 重庆大学 Offline handwriting recognition method of sparse autoencoder based on computer input

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HANLI GOH et al.: "Learning Invariant Color Features with Sparse Topographic Restricted Boltzmann Machines", International Conference on Image Processing *
ZHAN Yongzhao et al.: "Video semantic analysis method based on nonlinear discriminable sparse representation" (基于非线性可鉴别的稀疏表示视频语义分析方法), Journal of Jiangsu University (Natural Science Edition) *
ZHAN Yongzhao et al.: "Video semantic analysis using kernel-discriminable feature-block sparse representation" (核可鉴别的特征分块稀疏表示的视频语义分析), Journal of Computer-Aided Design & Computer Graphics *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107025267A (en) * 2017-03-01 2017-08-08 国政通科技股份有限公司 Method and system for retrieving video based on extracted key logical information
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 Video content description method guided by semantic information
CN107038221B (en) * 2017-03-22 2020-11-17 杭州电子科技大学 Video content description method guided by semantic information
CN108664844A (en) * 2017-03-28 2018-10-16 爱唯秀股份有限公司 Image object semantic recognition and tracking with convolutional deep neural networks
CN108665055A (en) * 2017-03-28 2018-10-16 上海荆虹电子科技有限公司 Image caption generation method and device
CN108665055B (en) * 2017-03-28 2020-10-23 深圳荆虹科技有限公司 Method and device for generating image captions
WO2018218481A1 (en) * 2017-05-31 2018-12-06 深圳市大疆创新科技有限公司 Neural network training method and device, computer system and mobile device
CN107391646B (en) * 2017-07-13 2020-04-10 清华大学 Semantic information extraction method and device for video images
CN107391646A (en) * 2017-07-13 2017-11-24 清华大学 Semantic feature extraction method and device for video images
CN109784129A (en) * 2017-11-14 2019-05-21 北京京东尚科信息技术有限公司 Information output method and device
CN108805036A (en) * 2018-05-22 2018-11-13 电子科技大学 Novel unsupervised video semantic extraction method
CN108805036B (en) * 2018-05-22 2022-11-22 电子科技大学 Unsupervised video semantic extraction method
CN108960059A (en) * 2018-06-01 2018-12-07 众安信息技术服务有限公司 Video action recognition method and device
CN109035488A (en) * 2018-08-07 2018-12-18 哈尔滨工业大学(威海) Aero-engine time series anomaly detection method based on CNN feature extraction
CN110738128A (en) * 2019-09-19 2020-01-31 天津大学 Repeated video detection method based on deep learning
CN111565318A (en) * 2020-05-06 2020-08-21 中国科学院重庆绿色智能技术研究院 Video compression method based on sparse samples
CN111695422A (en) * 2020-05-06 2020-09-22 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN111695422B (en) * 2020-05-06 2023-08-18 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN112016513A (en) * 2020-09-08 2020-12-01 北京达佳互联信息技术有限公司 Video semantic segmentation method, model training method, related device and electronic equipment
CN112016513B (en) * 2020-09-08 2024-01-30 北京达佳互联信息技术有限公司 Video semantic segmentation method, model training method, related device and electronic equipment

Also Published As

Publication number Publication date
CN105701480B (en) 2019-02-01


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant