CN110569814A - Video category identification method and device, computer equipment and computer storage medium - Google Patents

Video category identification method and device, computer equipment and computer storage medium Download PDF

Info

Publication number
CN110569814A
CN110569814A
Authority
CN
China
Prior art keywords
space
time
model
video
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910862697.6A
Other languages
Chinese (zh)
Other versions
CN110569814B (en)
Inventor
Xiao Dingkun (肖定坤)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN201910862697.6A priority Critical patent/CN110569814B/en
Publication of CN110569814A publication Critical patent/CN110569814A/en
Application granted granted Critical
Publication of CN110569814B publication Critical patent/CN110569814B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video category identification method and device, computer equipment and a computer storage medium, and belongs to the field of video identification. The method comprises the following steps: acquiring video data to be identified; and inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features. The problem in the prior art that the accuracy of the classification result is low is solved, and the effect of improving accuracy is achieved.

Description

Video category identification method and device, computer equipment and computer storage medium
Technical Field
The present invention relates to the field of video identification, and in particular, to a method and an apparatus for identifying a video category, a computer device, and a computer storage medium.
Background
At present, video big data is growing explosively, and video-based content has become a major trend in Internet development. Therefore, identification technology for classifying videos is important.
In a related video category identification method, the video data to be identified is input into a space-time convolutional neural network (3D-CNN) model, the feature map output by the last layer of the model is acquired, and the feature map is then input into a classification model to obtain a classification result.
However, the above method has a poor capability of capturing subtle motion changes in video data, which leads to low accuracy of the classification result.
Disclosure of Invention
The embodiments of the present invention provide a video category identification method and apparatus, a computer device, and a computer storage medium, which can solve the problem in the related art that the classification result has low accuracy due to a poor capability of capturing fine motion changes in video data. The technical solution is as follows:
According to a first aspect of the present invention, there is provided a video category identification method, the method comprising:
Acquiring video data to be identified;
inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features;
And obtaining the classification result of the video data to be identified output by the video classification model.
Optionally, before the obtaining of the video data to be identified, the method further includes:
Obtaining a model training sample set, wherein the model training sample set comprises a plurality of types of video sets, and each type of video set comprises a plurality of video data;
Optimizing the video classification model through the model training sample set;
stopping optimization when the video classification model converges.
Optionally, the optimizing the video classification model by the model training sample set includes:
and optimizing the video classification model by taking the model training sample set as training data according to a loss function and a gradient descent method.
Optionally, before the model training sample set is used as training data and the video classification model is optimized according to a loss function and a gradient descent method, the method further includes:
Performing data expansion on the model training sample set by a data enhancement method that dynamically and randomly adjusts the number of extracted frames and the frame extraction interval.
Optionally, the spatio-temporal convolutional neural network layer includes a formula:
O_j = f( Σ_{i=1}^{n_I} I_i * W_ij + b_j ), j = 1, 2, ..., n_O
O = {O_j | j = 1, 2, ..., n_O}
wherein I_i is the i-th space-time feature map of the input I of the space-time convolutional neural network layer; O is the output of the space-time convolutional neural network layer, and O_j is the j-th space-time feature map of O; W_ij is the convolution kernel connecting I_i and O_j; n_I is the number of space-time feature maps input to the space-time convolutional neural network layer, and n_O is the number of space-time feature maps output by the space-time convolutional neural network layer; b_j is the bias parameter of O_j; f(·) is an activation function; and * denotes the 3D convolution operation;
the spatiotemporal maximum pooling layer comprises the formula:
y_m^(i, j, t) = max{ O_m^(i + r_1, j + r_2, t + r_3) | 0 ≤ r_1 < k_1, 0 ≤ r_2 < k_2, 0 ≤ r_3 < k_3 }
Y = {y_m | m = 1, 2, ..., N}
wherein Y is the feature tensor output by the space-time maximum pooling layer; O_m^(i + r_1, j + r_2, t + r_3) is the feature value at the (i + r_1)-th frame, (j + r_2)-th row and (t + r_3)-th column of the m-th space-time feature map O_m of O; y_m^(i, j, t) is the feature value at the i-th frame, j-th row and t-th column of the m-th space-time feature map y_m of Y; p_1, p_2, p_3 are the dimensions of O_m; and k_1, k_2, k_3 are the dimensions of the pooling kernel of the space-time maximum pooling layer.
Optionally, Y is a feature tensor with dimensions N × W × H × D, W is the width of the space-time feature map, H is the height of the space-time feature map, D is the number of channels of the space-time feature map, and the 3D-VLAD model is configured to:
converting the Y into a feature map M with dimension L multiplied by D, and converting the feature map M into a feature matrix G with dimension K multiplied by D through a conversion formula, wherein the conversion formula comprises the following steps:
Z = M·W + B
A = softmax(Z)
G = Aᵀ·M − sum(A,1)ᵀ ⊙ Q
wherein W and B are the parameters of a fully-connected layer with K output neurons, and Z represents the output of the fully-connected layer; softmax(·) is the normalized exponential function, and A is the output of the normalized exponential function; sum(·,1) denotes column-wise summation of a matrix; ⊙ denotes the element-wise product between matrices; Aᵀ is the transpose of the matrix A; and Q is a cluster center matrix parameter with dimension K × D;
transforming the feature matrix G into a feature vector with the length of K.D;
Normalizing the feature vector of length K·D through an L2-norm normalization layer and a fully-connected layer to obtain the space-time local aggregation description feature.
Splicing the plurality of space-time local aggregation description features v obtained by passing the last specified number of space-time maximum pooling layers through 3D-VLAD layers, to form a space-time local aggregation description fusion feature vector V = [v_1, v_2, ..., v_n].
Optionally, the classification recognition model is configured to:
Sequentially passing the space-time local aggregation description fusion characteristic vector V through three full-connection layers, wherein the number of neurons of the last full-connection layer in the three full-connection layers is C, and C is the number of video categories in the model training sample set;
Determining the classification result by using the output value of the last full connection layer and a probability formula, wherein the probability formula comprises:
p(o_t) = e^(o_t) / Σ_{k=1}^{C} e^(o_k)
wherein p(o_t) is the probability value that the video data to be identified belongs to the t-th class; o_t denotes the t-th output value of the last fully-connected layer, and o_k denotes the k-th output value of the last fully-connected layer; and e denotes the natural constant.
In another aspect, a video category identification apparatus is provided, the apparatus comprising:
The data acquisition module is used for acquiring video data to be identified;
The data processing module is used for inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features;
And the result acquisition module is used for acquiring the classification result of the video data to be identified output by the video classification model.
In one aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the video category identification method described above.
In one aspect, a computer storage medium is provided, which stores instructions that, when executed on a computer, cause the computer to perform the above-mentioned video category identification method.
The technical solutions provided by the embodiments of the present invention have at least the following beneficial effects:
Video data to be identified is acquired and input into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features; the classification result of the video data to be identified output by the video classification model is then obtained. Because the space-time local aggregation description features of a plurality of space-time feature maps are input into the classification recognition model, a more fine-grained classification result is obtained. This solves the problem in the prior art that the capability of capturing fine motion changes in video data is poor, which leads to low accuracy of the classification result, and achieves the effect of improving accuracy.
drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained by those of ordinary skill in the art from these drawings without creative effort.
Fig. 1 is a flowchart of a video category identification method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another video category identification method provided by the embodiment of the invention;
FIG. 3 is a schematic structural diagram of a video classification model according to an embodiment of the present invention;
FIG. 4 is a schematic model diagram of the Block in FIG. 3;
Fig. 5 is a schematic structural diagram of a video category identification device according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention.
The above drawings illustrate certain embodiments of the present invention, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concept in any way, but to explain the concept of the invention to those skilled in the art by reference to specific embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
In the prior art, identification techniques for classifying videos generally include the following.
The identification method based on a two-stream 2D convolutional neural network: two independent models, a spatial 2D convolutional neural network based on RGB (red, green and blue) images and a temporal 2D convolutional neural network based on optical flow maps, are trained separately, and the outputs of the two convolutional neural network models are fused to obtain the final recognition result. However, extracting the optical flow in this method consumes a large amount of computing power and time; the two streams are trained separately and independently, with no interaction of spatio-temporal information during training, so spatial and temporal features cannot be fused well; and because the spatial network uses a single key RGB frame from a video clip, long-range temporal context cannot be modeled.
The identification method based on a long short-term memory (LSTM) network: a trained 2D convolutional neural network (CNN) first extracts spatial features from the video sequence frames, and an LSTM network then performs contextual feature extraction and modeling on the extracted spatial features along the time sequence. However, feature extraction in this method is completed in two stages without end-to-end joint training, and the method performs poorly at extracting short-term, fine-grained temporal relations.
The identification method based on a space-time 3D convolutional neural network: the video data to be identified is input into a space-time convolutional neural network model, the feature map output by the last layer of the model is acquired, and the feature map is input into a classification model to obtain a classification result. However, in this method only the feature map of the last layer is input into the classification model, and a large amount of space-time semantic detail is lost in that feature map due to the pooling operations, so the model has a weak capability of capturing fine motion changes.
embodiments of the present invention provide a video category identification method and apparatus, a computer device, and a computer storage medium, which can solve the problems in the related art.
fig. 1 is a flowchart illustrating a video category identification method according to an embodiment of the present invention, where the video category identification method may include the following steps:
Step 101, video data to be identified is obtained.
Step 102, inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features.
Step 103, obtaining the classification result of the video data to be identified output by the video classification model.
In summary, with the video category identification method provided by the embodiment of the present invention, video data to be identified is acquired and input into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features; the classification result of the video data to be identified output by the video classification model is then obtained. Because the space-time local aggregation description features of a plurality of space-time feature maps are input into the classification recognition model, a more fine-grained classification result is obtained. This solves the problem in the prior art that the capability of capturing fine motion changes in video data is poor, which leads to low accuracy of the classification result, and achieves the effect of improving accuracy.
Referring to fig. 2, a flowchart of another video category identification method provided by an embodiment of the present invention is shown, where the video category identification method may include the following steps:
Step 201, obtaining a model training sample set, where the model training sample set includes a plurality of types of video sets, and each type of video set includes a plurality of video data.
Video clip data sets of C different categories are collected, and then n frames of RGB images are sampled from each video clip in the data sets at intervals of 0.1 second to form a time-series frame sample x_i = {x_i^1, x_i^2, ..., x_i^k, ..., x_i^n},
wherein x_i^k represents the RGB image of the k-th frame of the sample x_i.
The samples x_i form a sample set X = {x_1, x_2, ..., x_i, ..., x_N}, and R = {r_1, r_2, ..., r_i, ..., r_N} records the category of each sample x_i in the sample set X, wherein r_i is a C-dimensional One-Hot (one-bit effective) encoded vector.
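As an illustration of step 201 only (the patent does not specify an implementation), the following Python sketch assembles such a sample set. The directory layout (one sub-directory per category) and the frame-decoding helper read_frames_at_interval are assumptions of the example; the 0.1-second sampling interval, the n frames per sample and the C-dimensional one-hot labels follow the description above.

import os
import numpy as np

def one_hot(label_index, num_classes):
    # Build the C-dimensional one-hot label vector r_i for one sample.
    r = np.zeros(num_classes, dtype=np.float32)
    r[label_index] = 1.0
    return r

def build_sample_set(dataset_root, num_classes, n_frames, read_frames_at_interval):
    # dataset_root/<category>/<clip> layout and the read_frames_at_interval(path,
    # interval_s, n) frame-decoding helper are assumptions of this sketch.
    X, R = [], []
    categories = sorted(os.listdir(dataset_root))[:num_classes]
    for label_index, category in enumerate(categories):
        category_dir = os.path.join(dataset_root, category)
        for clip_name in sorted(os.listdir(category_dir)):
            clip_path = os.path.join(category_dir, clip_name)
            frames = read_frames_at_interval(clip_path, interval_s=0.1, n=n_frames)
            X.append(np.stack(frames))                    # x_i: n RGB frames
            R.append(one_hot(label_index, num_classes))   # r_i: one-hot label
    return X, R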
Step 202, constructing a model architecture of the video classification model.
Fig. 3 is a schematic structural diagram of the video classification model. The space-time convolutional neural network model 32 comprises a plurality of space-time convolutional neural network layers 321 (for example, the receptive field may be 7x7x7), a plurality of space-time maximum pooling layers 322 (for example, the receptive field may be 1x3x3, 3x3x3 or 1x2x2), and a plurality of Blocks 323 (a Block is a processing unit in the model and may comprise space-time convolutional neural network layers and space-time maximum pooling layers); the 3D-VLAD model 33 includes a plurality of 3D-VLAD layers 331, 332, 333 and 334 and a space-time local aggregation description fusion feature 335; the classification recognition model 34 includes three fully-connected layers 341, 342 and 343, and illustratively, the number of output neurons of the fully-connected layers 341 and 342 may be 1024, while the number of output neurons of the fully-connected layer 343 is C, where C represents the number of video categories. Reference numeral 31 denotes the data input to the space-time convolutional neural network layer 321.
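For illustration only (the patent does not specify an implementation framework), a minimal PyTorch sketch of how the structure of FIG. 3 could be wired together is given below. Block and VLAD3D refer to the modules sketched under steps 1) and 2) below; the channel widths, the number of Blocks and the strides are assumptions of the example, while the 7x7x7 stem, the 1x3x3 pooling, the 3D-VLAD branches on the last pooled feature maps and the 1024/1024/C fully-connected head follow the description of FIG. 3.

import torch
import torch.nn as nn

class VideoClassificationModel(nn.Module):
    # FIG. 3: a space-time CNN backbone whose last n max-pooling outputs feed
    # separate 3D-VLAD layers; their outputs are concatenated into the fusion
    # feature V and passed through three fully-connected layers.
    def __init__(self, num_classes, vlad_dim=512):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=7, stride=(1, 2, 2), padding=3),   # 7x7x7 receptive field
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        block_channels = [(64, 128), (128, 256), (256, 512), (512, 512)]     # placeholder widths
        self.blocks = nn.ModuleList([Block(c_in, c_out) for c_in, c_out in block_channels])
        self.vlad_layers = nn.ModuleList([VLAD3D(c_out, out_dim=vlad_dim)
                                          for _, c_out in block_channels])
        self.classifier = nn.Sequential(
            nn.Linear(vlad_dim * len(block_channels), 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):                              # x: (batch, 3, frames, H, W)
        x = self.stem(x)
        pooled_maps = []
        for block in self.blocks:
            x = block(x)
            pooled_maps.append(x)                      # output of each Block's max-pooling layer
        n = len(self.vlad_layers)                      # the last n pooled feature maps are used
        vlad_features = [vlad(m) for vlad, m in zip(self.vlad_layers, pooled_maps[-n:])]
        fused = torch.cat(vlad_features, dim=1)        # fusion feature V = [v_1, ..., v_n]
        return self.classifier(fused)                  # raw output values o_1, ..., o_C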
The process of building a video classification model at step 202 may include:
1) Constructing the space-time convolutional neural network model. The video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are sequentially connected.
The space-time convolutional neural network is a network model formed by stacking and combining a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers; it can capture motion information in a video sequence and extract spatial and temporal 3D features. FIG. 4 is a schematic model diagram of the Block in FIG. 3, which includes a space-time convolutional neural network layer 321 (illustratively, the receptive field may be 1x1x1 or 3x3x3) and a space-time maximum pooling layer 322 (illustratively, the receptive field may be 3x3x3). The space-time convolutional neural network layer comprises the formula:
O_j = f( Σ_{i=1}^{n_I} I_i * W_ij + b_j ), j = 1, 2, ..., n_O
O = {O_j | j = 1, 2, ..., n_O}
wherein I_i is the i-th space-time feature map of the input I of the space-time convolutional neural network layer; O is the output of the space-time convolutional neural network layer, and O_j is the j-th space-time feature map of O; W_ij is the convolution kernel connecting I_i and O_j; n_I is the number of space-time feature maps input to the space-time convolutional neural network layer, and n_O is the number of space-time feature maps output by the space-time convolutional neural network layer; b_j is the bias parameter of O_j; f(·) is an activation function; and * represents the 3D convolution operation.
The spatiotemporal maximum pooling layer comprises the formula:
y_m^(i, j, t) = max{ O_m^(i + r_1, j + r_2, t + r_3) | 0 ≤ r_1 < k_1, 0 ≤ r_2 < k_2, 0 ≤ r_3 < k_3 }
Y = {y_m | m = 1, 2, ..., N}
wherein Y is the feature tensor output by the space-time maximum pooling layer; O_m^(i + r_1, j + r_2, t + r_3) is the feature value at the (i + r_1)-th frame, (j + r_2)-th row and (t + r_3)-th column of the m-th space-time feature map O_m of O; y_m^(i, j, t) is the feature value at the i-th frame, j-th row and t-th column of the m-th space-time feature map y_m of Y; p_1, p_2, p_3 are the dimensions of O_m; and k_1, k_2, k_3 are the dimensions of the pooling kernel of the space-time maximum pooling layer.
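Purely as an illustration (no implementation framework is specified in the patent), the two formulas above correspond to a standard 3D convolution followed by 3D max pooling, so a Block of FIG. 4 could be sketched in PyTorch as follows; the 1x1x1/3x3x3 convolution receptive fields and the 3x3x3 pooling kernel follow the description above, while the ReLU activation, the channel widths and the stride of 2 are assumptions of the example.

import torch.nn as nn

class Block(nn.Module):
    # FIG. 4: space-time convolutional layers (1x1x1 and 3x3x3 receptive fields)
    # followed by a space-time maximum pooling layer (3x3x3 kernel).
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size=1)              # 1x1x1
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1)  # 3x3x3
        self.relu = nn.ReLU(inplace=True)                                             # activation f(.)
        self.pool = nn.MaxPool3d(kernel_size=3, stride=2, padding=1)                  # 3x3x3 pooling

    def forward(self, x):                  # x: (batch, n_I, frames, H, W)
        x = self.relu(self.conv1(x))       # O_j = f( sum_i I_i * W_ij + b_j )
        x = self.relu(self.conv2(x))
        return self.pool(x)                # y_m: max over each k1 x k2 x k3 window of O_m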
2) Constructing the 3D-VLAD model.
The 3D-VLAD model takes the three-dimensional feature maps of different sizes output by the last n maximum pooling layers in the space-time convolutional neural network model as the respective inputs of the 3D-VLAD layers, and extracts space-time local aggregation description features at n different scales; each space-time local aggregation description feature v_i has a dimension of 512, and these features v_i are spliced into a space-time local aggregation description fusion feature V = [v_1, v_2, ..., v_i, ..., v_n] with length 512·n. By extracting space-time local aggregation description features from the feature tensor Y output by a maximum pooling layer, the 3D-VLAD layer can capture statistical information about how the local features of the three-dimensional feature map aggregate over the global video time sequence, and is a method for expressing aggregated local features as global features.
Y is a feature tensor with dimension N × W × H × D, where W is the width of the space-time feature map, H is the height of the space-time feature map, and D is the number of channels of the space-time feature map. The 3D-VLAD model is used for:
converting Y into a feature map M with dimension L × D (where L = N × W × H), and converting the feature map M into a feature matrix G with dimension K × D through a conversion formula, wherein the conversion formula comprises:
Z = M·W + B
A = softmax(Z)
G = Aᵀ·M − sum(A,1)ᵀ ⊙ Q
wherein W and B are the parameters of a fully-connected layer with K output neurons, and Z represents the output of the fully-connected layer; softmax(·) is the normalized exponential function, and A is the output of the normalized exponential function; sum(·,1) denotes column-wise summation of a matrix; ⊙ denotes the element-wise product between matrices; Aᵀ is the transpose of the matrix A; and Q is a cluster center matrix parameter with dimension K × D.
The feature matrix G is transformed into a feature vector of length K·D.
The feature vector of length K·D is passed through an L2-norm normalization layer and a fully-connected layer with 512 output neurons to obtain the space-time local aggregation description feature v_i. The plurality of space-time local aggregation description features obtained by passing the last specified number of space-time maximum pooling layers through the 3D-VLAD layers are spliced to form the fusion feature vector V = [v_1, v_2, ..., v_n].
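As an illustrative sketch of the 3D-VLAD layer just described, a minimal PyTorch module follows. The residual form G = Aᵀ·M − sum(A,1)ᵀ ⊙ Q matches the conversion formula above under the standard VLAD soft-assignment reading, while the framework choice and the default number of clusters K are assumptions of the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VLAD3D(nn.Module):
    # 3D-VLAD layer: reshape the pooled feature tensor Y (N x W x H x D) into
    # M (L x D), soft-assign each local feature to K cluster centres, aggregate
    # into G (K x D), flatten, L2-normalize and project to a 512-d feature v_i.
    def __init__(self, in_channels, num_clusters=64, out_dim=512):
        super().__init__()
        self.assign = nn.Linear(in_channels, num_clusters)                   # Z = M.W + B
        self.centers = nn.Parameter(torch.randn(num_clusters, in_channels))  # Q (K x D)
        self.proj = nn.Linear(num_clusters * in_channels, out_dim)           # FC with 512 outputs

    def forward(self, y):                           # y: (batch, D, N, H, W) in PyTorch layout
        m = y.flatten(2).transpose(1, 2)            # M: (batch, L, D), L = N*H*W
        a = F.softmax(self.assign(m), dim=-1)       # A = softmax(Z), shape (batch, L, K)
        # G = A^T . M - sum(A,1)^T * Q  (soft-assignment VLAD aggregation, broadcast over clusters)
        g = a.transpose(1, 2) @ m - a.sum(dim=1).unsqueeze(-1) * self.centers
        v = F.normalize(g.flatten(1), dim=1)        # L2 normalization of the K*D vector
        return self.proj(v)                         # space-time local aggregation feature v_i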
3) Constructing the classification recognition model.
The classification recognition model is used for sequentially passing the space-time local aggregation description fusion feature V through three fully-connected layers, wherein the number of neurons of the last fully-connected layer among the three fully-connected layers is C, and C is the number of video categories in the model training sample set.
Determining a classification result according to an output value of the last full-connection layer and a probability formula, wherein the probability formula comprises the following steps:
p(o_t) = e^(o_t) / Σ_{k=1}^{C} e^(o_k)
wherein p(o_t) is the probability value that the video data to be identified belongs to the t-th class; o_t denotes the t-th output value of the last fully-connected layer, and o_k denotes the k-th output value of the last fully-connected layer; e denotes the natural constant. p(o_t) gives the probability value that the video data to be identified belongs to each of the C categories of videos. Illustratively, if the probability that the video data to be identified belongs to the animation category is 85%, the probability that it belongs to the funny-video category is 10%, and the probability that it belongs to the literature-and-art category is 5%, the highest probability value may be used as the decision criterion, in which case the video data to be identified belongs to the animation category. Other decision criteria may also be used, which is not limited in the embodiments of the present invention.
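As an illustration of step 3) only, a minimal PyTorch sketch of the classification recognition model follows; the three fully-connected layers, the C output neurons and the softmax probability p(o_t) follow the description, while the 1024-neuron widths come from the example given for FIG. 3 and the ReLU activations are an assumption of the example.

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    # Three fully-connected layers; the last one has C output neurons o_1..o_C,
    # and softmax turns them into p(o_t) = e^{o_t} / sum_k e^{o_k}.
    def __init__(self, fused_dim, num_classes):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(fused_dim, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes),
        )

    def forward(self, fused_v):                 # fused_v: the fusion feature V, (batch, 512*n)
        return self.fc(fused_v)                 # raw output values o_1, ..., o_C

    def predict_proba(self, fused_v):
        # Probability for each of the C video categories; the highest value gives
        # the predicted category, as in the animation-video example above.
        return torch.softmax(self.forward(fused_v), dim=1)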
Step 203, performing data expansion on the model training sample set by a data enhancement method that dynamically and randomly adjusts the number of extracted frames and the frame extraction interval.
When training the network model, for each batch of samples input into the network, a number μ_1 is randomly selected from [1, 2, 4] as the frame extraction interval, and a number μ_2 is randomly selected from [4, 8, 16] as the number of frames to extract; then one frame is taken from the time-series frame sample x_i every μ_1 frames until μ_2 frames are extracted, and these frames are used to represent the time-series frame sample x_i. This achieves the purpose of improving the identification effect.
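Purely as an illustration of this dynamic, random frame-sampling strategy, the following Python sketch draws μ_1 from [1, 2, 4] and μ_2 from [4, 8, 16] once per batch and resamples every sample accordingly; padding by repeating the last frame when a clip is too short is an assumption of the sketch.

import random
import numpy as np

def resample_clip(frames, mu1, mu2):
    # frames: numpy array of shape (num_frames, H, W, 3).
    # Take one frame every mu1 frames until mu2 frames are collected.
    picked = frames[::mu1][:mu2]
    # Assumption: if the clip is too short, repeat the last frame to reach mu2.
    while len(picked) < mu2:
        picked = np.concatenate([picked, picked[-1:]], axis=0)
    return picked

def augment_batch(batch_of_clips):
    # For each batch, randomly choose the frame extraction interval mu1 and the
    # number of extracted frames mu2, then resample every sample x_i in the batch.
    mu1 = random.choice([1, 2, 4])
    mu2 = random.choice([4, 8, 16])
    return [resample_clip(clip, mu1, mu2) for clip in batch_of_clips]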
Step 204, optimizing the video classification model by taking the model training sample set as training data, according to a loss function and a gradient descent method.
The loss function describes the loss of the system under different parameter values; when the loss function is minimized, the degree of fitting is best, and the corresponding model parameters are the optimal parameters. The minimization of the loss function can be solved iteratively, step by step, through a gradient descent method, yielding the minimized loss function and the corresponding model parameter values. For details of this method, reference may be made to the related art, which is not described in detail in the embodiments of the present invention.
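Purely as an illustration of step 204, a minimal PyTorch training loop is sketched below; cross-entropy as the concrete loss function, SGD as the gradient-descent optimizer, the learning rate and the number of epochs are assumptions not fixed by the description, and the model is assumed to output the raw values o_t (softmax is applied only when reading out probabilities).

import torch
import torch.nn as nn

def train(model, data_loader, num_epochs=30, lr=0.01):
    # Optimize the video classification model with a loss function and gradient descent.
    criterion = nn.CrossEntropyLoss()                                    # assumed loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for epoch in range(num_epochs):
        for clips, labels in data_loader:    # clips: (batch, 3, frames, H, W); labels: class indices
            optimizer.zero_grad()
            outputs = model(clips)           # raw outputs o_1, ..., o_C
            loss = criterion(outputs, labels)  # loss under the current parameter values
            loss.backward()                  # gradients for the gradient descent step
            optimizer.step()                 # iterative parameter update
    # In practice, optimization stops once the video classification model converges (step 205).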
Step 205, stopping the optimization when the video classification model converges.
End-to-end model optimization is performed on the whole model until the video classification model converges, at which point the optimization stops.
The trained and optimized video classification model is then tested and verified with test samples, and the video classification model is trained again according to the test results, so as to improve the identification effect. The test samples may be preset video data to be identified.
The steps 201 to 205 are processes of constructing a model, and the subsequent steps 206 to 208 are processes of applying the constructed video classification model.
Step 206, acquiring the video data to be identified.
The video data to be identified is acquired and passed to step 207. The video data to be identified here is different from the video data in step 201: it is arbitrary video data that the user wants to classify.
Step 207, inputting the video data to be identified into the video classification model; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features.
The video data to be identified is input into the model constructed in step 202, where the video classification model includes a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model includes a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence. The model is used to process the video data to be identified.
Step 208, obtaining the classification result of the video data to be identified output by the video classification model.
The classification result of the video data to be identified may include a probability value that the video data to be identified belongs to each of the C categories of videos. The classification result obtained from the video classification model takes into account the value of every dimension of the space-time feature points, so that local video information is depicted more finely and the video classification model has a good capability of capturing fine motion changes. By extracting 3D-VLAD features from the feature maps of the last n layers, multi-scale feature fusion is achieved, which can effectively improve the recognition rate. The data enhancement method that dynamically and randomly adjusts the number of extracted frames and the frame extraction interval adapts to the inconsistent pace of motion changes in videos and the inconsistent span of video content, thereby enhancing the generalization capability of the model. Combined with training and tuning, the method improves the identification accuracy and realizes end-to-end video identification.
In summary, with the video category identification method provided by the embodiment of the present invention, video data to be identified is acquired and input into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features; the classification result of the video data to be identified output by the video classification model is then obtained. Because the space-time local aggregation description features of a plurality of space-time feature maps are input into the classification recognition model, a more fine-grained classification result is obtained. This solves the problem in the prior art that the capability of capturing fine motion changes in video data is poor, which leads to low accuracy of the classification result, and achieves the effect of improving accuracy.
Fig. 5 is a schematic structural diagram of a video category identification device according to an embodiment of the present application, and as shown in fig. 5, the video category identification device 500 includes:
And the data obtaining module 501 is configured to obtain video data to be identified.
The data processing module 502 is configured to input the video data to be identified into a video classification model, where the video classification model includes a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model includes a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is configured to take the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is configured to obtain a classification result according to the space-time local aggregation description features.
And the result obtaining module 503 is configured to obtain a classification result of the to-be-identified video data output by the video classification model.
In summary, with the video category identification apparatus provided by the embodiment of the present invention, video data to be identified is acquired and input into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features; the classification result of the video data to be identified output by the video classification model is then obtained. Because the space-time local aggregation description features of a plurality of space-time feature maps are input into the classification recognition model, a more fine-grained classification result is obtained. This solves the problem in the prior art that the capability of capturing fine motion changes in video data is poor, which leads to low accuracy of the classification result, and achieves the effect of improving accuracy.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention. Specifically, the method comprises the following steps:
The server 600 includes a Central Processing Unit (CPU) 601, a system Memory 604 including a Random Access Memory (RAM) 602 and a Read Only Memory (ROM) 603, and a system bus 605 connecting the system Memory 604 and the Central Processing Unit 601. The server 600 also includes a basic input/output system (I/O system) 606, which facilitates the transfer of information between devices within the computer, and a mass storage device 607, which stores an operating system 613, application programs 614, and other program modules 615.
The basic input/output system 606 includes a display 608 for displaying information and an input device 609 such as a mouse, keyboard, etc. for user input of information. Wherein a display 608 and an input device 609 are connected to the central processing unit 601 through an input output controller 610 connected to the system bus 605. The basic input/output system 606 may also include an input/output controller 610 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input/output controller 610 may also provide output to a display screen, a printer, or other type of output device.
the mass storage device 607 is connected to the central processing unit 601 through a mass storage controller (not shown) connected to the system bus 605. The mass storage device 607 and its associated computer-readable media provide non-volatile storage for the server 600. That is, mass storage device 607 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM (compact disk Read-Only Memory) drive.
computer-readable media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Versatile disk), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 604 and mass storage device 607 described above may be collectively referred to as memory.
According to various embodiments of the present invention, the server 600 may also be operated by means of a remote computer connected through a network such as the Internet. That is, the server 600 may be connected to the network 612 through the network interface unit 611 connected to the system bus 605, or may be connected to another type of network or a remote computer system (not shown) using the network interface unit 611.
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
The present application further provides a computer device comprising a processor and a memory, wherein at least one instruction, at least one program, code set, or instruction set is stored in the memory, and the at least one instruction, the at least one program, code set, or instruction set is loaded and executed by the processor to implement the video category identification method described above.
The present application further provides a computer-readable storage medium storing instructions that, when executed by the video category identification apparatus, cause the video category identification apparatus to implement the video category identification method provided in the foregoing embodiments.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for video category identification, the method comprising:
Acquiring video data to be identified;
Inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features;
and obtaining the classification result of the video data to be identified output by the video classification model.
2. The method of claim 1, wherein prior to obtaining the video data to be identified, the method further comprises:
Obtaining a model training sample set, wherein the model training sample set comprises a plurality of types of video sets, and each type of video set comprises a plurality of video data;
Optimizing the video classification model through the model training sample set;
Stopping optimization when the video classification model converges.
3. The method of claim 2, wherein the optimizing the video classification model by the model training sample set comprises:
and optimizing the video classification model by taking the model training sample set as training data according to a loss function and a gradient descent method.
4. The method of claim 3, wherein before optimizing the video classification model according to a loss function and a gradient descent method using the model training sample set as training data, the method further comprises:
Performing data expansion on the model training sample set by a data enhancement method that dynamically and randomly adjusts the number of extracted frames and the frame extraction interval.
5. The method of claim 1, wherein the spatio-temporal convolutional neural network layer comprises the formula:
O_j = f( Σ_{i=1}^{n_I} I_i * W_ij + b_j ), j = 1, 2, ..., n_O
O = {O_j | j = 1, 2, ..., n_O}
wherein I_i is the i-th space-time feature map of the input I of the space-time convolutional neural network layer; O is the output of the space-time convolutional neural network layer, and O_j is the j-th space-time feature map of O; W_ij is the convolution kernel connecting I_i and O_j; n_I is the number of space-time feature maps input to the space-time convolutional neural network layer, and n_O is the number of space-time feature maps output by the space-time convolutional neural network layer; b_j is the bias parameter of O_j; f(·) is an activation function; and * denotes the 3D convolution operation;
the spatiotemporal maximum pooling layer comprises the formula:
y_m^(i, j, t) = max{ O_m^(i + r_1, j + r_2, t + r_3) | 0 ≤ r_1 < k_1, 0 ≤ r_2 < k_2, 0 ≤ r_3 < k_3 }
Y = {y_m | m = 1, 2, ..., N}
wherein Y is the feature tensor output by the space-time maximum pooling layer; O_m^(i + r_1, j + r_2, t + r_3) is the feature value at the (i + r_1)-th frame, (j + r_2)-th row and (t + r_3)-th column of the m-th space-time feature map O_m of O; y_m^(i, j, t) is the feature value at the i-th frame, j-th row and t-th column of the m-th space-time feature map y_m of Y; p_1, p_2, p_3 are the dimensions of O_m; and k_1, k_2, k_3 are the dimensions of the pooling kernel of the space-time maximum pooling layer.
6. The method of claim 5, wherein Y is a feature tensor with dimensions N × W × H × D, W is the width of the space-time feature map, H is the height of the space-time feature map, and D is the number of channels of the space-time feature map, and the 3D-VLAD model is configured to:
Converting the Y into a feature map M with dimension L multiplied by D, and converting the feature map M into a feature matrix G with dimension K multiplied by D through a conversion formula, wherein the conversion formula comprises the following steps:
Z = M·W + B
A = softmax(Z)
G = Aᵀ·M − sum(A,1)ᵀ ⊙ Q
wherein W and B are the parameters of a fully-connected layer with K output neurons, and Z represents the output of the fully-connected layer; softmax(·) is the normalized exponential function, and A is the output of the normalized exponential function; sum(·,1) denotes column-wise summation of a matrix; ⊙ denotes the element-wise product between matrices; Aᵀ is the transpose of the matrix A; and Q is a cluster center matrix parameter with dimension K × D;
Transforming the feature matrix G into a feature vector with the length of K.D;
Passing the feature vector of length K·D through an L2-norm normalization layer and a fully-connected layer to obtain the space-time local aggregation description feature v;
Splicing the plurality of space-time local aggregation description features v obtained by passing the last specified number of space-time maximum pooling layers through 3D-VLAD layers to form a fusion feature vector V = [v_1, v_2, ..., v_n].
7. the method of claim 6, wherein the classification recognition model is used to:
Sequentially passing the space-time local aggregation description fusion characteristic vector V through three full-connection layers, wherein the number of neurons of the last full-connection layer in the three full-connection layers is C, and C is the number of video categories in the model training sample set;
Determining the classification result by using the output value of the last full connection layer and a probability formula, wherein the probability formula comprises:
p(o_t) = e^(o_t) / Σ_{k=1}^{C} e^(o_k)
wherein p(o_t) is the probability value that the video data to be identified belongs to the t-th class; o_t denotes the t-th output value of the last fully-connected layer, and o_k denotes the k-th output value of the last fully-connected layer; and e denotes the natural constant.
8. An apparatus for video category identification, the apparatus comprising:
The data acquisition module is used for acquiring video data to be identified;
The data processing module is used for inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features;
and the result acquisition module is used for acquiring the classification result of the video data to be identified output by the video classification model.
9. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the video category identification method according to any one of claims 1 to 7.
10. A computer storage medium having stored therein instructions that, when run on a computer, cause the computer to perform the video category identification method of any one of claims 1 to 7.
CN201910862697.6A 2019-09-12 2019-09-12 Video category identification method, device, computer equipment and computer storage medium Active CN110569814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910862697.6A CN110569814B (en) 2019-09-12 2019-09-12 Video category identification method, device, computer equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910862697.6A CN110569814B (en) 2019-09-12 2019-09-12 Video category identification method, device, computer equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN110569814A (en) 2019-12-13
CN110569814B (en) 2023-10-13

Family

ID=68779708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910862697.6A Active CN110569814B (en) 2019-09-12 2019-09-12 Video category identification method, device, computer equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN110569814B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179246A (en) * 2019-12-27 2020-05-19 中国科学院上海微系统与信息技术研究所 Pixel displacement confirming method and device, electronic equipment and storage medium
CN111400551A (en) * 2020-03-13 2020-07-10 咪咕文化科技有限公司 Video classification method, electronic equipment and storage medium
CN111539289A (en) * 2020-04-16 2020-08-14 咪咕文化科技有限公司 Method and device for identifying action in video, electronic equipment and storage medium
CN111598026A (en) * 2020-05-20 2020-08-28 广州市百果园信息技术有限公司 Action recognition method, device, equipment and storage medium
CN112149736A (en) * 2020-09-22 2020-12-29 腾讯科技(深圳)有限公司 Data processing method, device, server and medium
CN113255616A (en) * 2021-07-07 2021-08-13 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN114004645A (en) * 2021-10-29 2022-02-01 浙江省民营经济发展中心(浙江省广告监测中心) Converged media advertisement intelligent monitoring platform and electronic equipment
CN116524240A (en) * 2023-03-30 2023-08-01 国网智能电网研究院有限公司 Electric power operation scene violation behavior identification model, method, device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015030606A2 (en) * 2013-08-26 2015-03-05 Auckland University Of Technology Improved method and system for predicting outcomes based on spatio / spectro-temporal data
CN106845329A (en) * 2016-11-11 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 Action recognition method based on deep convolution feature multi-channel pyramid pooling
US20180053057A1 (en) * 2016-08-18 2018-02-22 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
WO2019013711A1 (en) * 2017-07-12 2019-01-17 Mastercard Asia/Pacific Pte. Ltd. Mobile device platform for automated visual retail product recognition
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition method based on local feature aggregation descriptors and temporal relation network
CN110188653A (en) * 2019-05-27 2019-08-30 东南大学 Activity recognition method based on local feature aggregation coding and long short-term memory network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015030606A2 (en) * 2013-08-26 2015-03-05 Auckland University Of Technology Improved method and system for predicting outcomes based on spatio / spectro-temporal data
US20180053057A1 (en) * 2016-08-18 2018-02-22 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
CN106845329A (en) * 2016-11-11 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 Action recognition method based on deep convolution feature multi-channel pyramid pooling
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
WO2019013711A1 (en) * 2017-07-12 2019-01-17 Mastercard Asia/Pacific Pte. Ltd. Mobile device platform for automated visual retail product recognition
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition method based on local feature aggregation descriptors and temporal relation network
CN110188653A (en) * 2019-05-27 2019-08-30 东南大学 Activity recognition method based on local feature aggregation coding and long short-term memory network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DEBIAO ZHANG et al.: "Diabetic Retinopathy Classification using Deeply Supervised ResNet", 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation *
LUO Huilan et al.: "A Survey of Progress in Deep Learning-Based Human Action Recognition in Video", Acta Electronica Sinica (《电子学报》) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179246A (en) * 2019-12-27 2020-05-19 中国科学院上海微系统与信息技术研究所 Pixel displacement confirming method and device, electronic equipment and storage medium
CN111179246B (en) * 2019-12-27 2021-01-29 中国科学院上海微系统与信息技术研究所 Pixel displacement confirming method and device, electronic equipment and storage medium
CN111400551A (en) * 2020-03-13 2020-07-10 咪咕文化科技有限公司 Video classification method, electronic equipment and storage medium
CN111400551B (en) * 2020-03-13 2022-11-15 咪咕文化科技有限公司 Video classification method, electronic equipment and storage medium
CN111539289A (en) * 2020-04-16 2020-08-14 咪咕文化科技有限公司 Method and device for identifying action in video, electronic equipment and storage medium
CN111598026A (en) * 2020-05-20 2020-08-28 广州市百果园信息技术有限公司 Action recognition method, device, equipment and storage medium
CN112149736A (en) * 2020-09-22 2020-12-29 腾讯科技(深圳)有限公司 Data processing method, device, server and medium
CN112149736B (en) * 2020-09-22 2024-02-09 腾讯科技(深圳)有限公司 Data processing method, device, server and medium
CN113255616A (en) * 2021-07-07 2021-08-13 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN114004645A (en) * 2021-10-29 2022-02-01 浙江省民营经济发展中心(浙江省广告监测中心) Converged media advertisement intelligent monitoring platform and electronic equipment
CN116524240A (en) * 2023-03-30 2023-08-01 国网智能电网研究院有限公司 Electric power operation scene violation behavior identification model, method, device and storage medium

Also Published As

Publication number Publication date
CN110569814B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN110569814B (en) Video category identification method, device, computer equipment and computer storage medium
WO2021093468A1 (en) Video classification method and apparatus, model training method and apparatus, device and storage medium
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
Makantasis et al. Deep learning based human behavior recognition in industrial workflows
CN112580458B (en) Facial expression recognition method, device, equipment and storage medium
CN112749666B (en) Training and action recognition method of action recognition model and related device
CN114494981B (en) Action video classification method and system based on multi-level motion modeling
CN110457523B (en) Cover picture selection method, model training method, device and medium
WO2023206944A1 (en) Semantic segmentation method and apparatus, computer device, and storage medium
CN114612414B (en) Image processing method, model training method, device, equipment and storage medium
CN117197727B (en) Global space-time feature learning-based behavior detection method and system
US20230021551A1 (en) Using training images and scaled training images to train an image segmentation model
CN111523421A (en) Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
CN109492610B (en) Pedestrian re-identification method and device and readable storage medium
CN112786160A (en) Multi-image input multi-label gastroscope image classification method based on graph neural network
CN111898614B (en) Neural network system and image signal and data processing method
CN114511733A (en) Fine-grained image identification method and device based on weak supervised learning and readable medium
CN113761282A (en) Video duplicate checking method and device, electronic equipment and storage medium
CN112613442A (en) Video sequence emotion recognition method based on principle angle detection and optical flow conversion
CN115294441A (en) Robot scene recognition and analysis method integrating three characteristics by attention
CN114782995A (en) Human interaction behavior detection method based on self-attention mechanism
CN115131807A (en) Text processing method, text processing device, storage medium and equipment
CN115063831A (en) High-performance pedestrian retrieval and re-identification method and device
CN114387489A (en) Power equipment identification method and device and terminal equipment
CN113971830A (en) Face recognition method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant