CN110569814A - Video category identification method and device, computer equipment and computer storage medium - Google Patents

Video category identification method and device, computer equipment and computer storage medium Download PDF

Info

Publication number
CN110569814A
CN110569814A
Authority
CN
China
Prior art keywords
space
time
model
video
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910862697.6A
Other languages
Chinese (zh)
Other versions
CN110569814B (en)
Inventor
Xiao Dingkun (肖定坤)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN201910862697.6A priority Critical patent/CN110569814B/en
Publication of CN110569814A publication Critical patent/CN110569814A/en
Application granted granted Critical
Publication of CN110569814B publication Critical patent/CN110569814B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video category identification method and device, computer equipment and a computer storage medium, and belongs to the field of video identification. The method comprises the following steps: acquiring video data to be identified; and inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features. The problem in the prior art that the accuracy of the classification result is low is solved, and the effect of improving accuracy is achieved.

Description

Video category identification method and device, computer equipment and computer storage medium
Technical Field
The present invention relates to the field of video identification, and in particular, to a method and an apparatus for identifying a video category, a computer device, and a computer storage medium.
Background
At present, video big data is growing explosively, and video-based content has become a major trend in Internet development. Therefore, identification technology for classifying videos is important.
In a related video category identification method, the video data to be identified is input into a space-time convolutional neural network (3D-CNN) model, the feature map output by the last layer of the model is acquired, and the feature map is then input into a classification model to obtain a classification result.
However, the above method has a poor capability of capturing subtle motion changes in video data, which leads to low accuracy of the classification result.
Disclosure of Invention
The embodiments of the present invention provide a video category identification method and apparatus, a computer device, and a computer storage medium, which can solve the problem in the related art that the classification result has low accuracy due to a poor capability of capturing fine motion changes in video data. The technical solution is as follows:
According to a first aspect of the present invention, there is provided a video category identification method, the method comprising:
Acquiring video data to be identified;
inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features;
And obtaining the classification result of the video data to be identified output by the video classification model.
Optionally, before the obtaining of the video data to be identified, the method further includes:
Obtaining a model training sample set, wherein the model training sample set comprises a plurality of types of video sets, and each type of video set comprises a plurality of video data;
Optimizing the video classification model through the model training sample set;
stopping optimization when the video classification model converges.
Optionally, the optimizing the video classification model by the model training sample set includes:
and optimizing the video classification model by taking the model training sample set as training data according to a loss function and a gradient descent method.
Optionally, before the model training sample set is used as training data and the video classification model is optimized according to a loss function and a gradient descent method, the method further includes:
Performing data expansion on the model training sample set by a data enhancement method that dynamically and randomly adjusts the number of extracted frames and the frame extraction interval.
Optionally, the spatio-temporal convolutional neural network layer includes a formula:
O_j = f( Σ_{i=1}^{n_I} I_i * W_ij + b_j ), j = 1, 2, ..., n_O
O = {O_j | j = 1, 2, ..., n_O}
wherein I_i is the i-th space-time feature map of the input I of the space-time convolutional neural network layer; O is the output of the space-time convolutional neural network layer, and O_j is the j-th space-time feature map of O; W_ij is the convolution kernel connecting I_i and O_j; n_I is the number of space-time feature maps input to the space-time convolutional neural network layer, and n_O is the number of space-time feature maps output by the space-time convolutional neural network layer; b_j is the bias parameter of O_j; f(·) is an activation function; and * denotes the 3D convolution operation;
the spatiotemporal maximum pooling layer comprises the formula:
y_m^(i, j, t) = max{ O_m^(i + r_1, j + r_2, t + r_3) | 0 ≤ r_1 < k_1, 0 ≤ r_2 < k_2, 0 ≤ r_3 < k_3 }
Y = {y_m | m = 1, 2, ..., N}
wherein Y is the feature tensor output by the space-time maximum pooling layer; O_m^(i + r_1, j + r_2, t + r_3) is the feature value at the (i + r_1)-th frame, (j + r_2)-th row and (t + r_3)-th column of the m-th space-time feature map O_m of O; y_m^(i, j, t) is the feature value at the i-th frame, j-th row and t-th column of the m-th space-time feature map y_m of Y; p_1, p_2, p_3 are the dimensions of O_m; and k_1, k_2, k_3 are the dimensions of the pooling kernel of the space-time maximum pooling layer.
Optionally, Y is a feature tensor with dimensions N × W × H × D, W is the width of the space-time feature map, H is the height of the space-time feature map, D is the number of channels of the space-time feature map, and the 3D-VLAD model is configured to:
converting the Y into a feature map M with dimension L multiplied by D, and converting the feature map M into a feature matrix G with dimension K multiplied by D through a conversion formula, wherein the conversion formula comprises the following steps:
Z = M·W + B
A = softmax(Z)
G = Aᵀ·M − sum(A,1)ᵀ ⊙ Q
wherein W and B are the parameters of a fully-connected layer with K output neurons, and Z represents the output of the fully-connected layer; softmax(·) is the normalized exponential function, and A is the output of the normalized exponential function; sum(·,1) denotes column-wise summation of a matrix; ⊙ denotes the element-wise product between matrices; Aᵀ is the transpose of the matrix A; and Q is a cluster center matrix parameter with dimension K × D;
transforming the feature matrix G into a feature vector with the length of K.D;
Normalizing the feature vector of length K·D through an L2-norm normalization layer and a fully-connected layer to obtain the space-time local aggregation description feature.
Splicing the plurality of space-time local aggregation description features v obtained by passing the last specified number of space-time maximum pooling layers through 3D-VLAD layers, to form a space-time local aggregation description fusion feature vector V = [v_1, v_2, ..., v_n].
Optionally, the classification recognition model is configured to:
Sequentially passing the space-time local aggregation description fusion characteristic vector V through three full-connection layers, wherein the number of neurons of the last full-connection layer in the three full-connection layers is C, and C is the number of video categories in the model training sample set;
Determining the classification result by using the output value of the last full connection layer and a probability formula, wherein the probability formula comprises:
p(o_t) = e^(o_t) / Σ_{k=1}^{C} e^(o_k)
wherein p(o_t) is the probability value that the video data to be identified belongs to the t-th class; o_t denotes the t-th output value of the last fully-connected layer, and o_k denotes the k-th output value of the last fully-connected layer; and e denotes the natural constant.
In another aspect, a video category identification apparatus is provided, the apparatus comprising:
The data acquisition module is used for acquiring video data to be identified;
The data processing module is used for inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features;
And the result acquisition module is used for acquiring the classification result of the video data to be identified output by the video classification model.
In one aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the video category identification method described above.
In one aspect, a computer storage medium is provided, which stores instructions that, when executed on a computer, cause the computer to perform the above-mentioned video category identification method.
The technical solutions provided by the embodiments of the present invention have at least the following beneficial effects:
Video data to be identified is acquired and input into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features; the classification result of the video data to be identified output by the video classification model is then obtained. Because the space-time local aggregation description features of a plurality of space-time feature maps are input into the classification recognition model, a more fine-grained classification result is obtained. This solves the problem in the prior art that the capability of capturing fine motion changes in video data is poor, which leads to low accuracy of the classification result, and achieves the effect of improving accuracy.
drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained by those of ordinary skill in the art from these drawings without creative effort.
Fig. 1 is a flowchart of a video category identification method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another video category identification method provided by the embodiment of the invention;
FIG. 3 is a schematic structural diagram of a video classification model according to an embodiment of the present invention;
FIG. 4 is a schematic model diagram of the Block in FIG. 3;
Fig. 5 is a schematic structural diagram of a video category identification device according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention.
The above drawings illustrate certain embodiments of the present invention, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concept in any way, but to explain the concept of the invention to those skilled in the art by reference to specific embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
In the prior art, identification techniques for classifying videos generally include the following.
The identification method based on a two-stream 2D convolutional neural network: two independent models, a spatial 2D convolutional neural network based on RGB (red, green and blue) images and a temporal 2D convolutional neural network based on optical flow maps, are trained separately, and the outputs of the two convolutional neural network models are fused to obtain the final recognition result. However, extracting the optical flow in this method consumes a large amount of computing power and time; the two streams are trained separately and independently, with no interaction of spatio-temporal information during training, so spatial and temporal features cannot be fused well; and because the spatial network uses a single key RGB frame from a video clip, long-range temporal context cannot be modeled.
The identification method based on a long short-term memory (LSTM) network: a trained 2D convolutional neural network (CNN) first extracts spatial features from the video sequence frames, and an LSTM network then performs contextual feature extraction and modeling on the extracted spatial features along the time sequence. However, feature extraction in this method is completed in two stages without end-to-end joint training, and the method performs poorly at extracting short-term, fine-grained temporal relations.
The identification method based on a space-time 3D convolutional neural network: the video data to be identified is input into a space-time convolutional neural network model, the feature map output by the last layer of the model is acquired, and the feature map is input into a classification model to obtain a classification result. However, in this method only the feature map of the last layer is input into the classification model, and a large amount of space-time semantic detail is lost in that feature map due to the pooling operations, so the model has a weak capability of capturing fine motion changes.
embodiments of the present invention provide a video category identification method and apparatus, a computer device, and a computer storage medium, which can solve the problems in the related art.
fig. 1 is a flowchart illustrating a video category identification method according to an embodiment of the present invention, where the video category identification method may include the following steps:
Step 101, video data to be identified is obtained.
Step 102, inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features.
Step 103, obtaining the classification result of the video data to be identified output by the video classification model.
In summary, with the video category identification method provided by the embodiment of the present invention, video data to be identified is acquired and input into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features; the classification result of the video data to be identified output by the video classification model is then obtained. Because the space-time local aggregation description features of a plurality of space-time feature maps are input into the classification recognition model, a more fine-grained classification result is obtained. This solves the problem in the prior art that the capability of capturing fine motion changes in video data is poor, which leads to low accuracy of the classification result, and achieves the effect of improving accuracy.
Referring to fig. 2, a flowchart of another video category identification method provided by an embodiment of the present invention is shown, where the video category identification method may include the following steps:
Step 201, obtaining a model training sample set, where the model training sample set includes a plurality of types of video sets, and each type of video set includes a plurality of video data.
Video clip data sets of C different categories are collected, and then n frames of RGB images are sampled from each video clip in the data sets at intervals of 0.1 second to form a time-series frame sample x_i = {x_i^1, x_i^2, ..., x_i^k, ..., x_i^n},
wherein x_i^k represents the RGB image of the k-th frame of the sample x_i.
The samples x_i form a sample set X = {x_1, x_2, ..., x_i, ..., x_N}, and R = {r_1, r_2, ..., r_i, ..., r_N} records the category of each sample x_i in the sample set X, wherein r_i is a C-dimensional One-Hot (one-bit effective) encoded vector.
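As an illustration of step 201 only (the patent does not specify an implementation), the following Python sketch assembles such a sample set. The directory layout (one sub-directory per category) and the frame-decoding helper read_frames_at_interval are assumptions of the example; the 0.1-second sampling interval, the n frames per sample and the C-dimensional one-hot labels follow the description above.

import os
import numpy as np

def one_hot(label_index, num_classes):
    # Build the C-dimensional one-hot label vector r_i for one sample.
    r = np.zeros(num_classes, dtype=np.float32)
    r[label_index] = 1.0
    return r

def build_sample_set(dataset_root, num_classes, n_frames, read_frames_at_interval):
    # dataset_root/<category>/<clip> layout and the read_frames_at_interval(path,
    # interval_s, n) frame-decoding helper are assumptions of this sketch.
    X, R = [], []
    categories = sorted(os.listdir(dataset_root))[:num_classes]
    for label_index, category in enumerate(categories):
        category_dir = os.path.join(dataset_root, category)
        for clip_name in sorted(os.listdir(category_dir)):
            clip_path = os.path.join(category_dir, clip_name)
            frames = read_frames_at_interval(clip_path, interval_s=0.1, n=n_frames)
            X.append(np.stack(frames))                    # x_i: n RGB frames
            R.append(one_hot(label_index, num_classes))   # r_i: one-hot label
    return X, R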
Step 202, constructing a model architecture of the video classification model.
Fig. 3 is a schematic structural diagram of the video classification model. The space-time convolutional neural network model 32 comprises a plurality of space-time convolutional neural network layers 321 (for example, the receptive field may be 7x7x7), a plurality of space-time maximum pooling layers 322 (for example, the receptive field may be 1x3x3, 3x3x3 or 1x2x2), and a plurality of Blocks 323 (a Block is a processing unit in the model and may comprise space-time convolutional neural network layers and space-time maximum pooling layers); the 3D-VLAD model 33 includes a plurality of 3D-VLAD layers 331, 332, 333 and 334 and a space-time local aggregation description fusion feature 335; the classification recognition model 34 includes three fully-connected layers 341, 342 and 343, and illustratively, the number of output neurons of the fully-connected layers 341 and 342 may be 1024, while the number of output neurons of the fully-connected layer 343 is C, where C represents the number of video categories. Reference numeral 31 denotes the data input to the space-time convolutional neural network layer 321.
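For illustration only (the patent does not specify an implementation framework), a minimal PyTorch sketch of how the structure of FIG. 3 could be wired together is given below. Block and VLAD3D refer to the modules sketched under steps 1) and 2) below; the channel widths, the number of Blocks and the strides are assumptions of the example, while the 7x7x7 stem, the 1x3x3 pooling, the 3D-VLAD branches on the last pooled feature maps and the 1024/1024/C fully-connected head follow the description of FIG. 3.

import torch
import torch.nn as nn

class VideoClassificationModel(nn.Module):
    # FIG. 3: a space-time CNN backbone whose last n max-pooling outputs feed
    # separate 3D-VLAD layers; their outputs are concatenated into the fusion
    # feature V and passed through three fully-connected layers.
    def __init__(self, num_classes, vlad_dim=512):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=7, stride=(1, 2, 2), padding=3),   # 7x7x7 receptive field
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        block_channels = [(64, 128), (128, 256), (256, 512), (512, 512)]     # placeholder widths
        self.blocks = nn.ModuleList([Block(c_in, c_out) for c_in, c_out in block_channels])
        self.vlad_layers = nn.ModuleList([VLAD3D(c_out, out_dim=vlad_dim)
                                          for _, c_out in block_channels])
        self.classifier = nn.Sequential(
            nn.Linear(vlad_dim * len(block_channels), 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):                              # x: (batch, 3, frames, H, W)
        x = self.stem(x)
        pooled_maps = []
        for block in self.blocks:
            x = block(x)
            pooled_maps.append(x)                      # output of each Block's max-pooling layer
        n = len(self.vlad_layers)                      # the last n pooled feature maps are used
        vlad_features = [vlad(m) for vlad, m in zip(self.vlad_layers, pooled_maps[-n:])]
        fused = torch.cat(vlad_features, dim=1)        # fusion feature V = [v_1, ..., v_n]
        return self.classifier(fused)                  # raw output values o_1, ..., o_C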
The process of building a video classification model at step 202 may include:
1) Constructing the space-time convolutional neural network model. The video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are sequentially connected.
The space-time convolutional neural network is a network model formed by stacking and combining a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers; it can capture motion information in a video sequence and extract spatial and temporal 3D features. FIG. 4 is a schematic model diagram of the Block in FIG. 3, which includes a space-time convolutional neural network layer 321 (illustratively, the receptive field may be 1x1x1 or 3x3x3) and a space-time maximum pooling layer 322 (illustratively, the receptive field may be 3x3x3). The space-time convolutional neural network layer comprises the formula:
O_j = f( Σ_{i=1}^{n_I} I_i * W_ij + b_j ), j = 1, 2, ..., n_O
O = {O_j | j = 1, 2, ..., n_O}
wherein I_i is the i-th space-time feature map of the input I of the space-time convolutional neural network layer; O is the output of the space-time convolutional neural network layer, and O_j is the j-th space-time feature map of O; W_ij is the convolution kernel connecting I_i and O_j; n_I is the number of space-time feature maps input to the space-time convolutional neural network layer, and n_O is the number of space-time feature maps output by the space-time convolutional neural network layer; b_j is the bias parameter of O_j; f(·) is an activation function; and * represents the 3D convolution operation.
The spatiotemporal maximum pooling layer comprises the formula:
y_m^(i, j, t) = max{ O_m^(i + r_1, j + r_2, t + r_3) | 0 ≤ r_1 < k_1, 0 ≤ r_2 < k_2, 0 ≤ r_3 < k_3 }
Y = {y_m | m = 1, 2, ..., N}
wherein Y is the feature tensor output by the space-time maximum pooling layer; O_m^(i + r_1, j + r_2, t + r_3) is the feature value at the (i + r_1)-th frame, (j + r_2)-th row and (t + r_3)-th column of the m-th space-time feature map O_m of O; y_m^(i, j, t) is the feature value at the i-th frame, j-th row and t-th column of the m-th space-time feature map y_m of Y; p_1, p_2, p_3 are the dimensions of O_m; and k_1, k_2, k_3 are the dimensions of the pooling kernel of the space-time maximum pooling layer.
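Purely as an illustration (no implementation framework is specified in the patent), the two formulas above correspond to a standard 3D convolution followed by 3D max pooling, so a Block of FIG. 4 could be sketched in PyTorch as follows; the 1x1x1/3x3x3 convolution receptive fields and the 3x3x3 pooling kernel follow the description above, while the ReLU activation, the channel widths and the stride of 2 are assumptions of the example.

import torch.nn as nn

class Block(nn.Module):
    # FIG. 4: space-time convolutional layers (1x1x1 and 3x3x3 receptive fields)
    # followed by a space-time maximum pooling layer (3x3x3 kernel).
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size=1)              # 1x1x1
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1)  # 3x3x3
        self.relu = nn.ReLU(inplace=True)                                             # activation f(.)
        self.pool = nn.MaxPool3d(kernel_size=3, stride=2, padding=1)                  # 3x3x3 pooling

    def forward(self, x):                  # x: (batch, n_I, frames, H, W)
        x = self.relu(self.conv1(x))       # O_j = f( sum_i I_i * W_ij + b_j )
        x = self.relu(self.conv2(x))
        return self.pool(x)                # y_m: max over each k1 x k2 x k3 window of O_m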
2) Constructing the 3D-VLAD model.
The 3D-VLAD model takes the three-dimensional feature maps of different sizes output by the last n maximum pooling layers in the space-time convolutional neural network model as the respective inputs of the 3D-VLAD layers, and extracts space-time local aggregation description features at n different scales; each space-time local aggregation description feature v_i has a dimension of 512, and these features v_i are spliced into a space-time local aggregation description fusion feature V = [v_1, v_2, ..., v_i, ..., v_n] with length 512·n. By extracting space-time local aggregation description features from the feature tensor Y output by a maximum pooling layer, the 3D-VLAD layer can capture statistical information about how the local features of the three-dimensional feature map aggregate over the global video time sequence, and is a method for expressing aggregated local features as global features.
Y is a feature tensor with dimension N × W × H × D, where W is the width of the space-time feature map, H is the height of the space-time feature map, and D is the number of channels of the space-time feature map. The 3D-VLAD model is used for:
converting Y into a feature map M with dimension L × D (where L = N × W × H), and converting the feature map M into a feature matrix G with dimension K × D through a conversion formula, wherein the conversion formula comprises:
Z = M·W + B
A = softmax(Z)
G = Aᵀ·M − sum(A,1)ᵀ ⊙ Q
wherein W and B are the parameters of a fully-connected layer with K output neurons, and Z represents the output of the fully-connected layer; softmax(·) is the normalized exponential function, and A is the output of the normalized exponential function; sum(·,1) denotes column-wise summation of a matrix; ⊙ denotes the element-wise product between matrices; Aᵀ is the transpose of the matrix A; and Q is a cluster center matrix parameter with dimension K × D.
The feature matrix G is transformed into a feature vector of length K·D.
The feature vector of length K·D is passed through an L2-norm normalization layer and a fully-connected layer with 512 output neurons to obtain the space-time local aggregation description feature v_i. The plurality of space-time local aggregation description features obtained by passing the last specified number of space-time maximum pooling layers through the 3D-VLAD layers are spliced to form the fusion feature vector V = [v_1, v_2, ..., v_n].
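As an illustrative sketch of the 3D-VLAD layer just described, a minimal PyTorch module follows. The residual form G = Aᵀ·M − sum(A,1)ᵀ ⊙ Q matches the conversion formula above under the standard VLAD soft-assignment reading, while the framework choice and the default number of clusters K are assumptions of the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VLAD3D(nn.Module):
    # 3D-VLAD layer: reshape the pooled feature tensor Y (N x W x H x D) into
    # M (L x D), soft-assign each local feature to K cluster centres, aggregate
    # into G (K x D), flatten, L2-normalize and project to a 512-d feature v_i.
    def __init__(self, in_channels, num_clusters=64, out_dim=512):
        super().__init__()
        self.assign = nn.Linear(in_channels, num_clusters)                   # Z = M.W + B
        self.centers = nn.Parameter(torch.randn(num_clusters, in_channels))  # Q (K x D)
        self.proj = nn.Linear(num_clusters * in_channels, out_dim)           # FC with 512 outputs

    def forward(self, y):                           # y: (batch, D, N, H, W) in PyTorch layout
        m = y.flatten(2).transpose(1, 2)            # M: (batch, L, D), L = N*H*W
        a = F.softmax(self.assign(m), dim=-1)       # A = softmax(Z), shape (batch, L, K)
        # G = A^T . M - sum(A,1)^T * Q  (soft-assignment VLAD aggregation, broadcast over clusters)
        g = a.transpose(1, 2) @ m - a.sum(dim=1).unsqueeze(-1) * self.centers
        v = F.normalize(g.flatten(1), dim=1)        # L2 normalization of the K*D vector
        return self.proj(v)                         # space-time local aggregation feature v_i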
3) Constructing the classification recognition model.
The classification recognition model is used for sequentially passing the space-time local aggregation description fusion feature V through three fully-connected layers, wherein the number of neurons of the last fully-connected layer among the three fully-connected layers is C, and C is the number of video categories in the model training sample set.
Determining a classification result according to an output value of the last full-connection layer and a probability formula, wherein the probability formula comprises the following steps:
p(o_t) = e^(o_t) / Σ_{k=1}^{C} e^(o_k)
wherein p(o_t) is the probability value that the video data to be identified belongs to the t-th class; o_t denotes the t-th output value of the last fully-connected layer, and o_k denotes the k-th output value of the last fully-connected layer; e denotes the natural constant. p(o_t) gives the probability value that the video data to be identified belongs to each of the C categories of videos. Illustratively, if the probability that the video data to be identified belongs to the animation category is 85%, the probability that it belongs to the funny-video category is 10%, and the probability that it belongs to the literature-and-art category is 5%, the highest probability value may be used as the decision criterion, in which case the video data to be identified belongs to the animation category. Other decision criteria may also be used, which is not limited in the embodiments of the present invention.
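As an illustration of step 3) only, a minimal PyTorch sketch of the classification recognition model follows; the three fully-connected layers, the C output neurons and the softmax probability p(o_t) follow the description, while the 1024-neuron widths come from the example given for FIG. 3 and the ReLU activations are an assumption of the example.

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    # Three fully-connected layers; the last one has C output neurons o_1..o_C,
    # and softmax turns them into p(o_t) = e^{o_t} / sum_k e^{o_k}.
    def __init__(self, fused_dim, num_classes):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(fused_dim, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes),
        )

    def forward(self, fused_v):                 # fused_v: the fusion feature V, (batch, 512*n)
        return self.fc(fused_v)                 # raw output values o_1, ..., o_C

    def predict_proba(self, fused_v):
        # Probability for each of the C video categories; the highest value gives
        # the predicted category, as in the animation-video example above.
        return torch.softmax(self.forward(fused_v), dim=1)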
Step 203, performing data expansion on the model training sample set by a data enhancement method that dynamically and randomly adjusts the number of extracted frames and the frame extraction interval.
When training the network model, for each batch of samples input into the network, a number μ_1 is randomly selected from [1, 2, 4] as the frame extraction interval, and a number μ_2 is randomly selected from [4, 8, 16] as the number of frames to extract; then one frame is taken from the time-series frame sample x_i every μ_1 frames until μ_2 frames are extracted, and these frames are used to represent the time-series frame sample x_i. This achieves the purpose of improving the identification effect.
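Purely as an illustration of this dynamic, random frame-sampling strategy, the following Python sketch draws μ_1 from [1, 2, 4] and μ_2 from [4, 8, 16] once per batch and resamples every sample accordingly; padding by repeating the last frame when a clip is too short is an assumption of the sketch.

import random
import numpy as np

def resample_clip(frames, mu1, mu2):
    # frames: numpy array of shape (num_frames, H, W, 3).
    # Take one frame every mu1 frames until mu2 frames are collected.
    picked = frames[::mu1][:mu2]
    # Assumption: if the clip is too short, repeat the last frame to reach mu2.
    while len(picked) < mu2:
        picked = np.concatenate([picked, picked[-1:]], axis=0)
    return picked

def augment_batch(batch_of_clips):
    # For each batch, randomly choose the frame extraction interval mu1 and the
    # number of extracted frames mu2, then resample every sample x_i in the batch.
    mu1 = random.choice([1, 2, 4])
    mu2 = random.choice([4, 8, 16])
    return [resample_clip(clip, mu1, mu2) for clip in batch_of_clips]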
Step 204, optimizing the video classification model by taking the model training sample set as training data, according to a loss function and a gradient descent method.
The loss function describes the loss of the system under different parameter values; when the loss function is minimized, the degree of fitting is best, and the corresponding model parameters are the optimal parameters. The minimization of the loss function can be solved iteratively, step by step, through a gradient descent method, yielding the minimized loss function and the corresponding model parameter values. For details of this method, reference may be made to the related art, which is not described in detail in the embodiments of the present invention.
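Purely as an illustration of step 204, a minimal PyTorch training loop is sketched below; cross-entropy as the concrete loss function, SGD as the gradient-descent optimizer, the learning rate and the number of epochs are assumptions not fixed by the description, and the model is assumed to output the raw values o_t (softmax is applied only when reading out probabilities).

import torch
import torch.nn as nn

def train(model, data_loader, num_epochs=30, lr=0.01):
    # Optimize the video classification model with a loss function and gradient descent.
    criterion = nn.CrossEntropyLoss()                                    # assumed loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for epoch in range(num_epochs):
        for clips, labels in data_loader:    # clips: (batch, 3, frames, H, W); labels: class indices
            optimizer.zero_grad()
            outputs = model(clips)           # raw outputs o_1, ..., o_C
            loss = criterion(outputs, labels)  # loss under the current parameter values
            loss.backward()                  # gradients for the gradient descent step
            optimizer.step()                 # iterative parameter update
    # In practice, optimization stops once the video classification model converges (step 205).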
Step 205, stopping the optimization when the video classification model converges.
End-to-end model optimization is performed on the whole model until the video classification model converges, at which point the optimization stops.
The trained and optimized video classification model is then tested and verified with test samples, and the video classification model is trained again according to the test results, so as to improve the identification effect. The test samples may be preset video data to be identified.
The steps 201 to 205 are processes of constructing a model, and the subsequent steps 206 to 208 are processes of applying the constructed video classification model.
Step 206, acquiring the video data to be identified.
The video data to be identified is acquired and passed to step 207. The video data to be identified here is different from the video data in step 201: it is arbitrary video data that the user wants to classify.
Step 207, inputting the video data to be identified into the video classification model; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features.
The video data to be identified is input into the model constructed in step 202, where the video classification model includes a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model includes a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence. The model is used to process the video data to be identified.
Step 208, obtaining the classification result of the video data to be identified output by the video classification model.
The classification result of the video data to be identified may include a probability value that the video data to be identified belongs to each of the C categories of videos. The classification result obtained from the video classification model takes into account the value of every dimension of the space-time feature points, so that local video information is depicted more finely and the video classification model has a good capability of capturing fine motion changes. By extracting 3D-VLAD features from the feature maps of the last n layers, multi-scale feature fusion is achieved, which can effectively improve the recognition rate. The data enhancement method that dynamically and randomly adjusts the number of extracted frames and the frame extraction interval adapts to the inconsistent pace of motion changes in videos and the inconsistent span of video content, thereby enhancing the generalization capability of the model. Combined with training and tuning, the method improves the identification accuracy and realizes end-to-end video identification.
In summary, with the video category identification method provided by the embodiment of the present invention, video data to be identified is acquired and input into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features; the classification result of the video data to be identified output by the video classification model is then obtained. Because the space-time local aggregation description features of a plurality of space-time feature maps are input into the classification recognition model, a more fine-grained classification result is obtained. This solves the problem in the prior art that the capability of capturing fine motion changes in video data is poor, which leads to low accuracy of the classification result, and achieves the effect of improving accuracy.
Fig. 5 is a schematic structural diagram of a video category identification device according to an embodiment of the present application, and as shown in fig. 5, the video category identification device 500 includes:
And the data obtaining module 501 is configured to obtain video data to be identified.
The data processing module 502 is configured to input the video data to be identified into a video classification model, where the video classification model includes a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model includes a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is configured to take the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is configured to obtain a classification result according to the space-time local aggregation description features.
And the result obtaining module 503 is configured to obtain a classification result of the to-be-identified video data output by the video classification model.
In summary, with the video category identification apparatus provided by the embodiment of the present invention, video data to be identified is acquired and input into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features; the classification result of the video data to be identified output by the video classification model is then obtained. Because the space-time local aggregation description features of a plurality of space-time feature maps are input into the classification recognition model, a more fine-grained classification result is obtained. This solves the problem in the prior art that the capability of capturing fine motion changes in video data is poor, which leads to low accuracy of the classification result, and achieves the effect of improving accuracy.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention. Specifically, the method comprises the following steps:
The server 600 includes a Central Processing Unit (CPU) 601, a system Memory 604 including a Random Access Memory (RAM) 602 and a Read Only Memory (ROM) 603, and a system bus 605 connecting the system Memory 604 and the Central Processing Unit 601. The server 600 also includes a basic input/output system (I/O system) 606, which facilitates the transfer of information between devices within the computer, and a mass storage device 607, which stores an operating system 613, application programs 614, and other program modules 615.
The basic input/output system 606 includes a display 608 for displaying information and an input device 609 such as a mouse, keyboard, etc. for user input of information. Wherein a display 608 and an input device 609 are connected to the central processing unit 601 through an input output controller 610 connected to the system bus 605. The basic input/output system 606 may also include an input/output controller 610 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input/output controller 610 may also provide output to a display screen, a printer, or other type of output device.
the mass storage device 607 is connected to the central processing unit 601 through a mass storage controller (not shown) connected to the system bus 605. The mass storage device 607 and its associated computer-readable media provide non-volatile storage for the server 600. That is, mass storage device 607 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM (compact disk Read-Only Memory) drive.
computer-readable media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Versatile disk), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 604 and mass storage device 607 described above may be collectively referred to as memory.
According to various embodiments of the present invention, the server 600 may also be operated by means of a remote computer connected through a network such as the Internet. That is, the server 600 may be connected to the network 612 through the network interface unit 611 connected to the system bus 605, or may be connected to another type of network or a remote computer system (not shown) using the network interface unit 611.
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
The present application further provides a computer device comprising a processor and a memory, wherein at least one instruction, at least one program, code set, or instruction set is stored in the memory, and the at least one instruction, the at least one program, code set, or instruction set is loaded and executed by the processor to implement the video category identification method described above.
The present application further provides a computer-readable storage medium storing instructions that, when executed by the video category identification apparatus, cause the video category identification apparatus to implement the video category identification method provided in the foregoing embodiments.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for video category identification, the method comprising:
Acquiring video data to be identified;
Inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features;
and obtaining the classification result of the video data to be identified output by the video classification model.
2. The method of claim 1, wherein prior to obtaining the video data to be identified, the method further comprises:
Obtaining a model training sample set, wherein the model training sample set comprises a plurality of types of video sets, and each type of video set comprises a plurality of video data;
Optimizing the video classification model through the model training sample set;
Stopping optimization when the video classification model converges.
3. The method of claim 2, wherein the optimizing the video classification model by the model training sample set comprises:
and optimizing the video classification model by taking the model training sample set as training data according to a loss function and a gradient descent method.
4. The method of claim 3, wherein before optimizing the video classification model according to a loss function and a gradient descent method using the model training sample set as training data, the method further comprises:
Performing data expansion on the model training sample set by a data enhancement method that dynamically and randomly adjusts the number of extracted frames and the frame extraction interval.
5. The method of claim 1, wherein the spatio-temporal convolutional neural network layer comprises the formula:
O_j = f( Σ_{i=1}^{n_I} I_i * W_ij + b_j ), j = 1, 2, ..., n_O
O = {O_j | j = 1, 2, ..., n_O}
wherein I_i is the i-th space-time feature map of the input I of the space-time convolutional neural network layer; O is the output of the space-time convolutional neural network layer, and O_j is the j-th space-time feature map of O; W_ij is the convolution kernel connecting I_i and O_j; n_I is the number of space-time feature maps input to the space-time convolutional neural network layer, and n_O is the number of space-time feature maps output by the space-time convolutional neural network layer; b_j is the bias parameter of O_j; f(·) is an activation function; and * denotes the 3D convolution operation;
the spatiotemporal maximum pooling layer comprises the formula:
y_m^(i, j, t) = max{ O_m^(i + r_1, j + r_2, t + r_3) | 0 ≤ r_1 < k_1, 0 ≤ r_2 < k_2, 0 ≤ r_3 < k_3 }
Y = {y_m | m = 1, 2, ..., N}
wherein Y is the feature tensor output by the space-time maximum pooling layer; O_m^(i + r_1, j + r_2, t + r_3) is the feature value at the (i + r_1)-th frame, (j + r_2)-th row and (t + r_3)-th column of the m-th space-time feature map O_m of O; y_m^(i, j, t) is the feature value at the i-th frame, j-th row and t-th column of the m-th space-time feature map y_m of Y; p_1, p_2, p_3 are the dimensions of O_m; and k_1, k_2, k_3 are the dimensions of the pooling kernel of the space-time maximum pooling layer.
6. The method of claim 5, wherein Y is a feature tensor with dimensions N × W × H × D, W is the width of the space-time feature map, H is the height of the space-time feature map, and D is the number of channels of the space-time feature map, and the 3D-VLAD model is configured to:
Converting the Y into a feature map M with dimension L multiplied by D, and converting the feature map M into a feature matrix G with dimension K multiplied by D through a conversion formula, wherein the conversion formula comprises the following steps:
Z = M·W + B
A = softmax(Z)
G = Aᵀ·M − sum(A,1)ᵀ ⊙ Q
wherein W and B are the parameters of a fully-connected layer with K output neurons, and Z represents the output of the fully-connected layer; softmax(·) is the normalized exponential function, and A is the output of the normalized exponential function; sum(·,1) denotes column-wise summation of a matrix; ⊙ denotes the element-wise product between matrices; Aᵀ is the transpose of the matrix A; and Q is a cluster center matrix parameter with dimension K × D;
Transforming the feature matrix G into a feature vector with the length of K.D;
Passing the feature vector of length K·D through an L2-norm normalization layer and a fully-connected layer to obtain the space-time local aggregation description feature v;
Splicing the plurality of space-time local aggregation description features v obtained by passing the last specified number of space-time maximum pooling layers through 3D-VLAD layers to form a fusion feature vector V = [v_1, v_2, ..., v_n].
7. the method of claim 6, wherein the classification recognition model is used to:
Sequentially passing the space-time local aggregation description fusion characteristic vector V through three full-connection layers, wherein the number of neurons of the last full-connection layer in the three full-connection layers is C, and C is the number of video categories in the model training sample set;
Determining the classification result by using the output value of the last full connection layer and a probability formula, wherein the probability formula comprises:
p(o_t) = e^(o_t) / Σ_{k=1}^{C} e^(o_k)
wherein p(o_t) is the probability value that the video data to be identified belongs to the t-th class; o_t denotes the t-th output value of the last fully-connected layer, and o_k denotes the k-th output value of the last fully-connected layer; and e denotes the natural constant.
8. An apparatus for video category identification, the apparatus comprising:
The data acquisition module is used for acquiring video data to be identified;
The data processing module is used for inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers as input to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features;
and the result acquisition module is used for acquiring the classification result of the video data to be identified output by the video classification model.
9. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the video category identification method according to any one of claims 1 to 7.
10. A computer storage medium having stored therein instructions that, when run on a computer, cause the computer to perform the video category identification method of any one of claims 1 to 7.
CN201910862697.6A 2019-09-12 2019-09-12 Video category identification method, device, computer equipment and computer storage medium Active CN110569814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910862697.6A CN110569814B (en) 2019-09-12 2019-09-12 Video category identification method, device, computer equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910862697.6A CN110569814B (en) 2019-09-12 2019-09-12 Video category identification method, device, computer equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN110569814A (en) 2019-12-13
CN110569814B (en) 2023-10-13

Family

ID=68779708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910862697.6A Active CN110569814B (en) 2019-09-12 2019-09-12 Video category identification method, device, computer equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN110569814B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179246A (en) * 2019-12-27 2020-05-19 中国科学院上海微系统与信息技术研究所 Pixel displacement confirming method and device, electronic equipment and storage medium
CN111400551A (en) * 2020-03-13 2020-07-10 咪咕文化科技有限公司 Video classification method, electronic equipment and storage medium
CN111539289A (en) * 2020-04-16 2020-08-14 咪咕文化科技有限公司 Method and device for identifying action in video, electronic equipment and storage medium
CN111598026A (en) * 2020-05-20 2020-08-28 广州市百果园信息技术有限公司 Action recognition method, device, equipment and storage medium
CN112149736A (en) * 2020-09-22 2020-12-29 腾讯科技(深圳)有限公司 Data processing method, device, server and medium
CN113255616A (en) * 2021-07-07 2021-08-13 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN114004645A (en) * 2021-10-29 2022-02-01 浙江省民营经济发展中心(浙江省广告监测中心) Converged media advertisement intelligent monitoring platform and electronic equipment
CN116524240A (en) * 2023-03-30 2023-08-01 国网智能电网研究院有限公司 Electric power operation scene violation behavior identification model, method, device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015030606A2 (en) * 2013-08-26 2015-03-05 Auckland University Of Technology Improved method and system for predicting outcomes based on spatio / spectro-temporal data
CN106845329A (en) * 2016-11-11 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 Action recognition method based on deep convolution feature multi-channel pyramid pooling
US20180053057A1 (en) * 2016-08-18 2018-02-22 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
WO2019013711A1 (en) * 2017-07-12 2019-01-17 Mastercard Asia/Pacific Pte. Ltd. Mobile device platform for automated visual retail product recognition
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition method based on local feature aggregation descriptors and temporal relation network
CN110188653A (en) * 2019-05-27 2019-08-30 东南大学 Activity recognition method based on local feature aggregation coding and long short-term memory network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015030606A2 (en) * 2013-08-26 2015-03-05 Auckland University Of Technology Improved method and system for predicting outcomes based on spatio / spectro-temporal data
US20180053057A1 (en) * 2016-08-18 2018-02-22 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
CN106845329A (en) * 2016-11-11 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 Action recognition method based on deep convolution feature multi-channel pyramid pooling
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
WO2019013711A1 (en) * 2017-07-12 2019-01-17 Mastercard Asia/Pacific Pte. Ltd. Mobile device platform for automated visual retail product recognition
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition method based on local feature aggregation descriptors and temporal relation network
CN110188653A (en) * 2019-05-27 2019-08-30 东南大学 Activity recognition method based on local feature aggregation coding and long short-term memory network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DEBIAO ZHANG et al.: "Diabetic Retinopathy Classification using Deeply Supervised ResNet", 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation *
LUO Huilan et al.: "A Survey of Progress in Deep Learning-Based Human Action Recognition in Video", Acta Electronica Sinica (《电子学报》) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179246A (en) * 2019-12-27 2020-05-19 中国科学院上海微系统与信息技术研究所 Pixel displacement confirming method and device, electronic equipment and storage medium
CN111179246B (en) * 2019-12-27 2021-01-29 中国科学院上海微系统与信息技术研究所 Pixel displacement confirming method and device, electronic equipment and storage medium
CN111400551A (en) * 2020-03-13 2020-07-10 咪咕文化科技有限公司 Video classification method, electronic equipment and storage medium
CN111400551B (en) * 2020-03-13 2022-11-15 咪咕文化科技有限公司 Video classification method, electronic equipment and storage medium
CN111539289A (en) * 2020-04-16 2020-08-14 咪咕文化科技有限公司 Method and device for identifying action in video, electronic equipment and storage medium
CN111598026A (en) * 2020-05-20 2020-08-28 广州市百果园信息技术有限公司 Action recognition method, device, equipment and storage medium
CN112149736A (en) * 2020-09-22 2020-12-29 腾讯科技(深圳)有限公司 Data processing method, device, server and medium
CN112149736B (en) * 2020-09-22 2024-02-09 腾讯科技(深圳)有限公司 Data processing method, device, server and medium
CN113255616A (en) * 2021-07-07 2021-08-13 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN114004645A (en) * 2021-10-29 2022-02-01 浙江省民营经济发展中心(浙江省广告监测中心) Converged media advertisement intelligent monitoring platform and electronic equipment
CN116524240A (en) * 2023-03-30 2023-08-01 国网智能电网研究院有限公司 Electric power operation scene violation behavior identification model, method, device and storage medium

Also Published As

Publication number Publication date
CN110569814B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN110569814B (en) Video category identification method, device, computer equipment and computer storage medium
WO2021093468A1 (en) Video classification method and apparatus, model training method and apparatus, device and storage medium
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
Makantasis et al. Deep learning based human behavior recognition in industrial workflows
CN112580458B (en) Facial expression recognition method, device, equipment and storage medium
CN112749666B (en) Training and action recognition method of action recognition model and related device
CN114494981B (en) Action video classification method and system based on multi-level motion modeling
CN110457523B (en) Cover picture selection method, model training method, device and medium
WO2023206944A1 (en) Semantic segmentation method and apparatus, computer device, and storage medium
CN114612414B (en) Image processing method, model training method, device, equipment and storage medium
CN117197727B (en) Global space-time feature learning-based behavior detection method and system
US20230021551A1 (en) Using training images and scaled training images to train an image segmentation model
CN111523421A (en) Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
CN109492610B (en) Pedestrian re-identification method and device and readable storage medium
CN112786160A (en) Multi-image input multi-label gastroscope image classification method based on graph neural network
CN111898614B (en) Neural network system and image signal and data processing method
CN114511733A (en) Fine-grained image identification method and device based on weak supervised learning and readable medium
CN113761282A (en) Video duplicate checking method and device, electronic equipment and storage medium
CN112613442A (en) Video sequence emotion recognition method based on principle angle detection and optical flow conversion
CN115294441A (en) Robot scene recognition and analysis method integrating three characteristics by attention
CN114782995A (en) Human interaction behavior detection method based on self-attention mechanism
CN115131807A (en) Text processing method, text processing device, storage medium and equipment
CN115063831A (en) High-performance pedestrian retrieval and re-identification method and device
CN114387489A (en) Power equipment identification method and device and terminal equipment
CN113971830A (en) Face recognition method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant