US20190384985A1 - Video classification method, information processing method, and server - Google Patents


Info

Publication number
US20190384985A1
Authority
US
United States
Prior art keywords
video frame
feature sequence
feature
video
neural network
Prior art date
Legal status
Granted
Application number
US16/558,015
Other versions
US10956748B2 (en)
Inventor
Yongyi TANG
Lin Ma
Wei Liu
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Assigned to TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED (assignment of assignors' interest; see document for details). Assignors: TANG, Yongyi; LIU, Wei; MA, Lin
Publication of US20190384985A1 publication Critical patent/US20190384985A1/en
Application granted granted Critical
Publication of US10956748B2 publication Critical patent/US10956748B2/en
Status: Active

Classifications

    • G06K9/00718
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06K9/6267
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0454
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences

Definitions

  • the present disclosure relates to the field of computer technologies and, in particular, to a video classification technology.
  • a currently used video classification method mainly includes: first performing feature extraction on each video frame in a to-be-marked video, and then converting, by using an average feature method, a frame-level feature into a video-level feature, and finally, transmitting the video-level feature into a classification network for classification.
  • the embodiments of the present invention provide a video classification method, an information processing method, and a server.
  • the feature change of the video in the time dimension is also considered, so that video content can be better represented, the accuracy of video classification is improved, and the effect of video classification is improved.
  • a video classification method for a computer device includes obtaining a to-be-processed video, where the to-be-processed video has a plurality of video frames, and each video frame corresponds to one time feature; sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence, where the time-feature sampling rule is a correspondence between time features and video frame feature sequences.
  • the method also includes processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, where the first neural network model is a recurrent neural network model; and processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence.
  • a non-transitory computer-readable storage medium stores computer program instructions executable by at least one processor to perform: obtaining a to-be-processed video, the to-be-processed video having a plurality of video frames, and each video frame corresponding to one time feature; sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence, the time-feature sampling rule being a correspondence between time features and video frame feature sequences; processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model; and processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence.
  • a server in another aspect of the present disclosure, includes a memory storing computer program instructions; and a processor coupled to the memory.
  • the processor When executing the computer program instructions, the processor is configured to perform: obtaining a to-be-processed video, the to-be-processed video having a plurality of video frames, and each video frame corresponding to one time feature; sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence, the time-feature sampling rule being a correspondence between time features and video frame feature sequences; processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model; and processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence.
  • FIG. 1 is a schematic architectural diagram of information processing according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an information processing method according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a to-be-processed video according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a convolutional neural network having an inception structure according to an embodiment of the present disclosure
  • FIG. 5 is a schematic structural diagram of a first neural network model according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of a second neural network model according to an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of a server according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of another server according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of another server according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic diagram of another server according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram of another server according to an embodiment of the present disclosure.
  • FIG. 12 is a schematic diagram of another server according to an embodiment of the present disclosure.
  • FIG. 13 is a schematic diagram of another server according to an embodiment of the present disclosure.
  • FIG. 14 is a schematic diagram of another server according to an embodiment of the present disclosure.
  • FIG. 15 is a schematic structural diagram of a server according to an embodiment of the present disclosure.
  • the embodiments of the present disclosure provide a video classification method, an information processing method, and a server.
  • the feature change of the video in a time dimension is also considered, so that video content can be better represented, the accuracy of video classification is improved, and the effect of video classification is improved.
  • a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units that are expressly listed, but may include other steps or units not expressly listed or inherent to such a process, method, product, or device.
  • solutions in the present disclosure may be mainly used to provide a video content classification service.
  • a backend computer device performs feature extraction, time sequence modeling, and feature compression on a video, and finally classifies video features by using a mixed expert model, so that automatic classification and labeling on the video are implemented on the computer device.
  • Such solutions may be deployed on a video website to add keywords to videos on the website, which facilitates quick search and content matching as well as personalized video recommendation.
  • FIG. 1 is a schematic architectural diagram of information processing according to an embodiment of the present disclosure.
  • a computer device obtains a to-be-processed video. It can be learned from FIG. 1 that the to-be-processed video includes a plurality of video frames, and each video frame corresponds to a time feature, and different time features may be represented by t.
  • the computer device processes each video frame in the to-be-processed video by using a convolutional neural network, to obtain a time feature corresponding to each video frame.
  • the computer device determines a time feature sequence of the to-be-processed video according to the time feature corresponding to each video frame.
  • the time feature sequence is deep learning representation at the frame level.
  • the computer device may sample the to-be-processed video according to a time-feature sampling rule.
  • the time-feature sampling rule refers to sampling video features at different frame rates in a time dimension, and obtaining at least one video frame feature sequence.
  • the video frame feature sequences correspond to different time scales.
  • the computer device inputs video frame feature sequences corresponding to different time scales into bidirectional recurrent neural networks, respectively, to obtain a feature representation result corresponding to at least one video frame feature sequence.
  • the feature representation result is video feature representation in a time scale.
  • the computer device inputs all feature representation results into a second neural network, namely, the mixed expert model, and obtains a prediction result corresponding to each video frame feature sequence, and can determine a category of the to-be-processed video according to the prediction results, to classify the to-be-processed video.
  • In common video data, a user usually describes and comments on video information and provides personalized label data, forming rich text information related to online videos.
  • the text information may also be used as basis for video classification.
  • The following describes the information processing method in the present disclosure by using a server as an execution entity. It is to be understood that the information processing method in the present disclosure can be applied not only to the server but also to any other computer device. Referring to FIG. 2 , the information processing method in one embodiment of the present disclosure includes:
  • the server first obtains the to-be-processed video.
  • FIG. 3 is a schematic diagram of the to-be-processed video in one embodiment of the present disclosure.
  • the to-be-processed video includes a plurality of video frames.
  • each picture in FIG. 3 is a video frame, and each video frame corresponds to a time feature.
  • the to-be-processed video corresponds to a period of play time, so each video frame corresponds to a different play moment. Assuming that the time feature of the first video frame in the to-be-processed video is "1" and the time feature of the second video frame is "2", then by analogy the time feature of the T-th video frame is "T".
  • 102 Sample the to-be-processed video according to a time-feature sampling rule, and obtain at least one video frame feature sequence, the time-feature sampling rule being a correspondence between a time feature and a video frame feature sequence.
  • the server samples the to-be-processed video according to the time-feature sampling rule.
  • the time-feature sampling rule includes a preset relationship between a time feature and a video frame feature sequence.
  • one video frame feature sequence may be obtained, or video frame feature sequences of at least two different time scales may be obtained.
  • For video frame feature sequences of different time scales, the number of time features corresponding to each video frame feature is different, and correspondingly, the lengths of the video frame feature sequences corresponding to different time scales are also different.
  • For example, assume that one to-be-processed video has a total of 1000 video frames, and the 1000 video frames respectively correspond to time features 1 to 1000. If the time-feature sampling rule is that each time feature corresponds to one video frame feature, the 1000 time features of the to-be-processed video correspond to 1000 video frame features, and the length of the video frame feature sequence formed by these 1000 video frame features is 1000. If the time-feature sampling rule is that every 100 time features correspond to one video frame feature, the 1000 time features correspond to 10 video frame features, and the length of the video frame feature sequence formed by these 10 video frame features is 10, and so on; details are not described herein.
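  • As an illustration of the sampling rule above (not part of the original disclosure), the following minimal Python/NumPy sketch averages frame-level features over non-overlapping time windows to produce video frame feature sequences of different time scales; the function name, feature dimension, and window sizes are illustrative assumptions.

```python
import numpy as np

def sample_by_scale(frame_features: np.ndarray, window: int) -> np.ndarray:
    """Average frame-level features over non-overlapping windows of `window` time features.

    frame_features: array of shape (T, D), one D-dimensional feature per video frame.
    Returns an array of shape (T // window, D), i.e. the video frame feature
    sequence for this time scale.
    """
    T, D = frame_features.shape
    usable = (T // window) * window                 # drop any trailing remainder
    windows = frame_features[:usable].reshape(-1, window, D)
    return windows.mean(axis=1)

# Example matching the description: 1000 frames, one feature vector per time feature.
features = np.random.rand(1000, 1024)
seq_fine = sample_by_scale(features, 1)             # length 1000
seq_coarse = sample_by_scale(features, 100)         # length 10
print(seq_fine.shape, seq_coarse.shape)             # (1000, 1024) (10, 1024)
```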
  • the server may separately input video frame feature sequences corresponding to different time scales into the first neural network model.
  • the first neural network model is a recurrent neural network model. Then the first neural network model recurses the input at least one video frame feature sequence, and correspondingly outputs a feature representation result of each video frame feature sequence.
  • Different time scales correspond to different lengths of the video frame feature sequences. As described in step 102 , assuming that the total length of the video is T, if each time feature corresponds to one video frame feature, the length of the video frame feature sequence is T; if every 10 time features correspond to one video frame feature, the length of the video frame feature sequence is T/10.
  • the server may separately input the feature representation result corresponding to each video frame feature sequence into the second neural network model, and then after processing each input feature representation result by using the second neural network model, the server outputs the prediction result corresponding to each feature representation result. Finally, the server may determine the category of the to-be-processed video according to the prediction result.
  • the category of the to-be-processed video may be “sports”, “news”, “music”, “animation”, “game”, or the like, and is not limited herein.
  • an information processing method is provided.
  • the server obtains the to-be-processed video, the to-be-processed video including a plurality of video frames, and each video frame corresponding to a time feature, and then samples the to-be-processed video according to the time-feature sampling rule, and obtains the at least one video frame feature sequence, the time-feature sampling rule being the correspondence between the time feature and the video frame feature sequence.
  • the server then inputs the at least one video frame feature sequence into the first neural network model, to obtain the feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model.
  • the server inputs the feature representation result corresponding to the at least one video frame feature sequence into the second neural network model, to obtain the prediction result corresponding to each video frame feature sequence, the prediction result being used to determine the category of the to-be-processed video.
  • a feature change of the video in a time dimension is also considered, so that video content can be better represented, the accuracy of video classification is improved, and the effect of video classification is improved.
  • the method may further include: processing each video frame in the to-be-processed video by using a convolutional neural network CNN, to obtain a time feature corresponding to each video frame; and determining a time feature sequence of the to-be-processed video according to the time feature corresponding to each video frame, the time feature sequence being configured for sampling.
  • the server may process each video frame in the to-be-processed video by using a convolutional neural network (CNN) having an inception structure, and then extract a time feature corresponding to each video frame. Finally, the server determines the time feature sequence of the to-be-processed video according to the time feature of each video frame. Assuming that the time feature of the first video frame of the to-be-processed video is 1, that of the second video frame is 2, and so on up to T for the last video frame, it may be determined that the time feature sequence of the to-be-processed video is 1 to T (seconds).
  • FIG. 4 is a schematic diagram of a convolutional neural network having an inception structure in an embodiment of the present disclosure.
  • the inception structure includes convolutions of three different sizes, namely a 1×1 convolutional layer, a 3×3 convolutional layer, and a 5×5 convolutional layer, as well as a 3×3 maximum pooling layer; the final fully-connected layer is removed and replaced with a global average pooling layer (which changes the size of the feature map to 1×1).
  • In this way, the network depth and the network width may be increased while the number of free parameters is kept small. Therefore, in a same layer of the inception structure, three convolution templates of different sizes, namely the 1×1 convolutional layer, the 3×3 convolutional layer, and the 5×5 convolutional layer, perform convolution in parallel.
  • Feature extraction may be performed with these three convolution templates of different sizes, so the layer acts as a mixed model. Because the maximum pooling layer also has a feature extraction function and, unlike convolution, has no parameters and thus does not overfit, it is used as an additional branch.
  • A 1×1 convolution is first performed before the 3×3 convolution and the 5×5 convolution to reduce the number of input channels, so that the network is deepened while the amount of calculation is reduced.
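  • To make the inception structure described above concrete, the following is a hedged PyTorch sketch of one inception-style block, not the exact network of the disclosure: 1×1 convolutions reduce the channel count before the 3×3 and 5×5 branches, a 3×3 max-pooling branch is included, the branch outputs are concatenated, and a global average pooling layer replaces the fully-connected layer; all channel counts and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Illustrative inception-style block: 1x1, 3x3, and 5x5 convolution branches
    plus a 3x3 max-pooling branch; 1x1 convolutions reduce the number of input
    channels before the larger convolutions, and branch outputs are concatenated."""

    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1),            # channel reduction
            nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1),            # channel reduction
            nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),   # parameter-free branch
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        outs = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(outs, dim=1)                           # concatenate along channels

# A frame-level feature is obtained with global average pooling instead of a
# fully-connected layer, as described above.
block = InceptionBlock(in_ch=192, c1=64, c3_red=96, c3=128, c5_red=16, c5=32, pool_proj=32)
frame = torch.randn(1, 192, 28, 28)                             # one video frame's feature map
frame_feature = nn.AdaptiveAvgPool2d(1)(block(frame)).flatten(1)  # shape (1, 256)
```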
  • the server may further process each video frame in the to-be-processed video by using a convolutional neural network, and obtain a time feature corresponding to each video frame.
  • the time features are used to form a time feature sequence of the entire to-be-processed video.
  • each video frame is trained and processed by using the convolutional neural network, to facilitate improvement of the accuracy and effect of time feature extraction.
  • the sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence may include: determining at least one time window according to the time-feature sampling rule, each time window including at least one video frame of the to-be-processed video; and extracting, from the time feature sequence, a video frame feature sequence corresponding to each time window.
  • how the server obtains the at least one video frame feature sequence is described below.
  • the size of the time window may be predefined manually. A larger number of video frames in one time-window indicates a larger granularity. An averaging operation is performed on content in each time-window, so that the content becomes content of “one frame”.
  • a method for extracting video frame feature sequences in different time scales is described. Namely, at least one time-window is first determined according to the time-feature sampling rule, and each time-window includes at least one video frame in the to-be-processed video, and then a video frame feature sequence corresponding to each time-window is extracted from the time feature sequence.
  • video frame feature sequences in different scales can be obtained, to obtain a plurality of different samples for feature training. In this way, the accuracy of a video classification result is improved.
  • the processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to each video frame feature sequence may include: inputting the at least one video frame feature sequence into a forward recurrent neural network in the first neural network model, to obtain a first representation result; inputting the at least one video frame feature sequence into a backward recurrent neural network in the first neural network model, to obtain a second representation result; and calculating a feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result.
  • FIG. 5 is a schematic structural diagram of a first neural network model in an embodiment of the present disclosure.
  • the entire first neural network model includes two parts, namely, a forward recurrent neural network and a backward recurrent neural network, and each video frame feature sequence is input into the forward recurrent neural network, and then a corresponding first representation result is outputted. Meanwhile, each video frame feature sequence is input into the backward recurrent neural network, and then a corresponding second representation result is outputted.
  • a feature representation result corresponding to the video frame feature sequence can be obtained by directly splicing the first representation result and the second representation result.
  • time sequence modeling may be performed on the video frame feature sequence by using a gated recurrent unit (GRU) based recurrent neural network.
  • a first neural network model can also be used to perform video feature compression.
  • a bidirectional recurrent neural network is used to perform feature compression and representation respectively from forward and backward directions toward a time center point location of the to-be-processed video. In this way, the operability of the solution is improved.
  • the calculating a feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result may include calculating the feature representation result corresponding to the at least one video frame feature sequence by using the following formulas:
  • h = [h_{T/2}^f, h_{T/2}^b];
  • h_t^f = GRU(x_t, h_{t-1}^f), for t ∈ [1, T/2]; and
  • h_t^b = GRU(x_t, h_{t+1}^b), for t ∈ [T, T/2],
  • where h represents a feature representation result of a video frame feature sequence,
  • h_{T/2}^f represents the first representation result,
  • h_{T/2}^b represents the second representation result,
  • x_t represents the video frame feature sequence at the t-th moment,
  • GRU(·) represents use of a gated recurrent unit (GRU) for neural network processing,
  • T represents the total time of the to-be-processed video, and
  • t represents an integer in a range of 1 to T.
  • a bidirectional recurrent neural network may be used to perform feature compression and representation respectively from forward and backward directions toward a video time center point location. Specifically, consider a video frame feature sequence x_t of a particular scale, with t ∈ [1, T].
  • the forward recurrent neural network is: h_t^f = GRU(x_t, h_{t-1}^f), for t ∈ [1, T/2]; and the backward recurrent neural network is: h_t^b = GRU(x_t, h_{t+1}^b), for t ∈ [T, T/2].
  • h_t^f is an intermediate-layer feature representation in the forward recurrent neural network, and may finally be represented as the first representation result h_{T/2}^f.
  • h_t^b is an intermediate-layer feature representation in the backward recurrent neural network, and may finally be represented as the second representation result h_{T/2}^b.
  • GRU(·) is a gated recurrent unit function and has the specific form:
  • z_t = σ_g(W_z x_t + U_z h_{t-1} + b_z);
  • r_t = σ_g(W_r x_t + U_r h_{t-1} + b_r);
  • h_t = z_t ∘ h_{t-1} + (1 − z_t) ∘ σ_h(W_t x_t + U_h(r_t ∘ h_{t-1}) + b_h),
  • where σ_g represents a sigmoid function,
  • σ_h represents an arc-tangent function,
  • z_t and r_t are the update gate and the reset gate, respectively,
  • W_z, W_r, W_t, U_z, U_r, and U_h are all linear transformation parameter matrices, with different subscripts respectively corresponding to different "gates", and
  • b_z, b_r, and b_h are offset parameter vectors, and ∘ represents element-wise multiplication.
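  • For reference, the gated recurrent unit defined by the formulas above can be written out directly. The following NumPy sketch is an illustration under those definitions; the parameter shapes and initialization are assumptions, and σ_h is configurable (the disclosure names an arc-tangent function, while tanh is the more common choice and is used as the default here).

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

class GRUCell:
    """Gated recurrent unit following the formulas above; shapes are illustrative."""

    def __init__(self, input_dim, hidden_dim, sigma_h=np.tanh, seed=0):
        # sigma_h: the disclosure names an arc-tangent function; np.tanh is the common choice.
        rng = np.random.default_rng(seed)
        def mat(rows, cols):
            return 0.1 * rng.standard_normal((rows, cols))
        # Linear transformation parameter matrices and offset vectors for each "gate".
        self.W_z, self.U_z, self.b_z = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.W_r, self.U_r, self.b_r = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.W_t, self.U_h, self.b_h = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.sigma_h = sigma_h

    def __call__(self, x_t, h_prev):
        z_t = sigmoid(self.W_z @ x_t + self.U_z @ h_prev + self.b_z)      # update gate
        r_t = sigmoid(self.W_r @ x_t + self.U_r @ h_prev + self.b_r)      # reset gate
        h_tilde = self.sigma_h(self.W_t @ x_t + self.U_h @ (r_t * h_prev) + self.b_h)
        return z_t * h_prev + (1.0 - z_t) * h_tilde                       # new state h_t

# One recurrence step on an illustrative 1024-dimensional frame feature.
cell = GRUCell(input_dim=1024, hidden_dim=256)
h = cell(np.random.rand(1024), np.zeros(256))
print(h.shape)   # (256,)
```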
  • the first representation result and the second representation result may be spliced to obtain a feature representation result corresponding to this scale, namely:
  • h = [h_{T/2}^f, h_{T/2}^b].
  • the prediction result can be obtained by calculation by using related formulas, to provide feasible manners for implementation of the solution, thereby improving the feasibility and operability of the solution.
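  • The bidirectional compression toward the time center point can likewise be sketched in a few lines. The following PyTorch illustration runs one GRU forward over t ∈ [1, T/2] and another backward over t ∈ [T, T/2] and splices the two center states into h = [h_{T/2}^f, h_{T/2}^b]; the hidden size and feature dimension are illustrative, and torch.nn.GRUCell stands in for the GRU described above.

```python
import torch
import torch.nn as nn

def compress_sequence(x: torch.Tensor, hidden_dim: int = 256) -> torch.Tensor:
    """Compress a video frame feature sequence x of shape (T, D) into a single
    representation h = [h_{T/2}^f, h_{T/2}^b]: a forward GRU runs over the first
    half of the sequence and a backward GRU over the second half, both stopping
    at the time center point, and the two center states are spliced."""
    T, D = x.shape
    mid = T // 2
    fwd, bwd = nn.GRUCell(D, hidden_dim), nn.GRUCell(D, hidden_dim)

    h_f = torch.zeros(1, hidden_dim)
    for t in range(0, mid):                       # forward direction, t in [1, T/2]
        h_f = fwd(x[t].unsqueeze(0), h_f)         # h_t^f = GRU(x_t, h_{t-1}^f)

    h_b = torch.zeros(1, hidden_dim)
    for t in range(T - 1, mid - 1, -1):           # backward direction, t in [T, T/2]
        h_b = bwd(x[t].unsqueeze(0), h_b)         # h_t^b = GRU(x_t, h_{t+1}^b)

    return torch.cat([h_f, h_b], dim=1).squeeze(0)   # splice: h = [h_{T/2}^f, h_{T/2}^b]

# Example: a video frame feature sequence of length 10 with 1024-dimensional features.
h = compress_sequence(torch.randn(10, 1024))
print(h.shape)   # torch.Size([512])
```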
  • the processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence may include: inputting the feature representation result corresponding to the at least one video frame feature sequence into a first sub-model in the second neural network model, to obtain a third representation result; inputting the feature representation result corresponding to the at least one video frame feature sequence into a second sub-model in the second neural network model, to obtain a fourth representation result; and calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result.
  • FIG. 6 is a schematic structural diagram of a second neural network model in an embodiment of the present disclosure.
  • the entire second neural network model includes two parts: respectively a first sub-model and a second sub-model.
  • the first sub-model may also be referred to as “gate representation”, and the second sub-model may also be referred to as “activation representation”.
  • a feature representation result corresponding to each video frame feature sequence is input to the “gate representation”, and then a corresponding third representation result is outputted.
  • a feature representation result corresponding to each video frame feature sequence is input to the “activation representation”, and then a corresponding fourth representation result is outputted.
  • Each third representation result is multiplied by the corresponding fourth representation result, and the products are then summed to obtain the prediction result of the video frame feature sequence.
  • the second neural network model may be further used to classify the feature representation result.
  • non-linear transformation may be performed on the feature representation result to obtain gate representation and activation representation respectively, and then a multiplication operation is performed on the two paths of representations and addition is performed, to obtain a final feature representation for classification, thereby facilitating improvement of the classification accuracy.
  • the calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result may include: calculating the prediction result corresponding to the at least one video frame feature sequence by using the following formulas:
  • label = Σ_{n=1}^{N} g_n · a_n;
  • g_n = σ_g(W_g h + b_g), for n ∈ [1, N];
  • a_n = σ_a(W_a h + b_a), for n ∈ [1, N],
  • where label represents the prediction result of a video frame feature sequence,
  • g_n represents the third representation result,
  • a_n represents the fourth representation result,
  • σ_g represents a softmax function,
  • σ_a represents a sigmoid function,
  • h represents the feature representation result of the video frame feature sequence,
  • W_g and b_g represent parameters of the first sub-model,
  • W_a and b_a represent parameters of the second sub-model,
  • N represents the total number of paths obtained after non-linear transformation is performed on the feature representation result, and
  • n represents an integer in a range of 1 to N.
  • N paths of gate representation and activation representation are obtained by performing non-linear transformation on the feature representation result, and then a third representation result g_n corresponding to the gate representation and a fourth representation result a_n corresponding to the activation representation are calculated.
  • the order in which the third representation result g_n and the fourth representation result a_n are calculated is not limited herein.
  • a multiplication operation is performed, and then an addition operation is performed, to obtain a prediction result of a video frame feature sequence.
  • the prediction result can be obtained by calculation by using related formulas, to provide feasible manners for implementation of the solution, thereby improving the feasibility and operability of the solution.
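  • The gate/activation (mixed expert) computation can be illustrated with the following NumPy sketch, based on the formulas above. The layout that maps h to an N × (number of categories) grid, with the softmax taken over the N expert paths for each category, is an assumption made for illustration; all shapes are illustrative.

```python
import numpy as np

def softmax(v, axis=0):
    e = np.exp(v - v.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mixed_expert_predict(h, W_g, b_g, W_a, b_a):
    """g_n = softmax(W_g h + b_g) is the gate representation, a_n = sigmoid(W_a h + b_a)
    is the activation representation, and the prediction result is sum_n g_n * a_n."""
    g = softmax(W_g @ h + b_g, axis=0)      # (N, num_categories) gate representation
    a = sigmoid(W_a @ h + b_a)              # (N, num_categories) activation representation
    return (g * a).sum(axis=0)              # (num_categories,) prediction result

# Illustrative shapes: a 512-dimensional feature representation, N = 4 paths, 5 categories.
rng = np.random.default_rng(0)
h = rng.standard_normal(512)
N, C = 4, 5
W_g, b_g = rng.standard_normal((N, C, 512)), np.zeros((N, C))
W_a, b_a = rng.standard_normal((N, C, 512)), np.zeros((N, C))
prediction = mixed_expert_predict(h, W_g, b_g, W_a, b_a)
print(prediction.shape)   # (5,)
```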
  • the method may further include: calculating the category of the to-be-processed video according to the prediction result corresponding to the at least one video frame feature sequence and a weight value corresponding to the at least one video frame feature sequence; and classifying the to-be-processed video according to the category of the to-be-processed video.
  • the server may further calculate the category of the to-be-processed video according to the prediction result corresponding to each video frame feature sequence, and a weight value corresponding to each video frame feature sequence, and classify the to-be-processed video according to a classification result.
  • For example, assume that the maximum number of prediction results is five, and a prediction result is indicated by a 0/1 code with a length of 5.
  • If the code of prediction result 1 is 00001 and the code of prediction result 3 is 00100, the to-be-processed video is indicated as 00101.
  • In an actual case, each prediction result is a value not greater than 1, and indicates the possibility that the to-be-processed video belongs to the corresponding category.
  • ⁇ 0.01, 0.02, 0.9, 0.005, 1.0 ⁇ is a reasonable prediction result, and it means that a probability that the to-be-processed video belongs to the first category is 1.0, namely, 100%, a probability that the to-be-processed video belongs to the second category is 0.005, namely, 0.5%, a probability that the to-be-processed video belongs to the third category is 0.9, namely, 90%, a probability that the to-be-processed video belongs to the fourth category is 0.02, namely, 2%, and a probability that the to-be-processed video belongs to the fifth category is 0.01, namely, 1%.
  • The prediction results are combined by using preset weight values, that is, a weighted calculation is performed.
  • Each weight value is learned by using linear regression, is a single value indicating the importance of the corresponding video frame feature sequence, and the weight values sum to 1, for example, {0.1, 0.4, 0.5}. How to calculate the category of the to-be-processed video is specifically described below.
  • the category of the to-be-processed video is then indicated by the weighted sum of the prediction results, that is, the prediction result of each video frame feature sequence is multiplied by its weight value and the products are added.
  • In this example, the probability that the to-be-processed video belongs to the third category is the largest, and the probability that it belongs to the first category is the second largest. Therefore, the to-be-processed video is preferentially displayed in the video list of the third category.
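  • The weighted fusion across time scales can be sketched as follows (NumPy). The weight values {0.1, 0.4, 0.5} come from the example above; the third prediction vector restates the example probabilities in category order (first category to fifth category), while the first two vectors are made-up illustrative values, and all vectors are read left-to-right as categories 1 to 5 for simplicity.

```python
import numpy as np

# Prediction results for three time scales over five categories. The third row
# restates the example probabilities above in category order (first to fifth);
# the first two rows are made-up illustrative values.
predictions = np.array([
    [0.20, 0.100, 0.90, 0.01, 0.05],
    [0.30, 0.050, 0.95, 0.02, 0.10],
    [1.00, 0.005, 0.90, 0.02, 0.01],
])
weights = np.array([0.1, 0.4, 0.5])              # learned by linear regression, sum to 1

category_scores = weights @ predictions          # weighted sum over the time scales
ranking = np.argsort(category_scores)[::-1]      # categories ordered by fused score
print(category_scores)                           # [0.64   0.0325 0.92   0.019  0.05  ]
print(ranking + 1)                               # [3 1 5 2 4] -> third category first, then the first
```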
  • the server may further calculate the category of the to-be-processed video according to the prediction result corresponding to each video frame feature sequence and the weight value corresponding to each video frame feature sequence, and finally classify the to-be-processed video according to the category of the to-be-processed video.
  • Because the prediction result takes the time feature into account, the video classification capability can be improved, which enables personalized recommendation and provides better practicability.
  • FIG. 7 is a schematic diagram of an embodiment of a server in an embodiment of the present disclosure.
  • the server 20 includes: a first obtaining module 201 , a second obtaining module 202 , a first input module 203 , and a second input module 204 .
  • the first obtaining module 201 is configured to obtain a to-be-processed video, the to-be-processed video including a plurality of video frames, and each video frame corresponding to a time feature.
  • the second obtaining module 202 is configured to: sample, according to a time-feature sampling rule, the to-be-processed video obtained by the first obtaining module 201 , and obtain at least one video frame feature sequence, the time-feature sampling rule being a correspondence between a time feature and a video frame feature sequence.
  • the first input module 203 is configured to process, by using a first neural network model, the at least one video frame feature sequence obtained by the second obtaining module 202 , to obtain a feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model.
  • the second input module 204 is configured to process, by using a second neural network model, the feature representation result that corresponds to the at least one video frame feature sequence and that is obtained by the first input module 203 , to obtain a prediction result corresponding to the at least one video frame feature sequence, the prediction result being used to determine a category of the to-be-processed video.
  • the first obtaining module 201 obtains the to-be-processed video, the to-be-processed video including a plurality of video frames, and each video frame corresponding to a time feature; the second obtaining module 202 samples, according to the time-feature sampling rule, the to-be-processed video obtained by the first obtaining module 201 , and obtains the at least one video frame feature sequence, the time-feature sampling rule being the correspondence between the time feature and the video frame feature sequence.
  • the first input module 203 processes, by using the first neural network model, the at least one video frame feature sequence obtained by the second obtaining module 202 , to obtain the feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model.
  • the second input module 204 processes, by using the second neural network model, the feature representation result corresponding to the at least one video frame feature sequence obtained by the first input module 203 , to obtain the prediction result corresponding to the at least one video frame feature sequence, the prediction result being used to determine the category of the to-be-processed video.
  • a server obtains the to-be-processed video, the to-be-processed video including a plurality of video frames, and each video frame corresponding to a time feature, and then samples the to-be-processed video according to the time-feature sampling rule, and obtains the at least one video frame feature sequence, the time-feature sampling rule being the correspondence between the time feature and the video frame feature sequence.
  • the server then inputs the at least one video frame feature sequence into the first neural network model, to obtain the feature representation result corresponding to each video frame feature sequence.
  • the server inputs the feature representation result corresponding to each video frame feature sequence into the second neural network model, to obtain the prediction result corresponding to each video frame feature sequence, the prediction result being used to determine the category of the to-be-processed video.
  • a feature change of the video in a time dimension is also considered, so that video content can be better represented, the accuracy of video classification is improved, and the effect of video classification is improved.
  • the server 20 further includes a processing module 205 , and a determining module 206 .
  • the processing module 205 is configured to process each video frame in the to-be-processed video by using a convolutional neural network CNN, to obtain a time feature corresponding to each video frame after the first obtaining module 201 obtains the to-be-processed video.
  • the determining module 206 is configured to determine a time feature sequence of the to-be-processed video according to the time feature corresponding to each video frame processed by the processing module 205 , the time feature sequence being configured for sampling.
  • the server may further process each video frame in the to-be-processed video by using a convolutional neural network, and obtain a time feature corresponding to each video frame.
  • the time features are used to form a time feature sequence of the entire to-be-processed video.
  • each video frame is trained and processed by using the convolutional neural network, to facilitate improvement of the accuracy and effect of time feature extraction.
  • the second obtaining module 202 includes: a determining unit 2021 configured to determine at least one time-window according to the time-feature sampling rule, each time-window including at least one video frame of the to-be-processed video; and an extraction unit 2022 configured to extract, from the time feature sequence, a video frame feature sequence corresponding to each time-window determined by the determining unit 2021 .
  • a method for extracting video frame feature sequences in different scales is described. That is, at least one time-window is first determined according to the time-feature sampling rule, and each time-window includes at least one video frame in the to-be-processed video, and then a video frame feature sequence corresponding to each time-window is extracted from the time feature sequence.
  • video frame feature sequences in different scales can be obtained, to obtain a plurality of different samples for feature training. In this way, the accuracy of a video classification result is improved.
  • the first input module 203 includes: a first obtaining unit 2031 , a second obtaining unit 2032 , and a first calculation unit 2033 .
  • the first obtaining unit 2031 is configured to input the at least one video frame feature sequence into a forward recurrent neural network in the first neural network model, to obtain a first representation result.
  • the second obtaining unit 2032 is configured to input each video frame feature sequence into a backward recurrent neural network in the first neural network model, to obtain a second representation result.
  • the first calculation unit 2033 is configured to calculate a feature representation result corresponding to the at least one video frame feature sequence according to the first representation result obtained by the first obtaining unit 2031 and the second representation result obtained by the second obtaining unit 2032 .
  • time sequence modeling may be performed on the video frame feature sequence by using a gated recurrent unit (GRU) based recurrent neural network.
  • a first neural network model can also be used to perform video feature compression.
  • a bidirectional recurrent neural network is used to perform feature compression and representation respectively from forward and backward directions toward a time center location of the to-be-processed video. In this way, the operability of the solution is improved.
  • the first calculation unit 2033 includes: a first calculation subunit 20331 , configured to calculate the feature representation result corresponding to the at least one video frame feature sequence by using the following formulas:
  • h = [h_{T/2}^f, h_{T/2}^b];
  • h_t^f = GRU(x_t, h_{t-1}^f), for t ∈ [1, T/2]; and
  • h_t^b = GRU(x_t, h_{t+1}^b), for t ∈ [T, T/2],
  • where h represents a feature representation result of a video frame feature sequence,
  • h_{T/2}^f represents the first representation result,
  • h_{T/2}^b represents the second representation result,
  • x_t represents the video frame feature sequence at the t-th moment,
  • GRU(·) represents use of a gated recurrent unit (GRU) for neural network processing,
  • T represents the total time of the to-be-processed video, and
  • t represents an integer in a range of 1 to T.
  • the prediction result can be obtained by calculation by using related formulas, to provide feasible manners for implementation of the solution, thereby improving the feasibility and operability of the solution.
  • the second input module 204 includes: a third obtaining unit 2041 , a fourth obtaining unit 2042 , and a second calculation unit 2043 .
  • the third obtaining unit 2041 is configured to input the feature representation result corresponding to each video frame feature sequence into a first sub-model in the second neural network model, to obtain a third representation result.
  • the fourth obtaining unit 2042 is configured to input the feature representation result corresponding to each video frame feature sequence into a second sub-model in the second neural network model, to obtain a fourth representation result.
  • the second calculation unit 2043 is configured to calculate a prediction result corresponding to each video frame feature sequence according to the third representation result obtained by the third obtaining unit 2041 and the fourth representation result obtained by the fourth obtaining unit 2042 .
  • the second neural network model may be further used to classify the feature representation result.
  • non-linear transformation may be performed on the feature representation result to obtain gate representation and activation representation respectively, and then a multiplication operation is performed on the two paths of representations and addition is performed, to obtain a final feature representation for classification, thereby facilitating improvement of the classification accuracy.
  • the second calculation unit 2043 includes a second calculation subunit 20431, configured to calculate the prediction result corresponding to each video frame feature sequence by using the following formulas:
  • label = Σ_{n=1}^{N} g_n · a_n;
  • g_n = σ_g(W_g h + b_g), for n ∈ [1, N];
  • a_n = σ_a(W_a h + b_a), for n ∈ [1, N],
  • where label represents the prediction result of a video frame feature sequence,
  • g_n represents the third representation result,
  • a_n represents the fourth representation result,
  • σ_g represents a softmax function,
  • σ_a represents a sigmoid function,
  • h represents the feature representation result of the video frame feature sequence,
  • W_g and b_g represent parameters of the first sub-model,
  • W_a and b_a represent parameters of the second sub-model,
  • N represents the total number of paths obtained after non-linear transformation is performed on the feature representation result, and
  • n represents an integer in a range of 1 to N.
  • the prediction result can be obtained by calculation by using related formulas, to provide feasible manners for implementation of the solution, thereby improving the feasibility and operability of the solution.
  • the server 20 further includes a calculation module 207 and a classification module 208 .
  • the calculation module 207 is configured to calculate the category of the to-be-processed video according to the prediction result corresponding to the at least one video frame feature sequence and a weight value corresponding to the at least one video frame feature sequence after the second input module 204 processes, by using the second neural network model, the feature representation result corresponding to the at least one video frame feature sequence, to obtain the prediction result corresponding to the at least one video frame feature sequence.
  • the classification module 208 is configured to classify the to-be-processed video according to the category that is of the to-be-processed video and that is calculated by the calculation module 207 .
  • the server may further calculate the category of the to-be-processed video according to the prediction result corresponding to each video frame feature sequence and the weight value corresponding to each video frame feature sequence, and finally classify the to-be-processed video according to the category of the to-be-processed video.
  • Because the prediction result takes the time feature into account, the video classification capability can be improved, which enables personalized recommendation and provides better practicability.
  • FIG. 15 is a schematic structural diagram of a server according to an embodiment of the present disclosure.
  • the server 300 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPU) 322 (for example, one or more processors) and a memory 332 , and one or more storage media 330 (for example, one or more mass storage devices) that store an application program 342 or data 344 .
  • the memory 332 and the storage medium 330 may be transient storages or persistent storages.
  • the program stored in the storage medium 330 may include one or more modules (not marked in the figure), and each module may include a series of instructions and operations for the server.
  • the central processing unit 322 may be configured to communicate with the storage medium 330 , and perform, on the server 300 , a series of instructions and operations in the storage medium 330 .
  • the server 300 may further include one or more power supplies 326 , one or more wired or wireless network interfaces 350 , one or more input/output interfaces 358 , and/or one or more operating systems 341 , for example, Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, and FreeBSDTM.
  • the CPU 322 included in the server has the following functions: obtaining a to-be-processed video, the to-be-processed video including a plurality of video frames, and each video frame corresponding to a time feature; sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence, the time-feature sampling rule being a correspondence between a time feature and a video frame feature sequence; processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model; and processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence, the prediction result being used to determine a category of the to-be-processed video.
  • the CPU 322 is further configured to execute the following operations: processing each video frame in the to-be-processed video by using a convolutional neural network CNN, to obtain a time feature corresponding to each video frame; and determining a time feature sequence of the to-be-processed video according to the time feature corresponding to each video frame, the time feature sequence being configured for sampling.
  • the CPU 322 is specifically configured to execute the following operations: determining at least one time-window according to the time-feature sampling rule, each time-window including at least one video frame of the to-be-processed video; and extracting, from the time feature sequence, a video frame feature sequence corresponding to each time-window.
  • the CPU 322 is specifically configured to execute the following operations: inputting the at least one video frame feature sequence into a forward recurrent neural network in the first neural network model, to obtain a first representation result; inputting the at least one video frame feature sequence into a backward recurrent neural network in the first neural network model, to obtain a second representation result; and calculating a feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result.
  • the CPU 322 is specifically configured to execute the following step: calculating the feature representation result corresponding to the at least one video frame feature sequence by using the following formulas:
  • h = [h_{T/2}^f, h_{T/2}^b];
  • h_t^f = GRU(x_t, h_{t-1}^f), for t ∈ [1, T/2]; and
  • h_t^b = GRU(x_t, h_{t+1}^b), for t ∈ [T, T/2],
  • where h represents a feature representation result of a video frame feature sequence,
  • h_{T/2}^f represents the first representation result,
  • h_{T/2}^b represents the second representation result,
  • x_t represents the video frame feature sequence at the t-th moment,
  • GRU(·) represents use of a gated recurrent unit (GRU) for neural network processing,
  • T represents the total time of the to-be-processed video, and
  • t represents an integer in a range of 1 to T.
  • the CPU 322 is specifically configured to execute the following operations: inputting the feature representation result corresponding to the at least one video frame feature sequence into a first sub-model in the second neural network model, to obtain a third representation result; inputting the feature representation result corresponding to the at least one video frame feature sequence into a second sub-model in the second neural network model, to obtain a fourth representation result; and calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result.
  • the CPU 322 is specifically configured to execute the following step: calculating the prediction result corresponding to the at least one video frame feature sequence by using the following formulas:
  • label=Σ n=1 N g n a n ;
  • g n =σ g (W g h+b g ), n∈[1,N];
  • a n =σ a (W a h+b a ), n∈[1,N];
  • where label represents a prediction result of a video frame feature sequence, g n represents the third representation result, a n represents the fourth representation result, σ g represents a softmax function, σ a represents a sigmoid function, h represents the feature representation result of the video frame feature sequence, W g and b g represent parameters of the first sub-model, W a and b a represent parameters of the second sub-model, N represents the total number of calculations obtained after non-linear transformation is performed on the feature representation result, and n represents an integer in a range of 1 to N.
  • the CPU 322 is further configured to execute the following operations: calculating the category of the to-be-processed video according to the prediction result corresponding to the at least one video frame feature sequence and a weight value corresponding to the at least one video frame feature sequence; and classifying the to-be-processed video according to the category of the to-be-processed video.
  • An embodiment of the present disclosure further provides a storage medium, for storing program code, the program code being configured to execute any implementation of the information processing method according to the foregoing embodiments.
  • the implementation may be performed entirely or partially by using software, hardware, firmware, or any combination thereof.
  • when software is used, the implementation may be performed entirely or partially in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
  • the computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired (for example, a coaxial cable, an optical fiber or a digital subscriber line (DSL)) or wireless (for example, infrared, wireless or microwave) manner.
  • the computer readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a digital versatile disc (DVD)), a semiconductor medium (such as a solid-state disk (SSD)) or the like.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the described apparatus embodiment is merely an example.
  • the unit division is merely logical function division, and another division manner may be used in actual implementation.
  • a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces.
  • the indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
  • the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some of or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
  • the integrated units may be implemented in a form of hardware or may be implemented in a form of a software functional unit.
  • When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, the integrated unit may be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure.
  • the foregoing storage medium includes: any medium that can store program code, such as a USB flash memory drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Library & Information Science (AREA)
  • Algebra (AREA)
  • Image Analysis (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A video classification method is provided for a computer device. The method includes obtaining a to-be-processed video, where the to-be-processed video has a plurality of video frames, and each video frame corresponds to one time feature; and sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence. The time-feature sampling rule is a correspondence between time features and video frame feature sequences. The method also includes processing the video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the video frame feature sequence, where the first neural network model is a recurrent neural network model; and processing the feature representation result by using a second neural network model, to obtain a prediction result corresponding to the video frame feature sequence.

Description

    RELATED APPLICATIONS
  • This application is a continuation application of PCT Patent Application No. PCT/CN2018/100733, filed on Aug. 16, 2018, which claims priority to Chinese Patent Application No. 2017108336688, filed with the China National Intellectual Property Administration on Sep. 15, 2017 and entitled “VIDEO CLASSIFICATION METHOD, INFORMATION PROCESSING METHOD, AND SERVER”, which is incorporated herein by reference in its entirety.
  • FIELD OF THE TECHNOLOGY
  • The present disclosure relates to the field of computer technologies and, in particular, to a video classification technology.
  • BACKGROUND OF THE DISCLOSURE
  • With the rapid development of network multimedia technologies, various multimedia information constantly emerges. A growing number of users are used to watching videos on the network. To enable the users to select, from a large number of videos, content that the users want to watch, a server usually classifies the videos, and video classification has great significance for implementing video management and interest recommendation.
  • A currently used video classification method mainly includes: first performing feature extraction on each video frame in a to-be-marked video, and then converting, by using an average feature method, a frame-level feature into a video-level feature, and finally, transmitting the video-level feature into a classification network for classification.
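  • As a point of reference only, this frame-averaging conversion can be written as a short NumPy sketch; the feature dimensions used here are illustrative assumptions, not values from this disclosure.

```python
import numpy as np

# Frame-averaging baseline: frame-level CNN features are simply averaged into
# a single video-level feature before being fed to a classification network.
frame_features = np.random.rand(1000, 2048)   # 1000 frames, 2048-d feature per frame
video_feature = frame_features.mean(axis=0)   # (2048,) video-level feature
```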
  • However, in the current video classification method, converting the frame-level features only by averaging them is relatively crude; during the video classification process, the impact of changes in other dimensions, such as the time dimension, on video frame feature conversion is often ignored, so the accuracy of video classification is unsatisfactory.
  • SUMMARY
  • The embodiments of the present disclosure provide a video classification method, an information processing method, and a server. In the process of classifying a video, the feature change of the video in the time dimension is also considered, so that video content can be better represented, the accuracy of video classification is improved, and the effect of video classification is improved.
  • In one aspect of the present disclosure, a video classification method is provided for a computer device. The method includes obtaining a to-be-processed video, where the to-be-processed video has a plurality of video frames, and each video frame corresponds to one time feature; sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence, where the time-feature sampling rule is a correspondence between time features and video frame feature sequences. The method also includes processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, where the first neural network model is a recurrent neural network model; and processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence.
  • In another aspect of the present disclosure, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium stores computer program instructions executable by at least one processor to perform: obtaining a to-be-processed video, the to-be-processed video having a plurality of video frames, and each video frame corresponding to one time feature; sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence, the time-feature sampling rule being a correspondence between time features and video frame feature sequences; processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model; and processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence.
  • In another aspect of the present disclosure, a server is provided. The server includes a memory storing computer program instructions; and a processor coupled to the memory. When executing the computer program instructions, the processor is configured to perform: obtaining a to-be-processed video, the to-be-processed video having a plurality of video frames, and each video frame corresponding to one time feature; sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence, the time-feature sampling rule being a correspondence between time features and video frame feature sequences; processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model; and processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a schematic architectural diagram of information processing according to an embodiment of the present disclosure;
  • FIG. 2 is a schematic diagram of an information processing method according to an embodiment of the present disclosure;
  • FIG. 3 is a schematic diagram of a to-be-processed video according to an embodiment of the present disclosure;
  • FIG. 4 is a schematic diagram of a convolutional neural network having an inception structure according to an embodiment of the present disclosure;
  • FIG. 5 is a schematic structural diagram of a first neural network model according to an embodiment of the present disclosure;
  • FIG. 6 is a schematic structural diagram of a second neural network model according to an embodiment of the present disclosure;
  • FIG. 7 is a schematic diagram of a server according to an embodiment of the present disclosure;
  • FIG. 8 is a schematic diagram of another server according to an embodiment of the present disclosure;
  • FIG. 9 is a schematic diagram of another server according to an embodiment of the present disclosure;
  • FIG. 10 is a schematic diagram of another server according to an embodiment of the present disclosure;
  • FIG. 11 is a schematic diagram of another server according to an embodiment of the present disclosure;
  • FIG. 12 is a schematic diagram of another server according to an embodiment of the present disclosure;
  • FIG. 13 is a schematic diagram of another server according to an embodiment of the present disclosure;
  • FIG. 14 is a schematic diagram of another server according to an embodiment of the present disclosure; and
  • FIG. 15 is a schematic structural diagram of a server according to an embodiment of the present disclosure.
  • DETAILED DESCRIPTION
  • The embodiments of the present disclosure provide a video classification method, an information processing method, and a server. In the process of classifying a video, the feature change of the video in a time dimension is also considered, so that video content can be better represented, the accuracy of video classification is improved, and the effect of video classification is improved.
  • In the specification, claims, and accompanying drawings of the present disclosure, the terms "first", "second", "third", "fourth", and so on (if any) are intended to distinguish similar objects but do not necessarily indicate a specific order or sequence. It is to be understood that the data used in this way are interchangeable in appropriate cases, so that the embodiments of the present disclosure described herein can, for example, be implemented in an order other than those shown or described herein. Moreover, the terms "include", "contain", and any other variants are meant to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units that are expressly listed, but may include other steps or units not expressly listed or inherent to such a process, method, product, or device.
  • It is to be understood that the solutions in the present disclosure may be mainly used to provide a video content classification service. A backend computer device performs feature extraction, time sequence modeling, and feature compression on a video, and finally classifies the video features by using a mixed expert model, so that automatic classification and labeling of the video are implemented on the computer device. Such solutions may be deployed on a video website to add keywords for the videos on the website, which also facilitates quick search, content matching, and personalized video recommendation.
  • FIG. 1 is a schematic architectural diagram of information processing according to an embodiment of the present disclosure. As shown in FIG. 1, first, a computer device obtains a to-be-processed video. It can be learned from FIG. 1 that the to-be-processed video includes a plurality of video frames, and each video frame corresponds to a time feature, and different time features may be represented by t. Next, the computer device processes each video frame in the to-be-processed video by using a convolutional neural network, to obtain a time feature corresponding to each video frame. Then, the computer device determines a time feature sequence of the to-be-processed video according to the time feature corresponding to each video frame. The time feature sequence is deep learning representation at the frame level.
  • Further, continuing to refer to FIG. 1, the computer device may sample the to-be-processed video according to a time-feature sampling rule. The time-feature sampling rule refers to sampling video features at different frame rates in a time dimension, and obtaining at least one video frame feature sequence. The video frame feature sequences correspond to different time scales. The computer device inputs video frame feature sequences corresponding to different time scales into bidirectional recurrent neural networks, respectively, to obtain a feature representation result corresponding to at least one video frame feature sequence. The feature representation result is video feature representation in a time scale. Finally, the computer device inputs all feature representation results into a second neural network, namely, the mixed expert model, and obtains a prediction result corresponding to each video frame feature sequence, and can determine a category of the to-be-processed video according to the prediction results, to classify the to-be-processed video.
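  • The overall flow in FIG. 1 can be summarized with the following minimal Python sketch. It is only an orientation aid: the helper names (sample_by_windows, bidirectional_gru, gate_activation, scale_weights) and the window sizes are illustrative assumptions rather than names defined in this disclosure.

```python
import numpy as np

def classify_video(frame_features, window_sizes, models):
    """Hypothetical end-to-end flow of FIG. 1 (names are illustrative only).

    frame_features: (T, D) array holding one CNN feature vector per video frame.
    """
    predictions = []
    for w in window_sizes:                                   # e.g. (1, 5, 10): three time scales
        seq = models.sample_by_windows(frame_features, w)    # multi-scale sampling
        h = models.bidirectional_gru(seq)                    # first neural network model
        predictions.append(models.gate_activation(h))        # second neural network model
    # Fuse the per-scale prediction results with learned weights that sum to 1.
    return np.tensordot(models.scale_weights, np.stack(predictions), axes=1)
```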
  • In common video data, a user usually describes and comments video information and provides personalized label data, to form rich text information related to online videos. The text information may also be used as basis for video classification.
  • The following describes the information processing method in the present disclosure by using a server as an execution entity. It is to be understood that the information processing method in the present disclosure not only can be applied to the server, but also can be applied to any other computer device. Referring to FIG. 2, the information processing method in one embodiment of the present disclosure includes:
  • 101: Obtain a to-be-processed video, the to-be-processed video including a plurality of video frames, and each video frame corresponding to a time feature.
  • In one embodiment, the server first obtains the to-be-processed video. Specifically, refer to FIG. 3. FIG. 3 is a schematic diagram of the to-be-processed video in one embodiment of the present disclosure. The to-be-processed video includes a plurality of video frames. For example, each picture in FIG. 3 is a video frame, and each video frame corresponds to a time feature.
  • The to-be-processed video corresponds to a period of play time. Therefore, each video frame corresponds to a different play moment. Assuming that a time feature of a first video frame in the to-be-processed video is "1" and a time feature of a second video frame is "2", and so on, a time feature of a Tth video frame is "T".
  • 102: Sample the to-be-processed video according to a time-feature sampling rule, and obtain at least one video frame feature sequence, the time-feature sampling rule being a correspondence between a time feature and a video frame feature sequence.
  • In one embodiment, the server samples the to-be-processed video according to the time-feature sampling rule. The time-feature sampling rule includes a preset relationship between a time feature and a video frame feature sequence. In actual application, one video frame feature sequence may be obtained, or video frame feature sequences of at least two different time scales may be obtained. For the video frame feature sequences corresponding to different time scales, the number of time features corresponding to each included video frame feature is different, and correspondingly, lengths of the video frame feature sequences corresponding to different time scales are also different.
  • For example, one to-be-processed video has a total of 1000 video frames, and the 1000 video frames respectively correspond to 1 to 1000 time features. If the time-feature sampling rule is that each time feature corresponds to one video frame feature, 1000 time features of the to-be-processed video correspond to 1000 video frame features. Correspondingly, the length of the video frame feature sequence formed by the 1000 video frame features is 1000. If the time-feature sampling rule is that every 100 time features correspond to one video frame feature, the 1000 time features of the to-be-processed video correspond to 10 video frame features. Correspondingly, the length of the video frame feature sequence formed by the 10 video frames is 10, and so on, and details are not described herein.
  • 103: Process the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, each video frame feature sequence corresponding to a feature representation result.
  • In one embodiment, after obtaining the at least one video frame feature sequence, the server may separately input video frame feature sequences corresponding to different time scales into the first neural network model. The first neural network model is a recurrent neural network model. The first neural network model then processes the at least one input video frame feature sequence recursively and correspondingly outputs a feature representation result for each video frame feature sequence.
  • Different time scales correspond to different lengths of the video frame feature sequences. As described in Step 102, assuming that the total length of the video is T, if each time feature corresponds to a video frame feature, the length of the video frame feature sequence is T/1. If every 10 time features correspond to a video frame feature, the length of the video frame feature sequence is T/10.
  • 104: Process the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence, the prediction result being used to determine a category of the to-be-processed video. Each video frame feature sequence corresponds to a prediction result. In one embodiment, the server may separately input the feature representation result corresponding to each video frame feature sequence into the second neural network model, and then after processing each input feature representation result by using the second neural network model, the server outputs the prediction result corresponding to each feature representation result. Finally, the server may determine the category of the to-be-processed video according to the prediction result.
  • It may be understood that the category of the to-be-processed video may be “sports”, “news”, “music”, “animation”, “game”, or the like, and is not limited herein.
  • In one embodiment of the present disclosure, an information processing method is provided. First, the server obtains the to-be-processed video, the to-be-processed video including a plurality of video frames, and each video frame corresponding to a time feature, and then samples the to-be-processed video according to the time-feature sampling rule, and obtains the at least one video frame feature sequence, the time-feature sampling rule being the correspondence between the time feature and the video frame feature sequence. The server then inputs the at least one video frame feature sequence into the first neural network model, to obtain the feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model. Finally, the server inputs the feature representation result corresponding to the at least one video frame feature sequence into the second neural network model, to obtain the prediction result corresponding to each video frame feature sequence, the prediction result being used to determine the category of the to-be-processed video. In the foregoing manner, in a process of classifying a video, a feature change of the video in a time dimension is also considered, so that video content can be better represented, the accuracy of video classification is improved, and the effect of video classification is improved.
  • Optionally, after the process of obtaining a to-be-processed video, the method may further include: processing each video frame in the to-be-processed video by using a convolutional neural network CNN, to obtain a time feature corresponding to each video frame; and determining a time feature sequence of the to-be-processed video according to the time feature corresponding to each video frame, the time feature sequence being configured for sampling.
  • In one embodiment, after obtaining the to-be-processed video, the server may process each video frame in the to-be-processed video by using a convolutional neural network (CNN) having an inception structure, and then extract a time feature corresponding to each video frame. Finally, the server determines the time feature sequence of the to-be-processed video according to the time feature of each video frame. Assuming that the time feature of the first video frame of the to-be-processed video is 1, that of the second video frame is 2, and so on, with that of the last video frame being T, it may be determined that the time feature sequence of the to-be-processed video spans T (seconds), as illustrated in the sketch below.
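  • A minimal sketch of this per-frame feature extraction follows; the callable cnn stands for any pre-trained image network and is a hypothetical name, not part of the disclosure.

```python
import numpy as np

def frame_feature_sequence(frames, cnn):
    """Run a pre-trained CNN over every decoded video frame, keeping frame order
    so that the t-th row corresponds to the frame with time feature t.

    frames: iterable of decoded frames; cnn: hypothetical callable returning a
    1-D feature vector per frame.
    """
    return np.stack([cnn(frame) for frame in frames])   # shape (T, D)
```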
  • A CNN having an inception structure is described below. FIG. 4 is a schematic diagram of a convolutional neural network having an inception structure in an embodiment of the present disclosure. As shown in FIG. 4, the inception structure includes convolutions of three different sizes, namely, a 1×1 convolutional layer, a 3×3 convolutional layer, and a 5×5 convolutional layer, as well as a 3×3 maximum pooling layer. The final fully-connected layer is removed and replaced with a global average pooling layer (which changes the spatial size of the feature map to 1×1).
  • To enhance the network capability, the network depth and width may be increased; however, to reduce overfitting, the number of free parameters also needs to be limited. Therefore, in a same layer of the inception structure, three convolution templates of different sizes, the 1×1 convolutional layer, the 3×3 convolutional layer, and the 5×5 convolutional layer, perform feature extraction at different scales and together form a mixed model. The maximum pooling layer also performs feature extraction and, unlike convolution, has no parameters and therefore does not overfit, so it is used as an additional branch. However, doing this directly causes a relatively large calculation amount for the entire network without deepening the layer. Therefore, a 1×1 convolution is first performed before the 3×3 convolution and the 5×5 convolution to reduce the number of input channels, so that the network is deepened while the calculation amount is reduced, as sketched below.
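  • For illustration only, the following PyTorch sketch shows one possible inception-style block with the four branches described above; the class name and channel sizes are assumptions and do not reproduce the exact network of this disclosure.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions plus a 3x3 max-pooling branch,
    with 1x1 "reduction" convolutions placed before the larger kernels."""
    def __init__(self, in_ch, out_1x1, red_3x3, out_3x3, red_5x5, out_5x5, out_pool):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_1x1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, red_3x3, kernel_size=1),            # channel reduction
            nn.Conv2d(red_3x3, out_3x3, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, red_5x5, kernel_size=1),            # channel reduction
            nn.Conv2d(red_5x5, out_5x5, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, out_pool, kernel_size=1))

    def forward(self, x):
        # Concatenate the four branches along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# Usage with assumed sizes: 192 input channels, 28x28 feature map.
x = torch.randn(1, 192, 28, 28)
block = InceptionBlock(192, 64, 96, 128, 16, 32, 32)
print(block(x).shape)   # torch.Size([1, 256, 28, 28])
```

  • In such a network, the fully-connected layer can be replaced by global average pooling, for example out.mean(dim=(2, 3)) applied to the final feature map.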
  • Secondly, in one embodiment of the present disclosure, after obtaining the to-be-processed video, the server may further process each video frame in the to-be-processed video by using a convolutional neural network, and obtain a time feature corresponding to each video frame. The time features are used to form a time feature sequence of the entire to-be-processed video. In the foregoing manner, each video frame is trained and processed by using the convolutional neural network, to facilitate improvement of the accuracy and effect of time feature extraction.
  • Optionally, based on the first embodiment corresponding to FIG. 2, in a second optional embodiment of the information processing method provided in one embodiment of the present disclosure, the sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence may include: determining at least one time window according to the time-feature sampling rule, each time window including at least one video frame of the to-be-processed video; and extracting, from the time feature sequence, a video frame feature sequence corresponding to each time window.
  • In one embodiment, how the server obtains the at least one video frame feature sequence is described below.
  • Specifically, at least one time-window is first defined according to the time-feature sampling rule, to sample the video frame feature sequence in a multi-scale manner. It may be assumed that the to-be-processed video has a total of T seconds, and that one video frame, five video frames, and 10 video frames are separately used as time windows; the video frames in each time window are averaged to obtain video frame feature sequences at three different scales. If the T seconds are equal to 100 frames and one frame is used as the time window, the length of the video frame feature sequence is T/1=T. If 10 frames are used as the time window, the length of the obtained video frame feature sequence is T/10. Therefore, the length of the video frame feature sequence is related to the size of the time window.
  • The size of the time window may be predefined manually. A larger number of video frames in one time-window indicates a larger granularity. An averaging operation is performed on content in each time-window, so that the content becomes content of “one frame”.
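  • The window averaging can be illustrated with the short NumPy sketch below; the function name and the 100-frame, 2048-dimension example are illustrative assumptions.

```python
import numpy as np

def sample_by_windows(frame_features, window):
    """Average the frame-level features inside each time window so that each
    window becomes the feature of "one frame" (multi-scale sampling sketch)."""
    T, D = frame_features.shape
    usable = (T // window) * window              # drop any incomplete tail window
    return frame_features[:usable].reshape(-1, window, D).mean(axis=1)

# Example: 100 frames of 2048-d features sampled at three time scales.
feats = np.random.rand(100, 2048)
for w in (1, 5, 10):
    print(w, sample_by_windows(feats, w).shape)  # (100, 2048), (20, 2048), (10, 2048)
```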
  • Further, in one embodiment of the present disclosure, a method for extracting video frame feature sequences in different time scales is described. Namely, at least one time-window is first determined according to the time-feature sampling rule, and each time-window includes at least one video frame in the to-be-processed video, and then a video frame feature sequence corresponding to each time-window is extracted from the time feature sequence. In the foregoing manner, video frame feature sequences in different scales can be obtained, to obtain a plurality of different samples for feature training. In this way, the accuracy of a video classification result is improved.
  • Optionally, in one embodiment of the present disclosure, the processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to each video frame feature sequence may include: inputting the at least one video frame feature sequence into a forward recurrent neural network in the first neural network model, to obtain a first representation result; inputting the at least one video frame feature sequence into a backward recurrent neural network in the first neural network model, to obtain a second representation result; and calculating a feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result.
  • In one embodiment, how to use the first neural network model to obtain the feature representation result corresponding to each video frame feature sequence is described below.
  • Specifically, FIG. 5 is a schematic structural diagram of a first neural network model in an embodiment of the present disclosure. As shown in FIG. 5, the entire first neural network model includes two parts, namely, a forward recurrent neural network and a backward recurrent neural network, and each video frame feature sequence is input into the forward recurrent neural network, and then a corresponding first representation result is outputted. Meanwhile, each video frame feature sequence is input into the backward recurrent neural network, and then a corresponding second representation result is outputted.
  • Finally, a feature representation result corresponding to the video frame feature sequence can be obtained by directly splicing the first representation result and the second representation result.
  • Further, in one embodiment of the present disclosure, based on extraction of the video frame feature sequence, time sequence modeling may be performed on the video frame feature sequence by using a recurrent gate unit based recurrent neural network. Further, to better perform feature representation on information of different time scales, in this solution, a first neural network model can also be used to perform video feature compression. In the foregoing manner, for the recurrent neural network, because main content of most videos occurs in the middle of video time, a bidirectional recurrent neural network is used to perform feature compression and representation respectively from forward and backward directions toward a time center point location of the to-be-processed video. In this way, the operability of the solution is improved.
  • Optionally, in one embodiment of the present disclosure, the calculating a feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result may include calculating the feature representation result corresponding to the at least one video frame feature sequence by using the following formulas:

  • h=[h T/2 f ,h T/2 b];

  • h t f =GRU(x t ,h t−1 f) for t∈[1,T/2];

  • h t b =GRU(x t ,h t+1 b) for t∈[T,T/2]
  • where h represents a feature representation result of a video frame feature sequence, hT/2 f represents the first representation result, hT/2 b represents the second representation result, xt represents the video frame feature sequence at a tth moment, GRU ( ) represents use of a gated recurrent unit GRU for neural network processing, T represents total time of the to-be-processed video, and t represents an integer in a range of 1 to T.
  • In one embodiment, a bidirectional recurrent neural network may be used to perform feature compression and representation respectively from forward and backward directions toward a video time center point location. Specifically, for a video frame feature sequence xt of a particular scale, t∈[1,T].
  • The forward recurrent neural network is:

  • h t f =GRU(x t ,h t−1 f) for t∈[1,T/2]
  • and the backward recurrent neural network is:

  • h t b =GRU(x t ,h t+1 b) for t∈[T,T/2]
  • where ht f is an intermediate layer feature representation in the forward recurrent neural network, or may be represented as a first representation result hT/2 f. ht b is an intermediate layer feature representation of the backward recurrent neural network, or may be represented as a second representation result hT/2 b. GRU ( ) is a recurrent gate unit function, and has a specific form of:

  • z t =σ g (W z x t +U z h t−1 +b z)

  • r t =σ g (W r x t +U r h t−1 +b r)

  • h t =z t ∘h t−1+(1−z t)∘σ h (W t x t +U h (r t ∘h t−1)+b h);
  • where σg represents a sigmoid function, and σh represents a hyperbolic tangent function. In addition, Wz, Wr, Wt, Uz, Ur, and Uh are all linear transformation parameter matrices, with different subscripts respectively representing different "gates"; bz, br, and bh are offset parameter vectors; and ∘ represents element-wise multiplication.
  • Therefore, the first representation result and the second representation result may be spliced, to obtain a feature representation result corresponding to a scale, namely,

  • h=[h T/2 f ,h T/2 b].
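  • Putting the gated recurrent unit and the bidirectional compression together, the following NumPy sketch follows the formulas above; the random initialization, the dimensions, and the function names are illustrative assumptions only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Gated recurrent unit following the z_t / r_t / h_t formulas above;
    parameters are randomly initialized purely for illustration."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        def mat(rows, cols):
            return 0.1 * rng.standard_normal((rows, cols))
        self.Wz, self.Uz, self.bz = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wr, self.Ur, self.br = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wt, self.Uh, self.bh = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim), np.zeros(hidden_dim)

    def step(self, x_t, h_prev):
        z = sigmoid(self.Wz @ x_t + self.Uz @ h_prev + self.bz)
        r = sigmoid(self.Wr @ x_t + self.Ur @ h_prev + self.br)
        h_tilde = np.tanh(self.Wt @ x_t + self.Uh @ (r * h_prev) + self.bh)
        return z * h_prev + (1.0 - z) * h_tilde          # h_t as in the formula above

def bidirectional_compress(seq, fwd, bwd):
    """Run a forward GRU over t = 1..T/2 and a backward GRU over t = T..T/2,
    then splice the two hidden states into h = [h_{T/2}^f, h_{T/2}^b]."""
    T, half = len(seq), len(seq) // 2
    h_f = np.zeros(fwd.bz.shape[0])
    for t in range(half):                                # 0-based indices 0 .. T/2-1
        h_f = fwd.step(seq[t], h_f)
    h_b = np.zeros(bwd.bz.shape[0])
    for t in range(T - 1, half - 2, -1):                 # 0-based indices T-1 down to T/2-1
        h_b = bwd.step(seq[t], h_b)
    return np.concatenate([h_f, h_b])

# Usage: compress a 20-step sequence of 2048-d features into one 512-d vector.
seq = np.random.rand(20, 2048)
h = bidirectional_compress(seq, GRUCell(2048, 256), GRUCell(2048, 256))
print(h.shape)   # (512,)
```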
  • Further, in one embodiment of the present disclosure, how to calculate the feature representation result corresponding to each video frame feature sequence according to the first representation result and the second representation result is specifically described. In the foregoing manner, the prediction result can be obtained by calculation by using related formulas, to provide feasible manners for implementation of the solution, thereby improving the feasibility and operability of the solution.
  • Optionally, in one embodiment of the present disclosure, the processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence may include: inputting the feature representation result corresponding to the at least one video frame feature sequence into a first sub-model in the second neural network model, to obtain a third representation result; inputting the feature representation result corresponding to the at least one video frame feature sequence into a second sub-model in the second neural network model, to obtain a fourth representation result; and calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result.
  • In one embodiment, how to use the second neural network model to obtain the prediction result corresponding to each video frame feature sequence is described below.
  • Specifically, FIG. 6 is a schematic structural diagram of a second neural network model in an embodiment of the present disclosure. As shown in FIG. 6, the entire second neural network model includes two parts: respectively a first sub-model and a second sub-model. The first sub-model may also be referred to as “gate representation”, and the second sub-model may also be referred to as “activation representation”. A feature representation result corresponding to each video frame feature sequence is input to the “gate representation”, and then a corresponding third representation result is outputted. Meanwhile, a feature representation result corresponding to each video frame feature sequence is input to the “activation representation”, and then a corresponding fourth representation result is outputted.
  • Each third representation result is multiplied by each fourth representation result, and then addition is performed, to obtain a prediction result of the video frame feature sequence.
  • Secondly, in one embodiment of the present disclosure, after the feature representation result is obtained by using the first neural network model, the second neural network model may be further used to classify the feature representation result. In the foregoing manner, non-linear transformation may be performed on the feature representation result to obtain gate representation and activation representation respectively, and then a multiplication operation is performed on the two paths of representations and addition is performed, to obtain a final feature representation for classification, thereby facilitating improvement of the classification accuracy.
  • Optionally, in one embodiment of the present disclosure, the calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result may include: calculating the prediction result corresponding to the at least one video frame feature sequence by using the following formulas:
  • label=Σ n=1 N g n a n ; g n =σ g (W g h+b g ), n∈[1,N]; a n =σ a (W a h+b a ), n∈[1,N];
  • where label represents a prediction result of a video frame feature sequence, gn represents the third representation result, an represents the fourth representation result, σg represents a softmax function, σa represents a sigmoid function, h represents the feature representation result of the video frame feature sequence, Wg and bg represent parameters in the first sub-model, Wa and ba represent parameters of the second sub-model, N represents the total number of calculations obtained after non-linear transformation is performed on the feature representation result, and n represents an integer in a range of 1 to N.
  • In one embodiment, how to use corresponding formulas to calculate the prediction result corresponding to each video frame feature sequence is specifically described below.
  • First, N paths of gate representation and activation representation are obtained by performing non-linear transformation on the feature representation result, and then a third representation result gn corresponding to the gate representation is calculated, and a fourth representation result an corresponding to the activation representation is calculated. The order in which the third representation result gn and the fourth representation result an are calculated is not limited herein.
  • After the two paths of representations are obtained, a multiplication operation is performed, and then an addition operation is performed, to obtain a prediction result of a video frame feature sequence.
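  • The following NumPy sketch is one possible reading of these formulas, with N parallel gate and activation transformations of h; the parameter shapes chosen here (one weight matrix per path and per category) are an assumption made for illustration, not the exact parameterization of the disclosure.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_activation_predict(h, Wg, bg, Wa, ba):
    """Gate/activation classifier sketch: N non-linear transformations of h give
    a gate path (softmax over the N paths) and an activation path (sigmoid);
    they are multiplied per path and summed into the prediction result.

    Assumed shapes: Wg, Wa -> (N, C, D); bg, ba -> (N, C); h -> (D,),
    for N paths, C categories, and a D-dimensional feature representation."""
    g = softmax(np.einsum('ncd,d->nc', Wg, h) + bg, axis=0)   # gate representation
    a = sigmoid(np.einsum('ncd,d->nc', Wa, h) + ba)           # activation representation
    return (g * a).sum(axis=0)                                # per-category prediction

# Usage with random parameters: N = 4 paths, C = 5 categories, D = 512 features.
rng = np.random.default_rng(0)
h = rng.standard_normal(512)
pred = gate_activation_predict(
    h,
    rng.standard_normal((4, 5, 512)), rng.standard_normal((4, 5)),
    rng.standard_normal((4, 5, 512)), rng.standard_normal((4, 5)))
print(pred.shape)   # (5,)
```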
  • Further, in one embodiment of the present disclosure, how to calculate the prediction result corresponding to each video frame feature sequence according to the third representation result and the fourth representation result is specifically described. In the foregoing manner, the prediction result can be obtained by calculation by using related formulas, to provide feasible manners for implementation of the solution, thereby improving the feasibility and operability of the solution.
  • Optionally, in one embodiment of the present disclosure, after the processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence, the method may further include: calculating the category of the to-be-processed video according to the prediction result corresponding to the at least one video frame feature sequence and a weight value corresponding to the at least one video frame feature sequence; and classifying the to-be-processed video according to the category of the to-be-processed video.
  • In one embodiment, the server may further calculate the category of the to-be-processed video according to the prediction result corresponding to each video frame feature sequence, and a weight value corresponding to each video frame feature sequence, and classify the to-be-processed video according to a classification result.
  • Specifically, it is assumed that there are at most five prediction results, and that a prediction result is indicated by a "0 and 1" code with a length of 5. For example, the code of prediction result 1 is 00001, and the code of prediction result 3 is 00100; by analogy, if a to-be-processed video includes both prediction result 1 and prediction result 3, the to-be-processed video is indicated as 00101.
  • However, for the entire to-be-processed video, a prediction result corresponding to each video frame feature sequence is obtained, and therefore each prediction result is not greater than 1, and the prediction result may indicate a possibility that the to-be-processed video belongs to the category. For example, {0.01, 0.02, 0.9, 0.005, 1.0} is a reasonable prediction result, and it means that a probability that the to-be-processed video belongs to the first category is 1.0, namely, 100%, a probability that the to-be-processed video belongs to the second category is 0.005, namely, 0.5%, a probability that the to-be-processed video belongs to the third category is 0.9, namely, 90%, a probability that the to-be-processed video belongs to the fourth category is 0.02, namely, 2%, and a probability that the to-be-processed video belongs to the fifth category is 0.01, namely, 1%.
  • In this case, the prediction results are combined with preset weight values, and the calculation may be performed by using a weighted algorithm. Each weight value is learned by using linear regression and indicates the importance of the corresponding video frame feature sequence, and the sum of the weight values is 1, for example, {0.1, 0.4, 0.5}. How to calculate the category of the to-be-processed video is specifically described below.
  • If the weight value is {0.2, 0.3, 0.5}, the prediction result of the video frame feature sequence 1 is {0.01, 0.02, 0.9, 0.005, 1.0}, the prediction result of the video frame feature sequence 2 is {0.02, 0.01, 0.9, 0.000, 0.9}, and the prediction result of the video frame feature sequence 3 is {0.2, 0.3, 0.8, 0.01, 0.7}, the category of the to-be-processed video is indicated as:

  • {0.2×0.01+0.3×0.02+0.5×0.2, 0.2×0.02+0.3×0.01+0.5×0.3, 0.2×0.9+0.3×0.9+0.5×0.8, 0.2×0.005+0.3×0.000+0.5×0.01, 0.2×1.0+0.3×0.9+0.5×0.7}={0.108, 0.157, 0.85, 0.006, 0.82}
  • It can be understood from the result of the foregoing formula that, the probability that the to-be-processed video belongs to the third category is largest, and the probability that the to-be-processed video belongs to the first category is the second largest. Therefore, the to-be-processed video is displayed in a video list of the third category in priority.
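  • The same weighted fusion can be reproduced with a few lines of NumPy, using the weight values and prediction results from the example above:

```python
import numpy as np

weights = np.array([0.2, 0.3, 0.5])                # learned weight values, summing to 1
preds = np.array([[0.01, 0.02, 0.9, 0.005, 1.0],   # video frame feature sequence 1
                  [0.02, 0.01, 0.9, 0.000, 0.9],   # video frame feature sequence 2
                  [0.20, 0.30, 0.8, 0.010, 0.7]])  # video frame feature sequence 3
category_scores = weights @ preds                  # weighted sum over the three scales
print(category_scores)                             # ≈ [0.108, 0.157, 0.85, 0.006, 0.82]
```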
  • Further, in one embodiment of the present disclosure, after obtaining the prediction result corresponding to each video frame feature sequence, the server may further calculate the category of the to-be-processed video according to the prediction result corresponding to each video frame feature sequence and the weight value corresponding to each video frame feature sequence, and finally classify the to-be-processed video according to the category of the to-be-processed video. In the foregoing manner, because the prediction result refers to the time feature, when the to-be-processed video is analyzed, the video classification capability can be improved, to implement personalized recommendation, and facilitate better practicability.
  • The following describes the server in the present disclosure in detail. FIG. 7 is a schematic diagram of an embodiment of a server in an embodiment of the present disclosure. As shown in FIG. 7, the server 20 includes: a first obtaining module 201, a second obtaining module 202, a first input module 203, and a second input module 204.
  • The first obtaining module 201 is configured to obtain a to-be-processed video, the to-be-processed video including a plurality of video frames, and each video frame corresponding to a time feature.
  • The second obtaining module 202 is configured to: sample, according to a time-feature sampling rule, the to-be-processed video obtained by the first obtaining module 201, and obtain at least one video frame feature sequence, the time-feature sampling rule being a correspondence between a time feature and a video frame feature sequence.
  • The first input module 203 is configured to process, by using a first neural network model, the at least one video frame feature sequence obtained by the second obtaining module 202, to obtain a feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model.
  • The second input module 204 is configured to process, by using a second neural network model, the feature representation result that corresponds to the at least one video frame feature sequence and that is obtained by the first input module 203, to obtain a prediction result corresponding to the at least one video frame feature sequence, the prediction result being used to determine a category of the to-be-processed video.
  • In one embodiment, the first obtaining module 201 obtains the to-be-processed video, the to-be-processed video including a plurality of video frames, and each video frame corresponding to a time feature; the second obtaining module 202 samples, according to the time-feature sampling rule, the to-be-processed video obtained by the first obtaining module 201, and obtains the at least one video frame feature sequence, the time-feature sampling rule being the correspondence between the time feature and the video frame feature sequence. The first input module 203 processes, by using the first neural network model, the at least one video frame feature sequence obtained by the second obtaining module 202, to obtain the feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model. The second input module 204 processes, by using the second neural network model, the feature representation result corresponding to the at least one video frame feature sequence obtained by the first input module 203, to obtain the prediction result corresponding to the at least one video frame feature sequence, the prediction result being used to determine the category of the to-be-processed video.
  • In an embodiment of the present disclosure, a server is provided. First, the server obtains the to-be-processed video, the to-be-processed video including a plurality of video frames, and each video frame corresponding to a time feature, and then samples the to-be-processed video according to the time-feature sampling rule, and obtains the at least one video frame feature sequence, the time-feature sampling rule being the correspondence between the time feature and the video frame feature sequence. The server then inputs the at least one video frame feature sequence into the first neural network model, to obtain the feature representation result corresponding to each video frame feature sequence. Finally, the server inputs the feature representation result corresponding to each video frame feature sequence into the second neural network model, to obtain the prediction result corresponding to each video frame feature sequence, the prediction result being used to determine the category of the to-be-processed video. In the foregoing manner, in a process of classifying a video, a feature change of the video in a time dimension is also considered, so that video content can be better represented, the accuracy of video classification is improved, and the effect of video classification is improved.
  • Optionally, in one embodiment of the present disclosure, the server 20 further includes a processing module 205 and a determining module 206.
  • The processing module 205 is configured to process each video frame in the to-be-processed video by using a convolutional neural network CNN, to obtain a time feature corresponding to each video frame after the first obtaining module 201 obtains the to-be-processed video.
  • The determining module 206 is configured to determine a time feature sequence of the to-be-processed video according to the time feature corresponding to each video frame processed by the processing module 205, the time feature sequence being configured for sampling.
  • In one embodiment of the present disclosure, after obtaining the to-be-processed video, the server may further process each video frame in the to-be-processed video by using a convolutional neural network, and obtain a time feature corresponding to each video frame. The time features are used to form a time feature sequence of the entire to-be-processed video. In the foregoing manner, each video frame is trained and processed by using the convolutional neural network, to facilitate improvement of the accuracy and effect of time feature extraction.
  • Optionally, referring to FIG. 9, in another embodiment of the server 20 provided in the embodiments of the present disclosure, the second obtaining module 202 includes: a determining unit 2021 configured to determine at least one time-window according to the time-feature sampling rule, each time-window including at least one video frame of the to-be-processed video; and an extraction unit 2022 configured to extract, from the time feature sequence, a video frame feature sequence corresponding to each time-window determined by the determining unit 2021.
  • Further, in one embodiment of the present disclosure, a method for extracting video frame feature sequences in different scales is described. That is, at least one time-window is first determined according to the time-feature sampling rule, and each time-window includes at least one video frame in the to-be-processed video, and then a video frame feature sequence corresponding to each time-window is extracted from the time feature sequence. In the foregoing manner, video frame feature sequences in different scales can be obtained, to obtain a plurality of different samples for feature training. In this way, the accuracy of a video classification result is improved.
  • Optionally, referring to FIG. 10, in another embodiment of the server 20 provided in the embodiments of the present disclosure, the first input module 203 includes: a first obtaining unit 2031, a second obtaining unit 2032, and a first calculation unit 2033.
  • The first obtaining unit 2031 is configured to input the at least one video frame feature sequence into a forward recurrent neural network in the first neural network model, to obtain a first representation result.
  • The second obtaining unit 2032 is configured to input each video frame feature sequence into a backward recurrent neural network in the first neural network model, to obtain a second representation result.
  • The first calculation unit 2033 is configured to calculate a feature representation result corresponding to the at least one video frame feature sequence according to the first representation result obtained by the first obtaining unit 2031 and the second representation result obtained by the second obtaining unit 2032.
  • Secondly, in one embodiment of the present disclosure, based on extraction of the video frame feature sequence, time sequence modeling may be performed on the video frame feature sequence by using a recurrent gate unit based recurrent neural network. Further, to better perform feature representation on information of different time scales, in this solution, a first neural network model can also be used to perform video feature compression. In the foregoing manner, for the recurrent neural network, because main content of most videos occurs in the middle of video time, a bidirectional recurrent neural network is used to perform feature compression and representation respectively from forward and backward directions toward a time center location of the to-be-processed video. In this way, the operability of the solution is improved.
  • Optionally, referring to FIG. 11, in another embodiment of the server 20 provided in the embodiments of the present disclosure, the first calculation unit 2033 includes: a first calculation subunit 20331, configured to calculate the feature representation result corresponding to the at least one video frame feature sequence by using the following formulas:

  • h=[h T/2 f ,h T/2 b];

  • h t f =GRU(x t ,h t−1 f) for t∈[1,T/2];

  • h t b =GRU(x t ,h t+1 b) for t∈[T,T/2]
  • where h represents a feature representation result of a video frame feature sequence, hT/2 f represents the first representation result, hT/2 b represents the second representation result, xt represents the video frame feature sequence at a tth moment, GRU ( ) represents use of a gated recurrent unit GRU for neural network processing, T represents total time of the to-be-processed video, and t represents an integer in a range of 1 to T.
  • Further, in one embodiment of the present disclosure, how to calculate the feature representation result corresponding to each video frame feature sequence according to the first representation result and the second representation result is specifically described. In the foregoing manner, the prediction result can be obtained by calculation by using related formulas, to provide feasible manners for implementation of the solution, thereby improving the feasibility and operability of the solution.
  • Optionally, referring to FIG. 12, in another embodiment of the server 20 provided in the embodiments of the present disclosure, the second input module 204 includes: a third obtaining unit 2041, a fourth obtaining unit 2042, and a second calculation unit 2043.
  • The third obtaining unit 2041 is configured to input the feature representation result corresponding to each video frame feature sequence into a first sub-model in the second neural network model, to obtain a third representation result.
  • The fourth obtaining unit 2042 is configured to input the feature representation result corresponding to each video frame feature sequence into a second sub-model in the second neural network model, to obtain a fourth representation result.
  • The second calculation unit 2043 is configured to calculate a prediction result corresponding to each video frame feature sequence according to the third representation result obtained by the third obtaining unit 2041 and the fourth representation result obtained by the fourth obtaining unit 2042.
  • Secondly, in one embodiment of the present disclosure, after the feature representation result is obtained by using the first neural network model, the second neural network model may further be used to classify the feature representation result. In the foregoing manner, non-linear transformations are performed on the feature representation result to obtain a gate representation and an activation representation, the two representations are multiplied, and the products are summed to obtain a final feature representation for classification, thereby helping improve the classification accuracy.
  • Optionally, referring to FIG. 13, in another embodiment of the server 20 provided in the embodiments of the present disclosure, the second calculation unit 2043 includes a second calculation subunit 20431, configured to calculate the prediction result corresponding to each video frame feature sequence by using the following formulas:
  • label = Σ_{n=1}^{N} g_n · a_n;
  • g_n = σ_g(W_g h + b_g), n ∈ [1, N];
  • a_n = σ_a(W_a h + b_a), n ∈ [1, N]
  • where label represents the prediction result of a video frame feature sequence, g_n represents the third representation result, a_n represents the fourth representation result, σ_g represents a softmax function, σ_a represents a sigmoid function, h represents the feature representation result of the video frame feature sequence, W_g and b_g represent parameters of the first sub-model, W_a and b_a represent parameters of the second sub-model, N represents the total number of terms obtained after the non-linear transformation is performed on the feature representation result, and n is an integer in the range 1 to N.
  • Further, in one embodiment of the present disclosure, how to calculate the prediction result corresponding to each video frame feature sequence according to the third representation result and the fourth representation result is specifically described. In the foregoing manner, the prediction result can be obtained by calculation by using related formulas, to provide feasible manners for implementation of the solution, thereby improving the feasibility and operability of the solution.
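  • As a hedged illustration of the gate/activation formulas above, the following PyTorch sketch (the class name GatedClassifier, the number of branches N, and the class count are illustrative assumptions) computes N softmax gates and N sigmoid activations from h, multiplies them pairwise, and sums the products into the prediction:

    import torch
    import torch.nn as nn

    class GatedClassifier(nn.Module):
        """Sketch of the second neural network model: N parallel gate /
        activation branches whose products are summed into the prediction."""

        def __init__(self, repr_dim=1024, num_classes=400, num_branches=4):
            super().__init__()
            self.gates = nn.ModuleList(
                [nn.Linear(repr_dim, num_classes) for _ in range(num_branches)])
            self.acts = nn.ModuleList(
                [nn.Linear(repr_dim, num_classes) for _ in range(num_branches)])

        def forward(self, h):                          # h: (batch, repr_dim)
            label = 0
            for gate, act in zip(self.gates, self.acts):
                g_n = torch.softmax(gate(h), dim=1)    # sigma_g: softmax gate
                a_n = torch.sigmoid(act(h))            # sigma_a: sigmoid activation
                label = label + g_n * a_n              # multiply, then accumulate
            return label                               # per-category prediction

    # Example: classify a batch of 2 compressed representations
    h = torch.randn(2, 1024)
    print(GatedClassifier()(h).shape)  # torch.Size([2, 400])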
  • Optionally, referring to FIG. 14, in another embodiment of the server 20 provided in the embodiments of the present disclosure, the server 20 further includes a calculation module 207 and a classification module 208.
  • The calculation module 207 is configured to: after the second input module 204 processes, by using the second neural network model, the feature representation result corresponding to the at least one video frame feature sequence to obtain the corresponding prediction result, calculate the category of the to-be-processed video according to the prediction result corresponding to the at least one video frame feature sequence and a weight value corresponding to the at least one video frame feature sequence. The classification module 208 is configured to classify the to-be-processed video according to the category of the to-be-processed video calculated by the calculation module 207.
  • Further, in one embodiment of the present disclosure, after obtaining the prediction result corresponding to each video frame feature sequence, the server may further calculate the category of the to-be-processed video according to the prediction result corresponding to each video frame feature sequence and the weight value corresponding to each video frame feature sequence, and finally classify the to-be-processed video according to that category, as sketched below. In the foregoing manner, because the prediction result takes the time feature into account, the video classification capability when analyzing the to-be-processed video is improved, which enables personalized recommendation and improves practicability.
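  • A minimal sketch of this weighted aggregation is shown below; the disclosure only requires a weight value per video frame feature sequence, so the equal weights used in the example are an illustrative assumption:

    import torch

    def classify_video(predictions, weights):
        # Combine the prediction of each video frame feature sequence with its
        # weight value and pick the highest-scoring category per video.
        combined = sum(w * p for w, p in zip(weights, predictions))
        return combined.argmax(dim=1)   # category index per video in the batch

    # Example: three scales, batch of 2 videos, 400 candidate categories
    preds = [torch.rand(2, 400) for _ in range(3)]
    weights = [1.0 / 3] * 3
    print(classify_video(preds, weights))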
  • FIG. 15 is a schematic structural diagram of a server according to an embodiment of the present disclosure. The server 300 may vary greatly depending on configuration or performance, and may include one or more central processing units (CPU) 322 (for example, one or more processors), a memory 332, and one or more storage media 330 (for example, one or more mass storage devices) that store an application program 342 or data 344. The memory 332 and the storage medium 330 may be transitory or persistent storage. The program stored in the storage medium 330 may include one or more modules (not marked in the figure), and each module may include a series of instructions and operations for the server. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and perform, on the server 300, the series of instructions and operations in the storage medium 330.
  • The server 300 may further include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, for example, Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
  • In one embodiment of the present disclosure, the CPU 322 included in the server has the following functions: obtaining a to-be-processed video, the to-be-processed video including a plurality of video frames, and each video frame corresponding to a time feature; sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence, the time-feature sampling rule being a correspondence between a time feature and a video frame feature sequence; processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model; and processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence, the prediction result being used to determine a category of the to-be-processed video.
  • Optionally, the CPU 322 is further configured to execute the following operations: processing each video frame in the to-be-processed video by using a convolutional neural network CNN, to obtain a time feature corresponding to each video frame; and determining a time feature sequence of the to-be-processed video according to the time feature corresponding to each video frame, the time feature sequence being configured for sampling.
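  • As a hedged illustration of the per-frame CNN feature extraction, the sketch below uses a torchvision ResNet-50 backbone; the choice of backbone, the 224×224 frame resolution, and the 2048-dimensional output are illustrative assumptions, since the disclosure only requires a convolutional neural network CNN:

    import torch
    import torch.nn as nn
    from torchvision import models

    # Illustrative backbone: drop the final classification layer so the
    # network outputs one pooled feature vector per frame.
    backbone = nn.Sequential(*list(models.resnet50().children())[:-1])
    backbone.eval()

    def frame_time_features(frames: torch.Tensor) -> torch.Tensor:
        """Map (T, 3, 224, 224) video frames to a (T, 2048) time feature
        sequence, one feature vector per frame."""
        with torch.no_grad():
            feats = backbone(frames)          # (T, 2048, 1, 1)
        return feats.flatten(1)               # (T, 2048)

    video = torch.randn(16, 3, 224, 224)      # 16 sampled frames
    print(frame_time_features(video).shape)   # torch.Size([16, 2048])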
  • Optionally, the CPU 322 is specifically configured to execute the following operations: determining at least one time-window according to the time-feature sampling rule, each time-window including at least one video frame of the to-be-processed video; and extracting, from the time feature sequence, a video frame feature sequence corresponding to each time-window.
  • Optionally, the CPU 322 is specifically configured to execute the following operations: inputting the at least one video frame feature sequence into a forward recurrent neural network in the first neural network model, to obtain a first representation result; inputting the at least one video frame feature sequence into a backward recurrent neural network in the first neural network model, to obtain a second representation result; and calculating a feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result.
  • Optionally, the CPU 322 is specifically configured to execute the following step: calculating the feature representation result corresponding to the at least one video frame feature sequence by using the following formulas:

  • h = [h_{T/2}^f, h_{T/2}^b];
  • h_t^f = GRU(x_t, h_{t−1}^f) for t ∈ [1, T/2];
  • h_t^b = GRU(x_t, h_{t+1}^b) for t ∈ [T/2, T]
  • where h represents the feature representation result of a video frame feature sequence, h_{T/2}^f represents the first representation result, h_{T/2}^b represents the second representation result, x_t represents the video frame feature sequence at the t-th moment, GRU(·) represents processing by a gated recurrent unit (GRU), T represents the total time of the to-be-processed video, and t is an integer in the range 1 to T.
  • Optionally, the CPU 322 is specifically configured to execute the following operations: inputting the feature representation result corresponding to the at least one video frame feature sequence into a first sub-model in the second neural network model, to obtain a third representation result; inputting the feature representation result corresponding to the at least one video frame feature sequence into a second sub-model in the second neural network model, to obtain a fourth representation result; and calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result.
  • Optionally, the CPU 322 is specifically configured to execute the following step: calculating the prediction result corresponding to the at least one video frame feature sequence by using the following formulas:
  • label = Σ_{n=1}^{N} g_n · a_n;
  • g_n = σ_g(W_g h + b_g), n ∈ [1, N];
  • a_n = σ_a(W_a h + b_a), n ∈ [1, N]
  • where label represents the prediction result of a video frame feature sequence, g_n represents the third representation result, a_n represents the fourth representation result, σ_g represents a softmax function, σ_a represents a sigmoid function, h represents the feature representation result of the video frame feature sequence, W_g and b_g represent parameters of the first sub-model, W_a and b_a represent parameters of the second sub-model, N represents the total number of terms obtained after the non-linear transformation is performed on the feature representation result, and n is an integer in the range 1 to N.
  • Optionally, the CPU 322 is further configured to execute the following operations: calculating the category of the to-be-processed video according to the prediction result corresponding to the at least one video frame feature sequence and a weight value corresponding to the at least one video frame feature sequence; and classifying the to-be-processed video according to the category of the to-be-processed video.
  • An embodiment of the present disclosure further provides a storage medium, for storing program code, the program code being configured to execute any implementation of the information processing method according to the foregoing embodiments.
  • In the foregoing embodiments, implementation may be entirely or partially performed by using software, hardware, firmware or any combination thereof. When software is used for implementation, implementation may be entirely or partially performed in the form of a computer program product.
  • The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or some of the processes or functions according to the embodiments of the present disclosure are produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a digital versatile disc (DVD)), a semiconductor medium (such as a solid-state drive (SSD)), or the like.
  • It may be understood by persons skilled in the art that for convenience and brevity of description, for a detailed working process of the foregoing system, apparatus, and unit, reference may be made to a corresponding process in the foregoing method embodiments, and details are not described herein again.
  • In the several embodiments provided in the present disclosure, it is to be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely an example. For example, the unit division is merely logical function division and may be other division during actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
  • The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some of or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit. The integrated units may be implemented in a form of hardware or may be implemented in a form of a software functional unit.
  • When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, the integrated unit may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present disclosure essentially, or the part contributing to the prior art, or all or some of the technical solutions may be implemented in a form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The foregoing storage medium includes: any medium that can store program code, such as a USB flash memory drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disc.
  • In conclusion, the foregoing embodiments are merely intended for describing the technical solutions of the present disclosure, but not for limiting the present disclosure. Although the present disclosure is described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they may still make modifications to the technical solutions described in the foregoing embodiments or make equivalent replacements to some technical features thereof, without departing from the spirit and scope of the technical solutions of the embodiments of the present disclosure.

Claims (20)

What is claimed is:
1. A video classification method for a computer device, comprising:
obtaining a to-be-processed video, the to-be-processed video having a plurality of video frames, and each video frame corresponding to one time feature;
sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence, the time-feature sampling rule being a correspondence between time features and video frame feature sequences;
processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model; and
processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence.
2. The method according to claim 1, further comprising:
determining a category of the to-be-processed video according to the prediction result corresponding to the at least one video frame feature sequence.
3. The method according to claim 2, wherein, after the obtaining a to-be-processed video, the method further comprises:
processing each video frame in the to-be-processed video by using a convolutional neural network CNN, to obtain a time feature corresponding to each video frame; and
determining the at least one time feature sequence of the to-be-processed video according to the time feature corresponding to each video frame, the time feature sequence being configured for sampling.
4. The method according to claim 3, wherein the sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence comprises:
determining at least one time-window according to the time-feature sampling rule, each time-window comprising at least one video frame of the to-be-processed video; and
extracting, from the time feature sequence, a video frame feature sequence corresponding to each time-window.
5. The method according to claim 2, wherein the processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to each video frame feature sequence comprises:
inputting the at least one video frame feature sequence into a forward recurrent neural network in the first neural network model, to obtain a first representation result;
inputting the at least one video frame feature sequence into a backward recurrent neural network in the first neural network model, to obtain a second representation result; and
calculating the feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result.
6. The method according to claim 5, wherein the calculating the feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result comprises:
calculating the feature representation result corresponding to the at least one video frame feature sequence by using the following formulas:

h = [h_{T/2}^f, h_{T/2}^b];
h_t^f = GRU(x_t, h_{t−1}^f) for t ∈ [1, T/2];
h_t^b = GRU(x_t, h_{t+1}^b) for t ∈ [T/2, T]
wherein h represents a feature representation result of a video frame feature sequence, h_{T/2}^f represents the first representation result, h_{T/2}^b represents the second representation result, x_t represents the video frame feature sequence at a t-th moment, GRU(·) represents use of a gated recurrent unit GRU for neural network processing, T represents total time of the to-be-processed video, and t represents an integer in a range of 1 to T.
7. The method according to claim 2, wherein the processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence comprises:
inputting the feature representation result corresponding to the at least one video frame feature sequence into a first sub-model in the second neural network model, to obtain a third representation result;
inputting the feature representation result corresponding to the at least one video frame feature sequence into a second sub-model in the second neural network model, to obtain a fourth representation result; and
calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result.
8. The method according to claim 7, wherein the calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result comprises:
calculating the prediction result corresponding to the at least one video frame feature sequence by using the following formulas:
label = Σ_{n=1}^{N} g_n · a_n;
g_n = σ_g(W_g h + b_g), n ∈ [1, N];
a_n = σ_a(W_a h + b_a), n ∈ [1, N]
wherein label represents a prediction result of a video frame feature sequence, g_n represents the third representation result, a_n represents the fourth representation result, σ_g represents a softmax function, σ_a represents a sigmoid function, h represents the feature representation result of the video frame feature sequence, W_g and b_g represent parameters in the first sub-model, W_a and b_a represent parameters of the second sub-model, N represents a total number of terms obtained after non-linear transformation is performed on the feature representation result, and n represents an integer in a range of 1 to N.
9. The method according to claim 2, wherein after the processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence, the method further comprises:
calculating the category of the to-be-processed video according to the prediction result corresponding to the at least one video frame feature sequence and a weight value corresponding to the at least one video frame feature sequence; and
classifying the to-be-processed video according to the category of the to-be-processed video.
10. A non-transitory computer-readable storage medium storing computer program instructions executable by at least one processor to perform:
obtaining a to-be-processed video, the to-be-processed video having a plurality of video frames, and each video frame corresponding to one time feature;
sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence, the time-feature sampling rule being a correspondence between time features and video frame feature sequences;
processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model; and
processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence.
11. The non-transitory computer-readable storage medium according to claim 10, wherein the computer program instructions are executable by the processor to further perform:
determining a category of the to-be-processed video according to the prediction result corresponding to the at least one video frame feature sequence.
12. The non-transitory computer-readable storage medium according to claim 11, wherein the computer program instructions are executable by the processor to, after the obtaining a to-be-processed video, further perform:
processing each video frame in the to-be-processed video by using a convolutional neural network CNN, to obtain a time feature corresponding to each video frame; and
determining the at least one time feature sequence of the to-be-processed video according to the time feature corresponding to each video frame, the time feature sequence being configured for sampling.
13. The non-transitory computer-readable storage medium according to claim 12, wherein the sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence comprises:
determining at least one time-window according to the time-feature sampling rule, each time-window comprising at least one video frame of the to-be-processed video; and
extracting, from the time feature sequence, a video frame feature sequence corresponding to each time-window.
14. The non-transitory computer-readable storage medium according to claim 11, wherein the processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to each video frame feature sequence comprises:
inputting the at least one video frame feature sequence into a forward recurrent neural network in the first neural network model, to obtain a first representation result;
inputting the at least one video frame feature sequence into a backward recurrent neural network in the first neural network model, to obtain a second representation result; and
calculating the feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result.
15. The non-transitory computer-readable storage medium according to claim 14, wherein the calculating the feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result comprises:
calculating the feature representation result corresponding to the at least one video frame feature sequence by using the following formulas:

h = [h_{T/2}^f, h_{T/2}^b];
h_t^f = GRU(x_t, h_{t−1}^f) for t ∈ [1, T/2];
h_t^b = GRU(x_t, h_{t+1}^b) for t ∈ [T/2, T]
wherein h represents a feature representation result of a video frame feature sequence, h_{T/2}^f represents the first representation result, h_{T/2}^b represents the second representation result, x_t represents the video frame feature sequence at a t-th moment, GRU(·) represents use of a gated recurrent unit GRU for neural network processing, T represents total time of the to-be-processed video, and t represents an integer in a range of 1 to T.
16. The non-transitory computer-readable storage medium according to claim 11, wherein the processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence comprises:
inputting the feature representation result corresponding to the at least one video frame feature sequence into a first sub-model in the second neural network model, to obtain a third representation result;
inputting the feature representation result corresponding to the at least one video frame feature sequence into a second sub-model in the second neural network model, to obtain a fourth representation result; and
calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result.
17. The non-transitory computer-readable storage medium according to claim 16, wherein the calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result comprises:
calculating the prediction result corresponding to the at least one video frame feature sequence by using the following formulas:
label = Σ_{n=1}^{N} g_n · a_n;
g_n = σ_g(W_g h + b_g), n ∈ [1, N];
a_n = σ_a(W_a h + b_a), n ∈ [1, N]
wherein label represents a prediction result of a video frame feature sequence, g_n represents the third representation result, a_n represents the fourth representation result, σ_g represents a softmax function, σ_a represents a sigmoid function, h represents the feature representation result of the video frame feature sequence, W_g and b_g represent parameters in the first sub-model, W_a and b_a represent parameters of the second sub-model, N represents a total number of terms obtained after non-linear transformation is performed on the feature representation result, and n represents an integer in a range of 1 to N.
18. The non-transitory computer-readable storage medium according to claim 11, wherein after the processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence, the computer program instructions are executable by the processor to further perform:
calculating the category of the to-be-processed video according to the prediction result corresponding to the at least one video frame feature sequence and a weight value corresponding to the at least one video frame feature sequence; and
classifying the to-be-processed video according to the category of the to-be-processed video.
19. A server, comprising:
a memory storing computer program instructions; and
a processor coupled to the memory and, when executing the computer program instructions, configured to perform:
obtaining a to-be-processed video, the to-be-processed video having a plurality of video frames, and each video frame corresponding to one time feature;
sampling the to-be-processed video according to a time-feature sampling rule, and obtaining at least one video frame feature sequence, the time-feature sampling rule being a correspondence between time features and video frame feature sequences;
processing the at least one video frame feature sequence by using a first neural network model, to obtain a feature representation result corresponding to the at least one video frame feature sequence, the first neural network model being a recurrent neural network model; and
processing the feature representation result corresponding to the at least one video frame feature sequence by using a second neural network model, to obtain a prediction result corresponding to the at least one video frame feature sequence.
20. The server according to claim 19, wherein the processor is further configured to perform:
determining a category of the to-be-processed video according to the prediction result corresponding to the at least one video frame feature sequence.
US16/558,015 2017-09-15 2019-08-30 Video classification method, information processing method, and server Active 2038-10-20 US10956748B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201710833668.8 2017-09-15
CN201710833668.8A CN109508584B (en) 2017-09-15 2017-09-15 Video classification method, information processing method and server
PCT/CN2018/100733 WO2019052301A1 (en) 2017-09-15 2018-08-16 Video classification method, information processing method and server

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/100733 Continuation WO2019052301A1 (en) 2017-09-15 2018-08-16 Video classification method, information processing method and server

Publications (2)

Publication Number Publication Date
US20190384985A1 true US20190384985A1 (en) 2019-12-19
US10956748B2 US10956748B2 (en) 2021-03-23

Family

ID=65723493

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/558,015 Active 2038-10-20 US10956748B2 (en) 2017-09-15 2019-08-30 Video classification method, information processing method, and server

Country Status (7)

Country Link
US (1) US10956748B2 (en)
EP (1) EP3683723A4 (en)
JP (1) JP7127120B2 (en)
KR (1) KR102392943B1 (en)
CN (2) CN109508584B (en)
MA (1) MA50252A (en)
WO (1) WO2019052301A1 (en)


Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7352369B2 (en) * 2019-03-29 2023-09-28 株式会社日立システムズ Predictive model evaluation system, predictive model evaluation method
CN111782734B (en) * 2019-04-04 2024-04-12 华为技术服务有限公司 Data compression and decompression method and device
CN110162669B (en) * 2019-04-04 2021-07-02 腾讯科技(深圳)有限公司 Video classification processing method and device, computer equipment and storage medium
KR102255312B1 (en) * 2019-06-07 2021-05-25 국방과학연구소 Codec classification system using recurrent neural network and methods thereof
CN110263216B (en) * 2019-06-13 2022-01-28 腾讯科技(深圳)有限公司 Video classification method, video classification model training method and device
CN113010735B (en) * 2019-12-20 2024-03-08 北京金山云网络技术有限公司 Video classification method and device, electronic equipment and storage medium
CN111144508A (en) * 2019-12-30 2020-05-12 中国矿业大学(北京) Automatic control system and control method for coal mine auxiliary shaft rail transportation
CN111190600B (en) * 2019-12-31 2023-09-19 中国银行股份有限公司 Method and system for automatically generating front-end codes based on GRU attention model
CN111428660B (en) * 2020-03-27 2023-04-07 腾讯科技(深圳)有限公司 Video editing method and device, storage medium and electronic device
CN111489378B (en) * 2020-06-28 2020-10-16 腾讯科技(深圳)有限公司 Video frame feature extraction method and device, computer equipment and storage medium
CN111737521B (en) * 2020-08-04 2020-11-24 北京微播易科技股份有限公司 Video classification method and device
CN113349791B (en) * 2021-05-31 2024-07-16 平安科技(深圳)有限公司 Abnormal electrocardiosignal detection method, device, equipment and medium
KR102430989B1 (en) 2021-10-19 2022-08-11 주식회사 노티플러스 Method, device and system for predicting content category based on artificial intelligence

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100656373B1 (en) 2005-12-09 2006-12-11 한국전자통신연구원 Method for discriminating obscene video using priority and classification-policy in time interval and apparatus thereof
US8990132B2 (en) * 2010-01-19 2015-03-24 James Ting-Ho Lo Artificial neural networks based on a low-order model of biological neural networks
CN103544498B (en) * 2013-09-25 2017-02-08 华中科技大学 Video content detection method and video content detection system based on self-adaption sampling
CN104331442A (en) * 2014-10-24 2015-02-04 华为技术有限公司 Video classification method and device
US10762894B2 (en) * 2015-03-27 2020-09-01 Google Llc Convolutional neural networks
JP6556509B2 (en) 2015-06-16 2019-08-07 Cyberdyne株式会社 Photoacoustic imaging apparatus and light source unit
CN104951965B (en) * 2015-06-26 2017-04-19 深圳市腾讯计算机系统有限公司 Advertisement delivery method and device
CN104966104B (en) * 2015-06-30 2018-05-11 山东管理学院 A kind of video classification methods based on Three dimensional convolution neutral net
US9697833B2 (en) * 2015-08-25 2017-07-04 Nuance Communications, Inc. Audio-visual speech recognition with scattering operators
CN106503723A (en) * 2015-09-06 2017-03-15 华为技术有限公司 A kind of video classification methods and device
CN105550699B (en) * 2015-12-08 2019-02-12 北京工业大学 A kind of video identification classification method based on CNN fusion space-time remarkable information
JP6517681B2 (en) 2015-12-17 2019-05-22 日本電信電話株式会社 Image pattern learning apparatus, method and program
US11055537B2 (en) * 2016-04-26 2021-07-06 Disney Enterprises, Inc. Systems and methods for determining actions depicted in media contents based on attention weights of media content frames
CN106131627B (en) * 2016-07-07 2019-03-26 腾讯科技(深圳)有限公司 A kind of method for processing video frequency, apparatus and system
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
CN106779467A (en) * 2016-12-31 2017-05-31 成都数联铭品科技有限公司 Enterprises ' industry categorizing system based on automatic information screening
US11263525B2 (en) * 2017-10-26 2022-03-01 Nvidia Corporation Progressive modification of neural networks
US10334202B1 (en) * 2018-02-28 2019-06-25 Adobe Inc. Ambient audio generation based on visual information
US20190286990A1 (en) * 2018-03-19 2019-09-19 AI Certain, Inc. Deep Learning Apparatus and Method for Predictive Analysis, Classification, and Feature Detection
US10860858B2 (en) * 2018-06-15 2020-12-08 Adobe Inc. Utilizing a trained multi-modal combination model for content and text-based evaluation and distribution of digital video content to client devices
US10418957B1 (en) * 2018-06-29 2019-09-17 Amazon Technologies, Inc. Audio event detection
US10699129B1 (en) * 2019-11-15 2020-06-30 Fudan University System and method for video captioning

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160064A1 (en) * 2018-11-21 2020-05-21 Facebook, Inc. Anticipating Future Video Based on Present Video
US11636681B2 (en) * 2018-11-21 2023-04-25 Meta Platforms, Inc. Anticipating future video based on present video
CN111104930A (en) * 2019-12-31 2020-05-05 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and storage medium
CN111209439A (en) * 2020-01-10 2020-05-29 北京百度网讯科技有限公司 Video clip retrieval method, device, electronic equipment and storage medium
CN111209883A (en) * 2020-01-13 2020-05-29 南京大学 Time sequence self-adaptive video classification method based on multi-source motion feature fusion
CN111259779A (en) * 2020-01-13 2020-06-09 南京大学 Video motion detection method based on central point trajectory prediction
US20220270370A1 (en) * 2020-04-13 2022-08-25 Adobe Inc. Temporally distributed neural networks for video semantic segmentation
US11854206B2 (en) * 2020-04-13 2023-12-26 Adobe Inc. Temporally distributed neural networks for video semantic segmentation
US11354906B2 (en) * 2020-04-13 2022-06-07 Adobe Inc. Temporally distributed neural networks for video semantic segmentation
US20220108184A1 (en) * 2020-10-02 2022-04-07 Robert Bosch Gmbh Method and device for training a machine learning system
CN113204992A (en) * 2021-03-26 2021-08-03 北京达佳互联信息技术有限公司 Video quality determination method and device, storage medium and electronic equipment
CN113204655A (en) * 2021-07-02 2021-08-03 北京搜狐新媒体信息技术有限公司 Multimedia information recommendation method, related device and computer storage medium
CN113779472A (en) * 2021-07-30 2021-12-10 阿里巴巴(中国)有限公司 Content auditing method and device and electronic equipment
CN114443896A (en) * 2022-01-25 2022-05-06 百度在线网络技术(北京)有限公司 Data processing method and method for training a predictive model
CN114611584A (en) * 2022-02-21 2022-06-10 上海市胸科医院 CP-EBUS elastic mode video processing method, device, equipment and medium

Also Published As

Publication number Publication date
JP2020533709A (en) 2020-11-19
US10956748B2 (en) 2021-03-23
CN109508584B (en) 2022-12-02
CN109508584A (en) 2019-03-22
EP3683723A4 (en) 2021-06-23
CN110532996A (en) 2019-12-03
KR102392943B1 (en) 2022-04-29
JP7127120B2 (en) 2022-08-29
CN110532996B (en) 2021-01-22
MA50252A (en) 2020-07-22
WO2019052301A1 (en) 2019-03-21
EP3683723A1 (en) 2020-07-22
KR20190133040A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
US10956748B2 (en) Video classification method, information processing method, and server
CN109919078B (en) Video sequence selection method, model training method and device
CN110347873B (en) Video classification method and device, electronic equipment and storage medium
CN113378784B (en) Training method of video label recommendation model and method for determining video label
US20210240761A1 (en) Method and device for cross-modal information retrieval, and storage medium
US11741711B2 (en) Video classification method and server
WO2020253127A1 (en) Facial feature extraction model training method and apparatus, facial feature extraction method and apparatus, device, and storage medium
CN110083729B (en) Image searching method and system
US10445586B2 (en) Deep learning on image frames to generate a summary
CN113469298B (en) Model training method and resource recommendation method
CN112559800A (en) Method, apparatus, electronic device, medium, and product for processing video
EP4343616A1 (en) Image classification method, model training method, device, storage medium, and computer program
CN114186097A (en) Method and apparatus for training a model
WO2024040869A1 (en) Multi-task model training method, information recommendation method, apparatus, and device
WO2024012289A1 (en) Video generation method and apparatus, electronic device and medium
CN111898658B (en) Image classification method and device and electronic equipment
US20240144656A1 (en) Method, apparatus, device and medium for image processing
US11610606B1 (en) Retiming digital videos utilizing machine learning and temporally varying speeds
US20240029416A1 (en) Method, device, and computer program product for image processing
Kalkhorani et al. Beyond the Frame: Single and mutilple video summarization method with user-defined length
CN116524271A (en) Convolutional neural network training method, device and equipment
CN113438509A (en) Video abstract generation method, device and storage medium
CN116866669A (en) Video recommendation method, apparatus and computer program product
CN118230015A (en) Model construction method, device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
AS Assignment

Owner name: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TANG, YONGYI;MA, LIN;LIU, WEI;SIGNING DATES FROM 20190814 TO 20190819;REEL/FRAME:050227/0808

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4